====== Scanning TLA⁺ Tokens ======
  
This page corresponds to [[https://craftinginterpreters.com/scanning.html|chapter 4]] of //Crafting Interpreters// by Robert Nystrom, henceforth referred to as "the book".
For each section in the chapter, first read the section in the book, then read the corresponding section of the tutorial on this page to see how to adapt the concepts to TLA⁺.
  
===== Section 4.1: The Interpreter Framework =====
  
Almost everything in [[https://craftinginterpreters.com/scanning.html#the-interpreter-framework|section 4.1]] can be left unchanged from what's given, although of course it makes sense to change some of the names from "Lox" to "TlaPlus" or similar.
We do make one small functional modification: TLA⁺ source files are assumed to be encoded in UTF-8, a variable-width ASCII-compatible encoding, so we specify that when performing the file read.
Here are the parts of our main file, ''TlaPlus.java'', that differ from the book; elided parts are marked with comments:
  
<code java>
package tla;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// ... the remaining import, the TlaPlus class declaration, and main() as in the book ...

  private static void runFile(String path) throws IOException {
    byte[] bytes = Files.readAllBytes(Paths.get(path));
    run(new String(bytes, StandardCharsets.UTF_8));

    // Indicate an error in the exit code.
    if (hadError) System.exit(65);
  }

  // ... runPrompt(), run(), and the error reporting methods as in the book ...
}
</code>
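For later reference, the error reporting methods at the bottom of the file are the book's, unchanged apart from the class rename; the scanner calls back into them, and the first challenge below asks you to extend them. A sketch, relying on the ''hadError'' field declared near the top of the class:

<code java>
  // Report an error at a given line; the scanner calls this on unexpected input.
  static void error(int line, String message) {
    report(line, "", message);
  }

  private static void report(int line, String where, String message) {
    System.err.println("[line " + line + "] Error" + where + ": " + message);
    hadError = true;
  }
</code>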
  
===== Section 4.2: Lexemes and Tokens =====
  
The ''TokenType'' class in [[https://craftinginterpreters.com/scanning.html#lexemes-and-tokens|section 4.2]] is our first major departure from the book.
Here are the token types our scanner will emit:
  
<code java>
package tla;

enum TokenType {
  // Single-character tokens.
  LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE,
  LEFT_BRACKET, RIGHT_BRACKET, COMMA, COLON,
  MINUS, PLUS, LESS_THAN, NOT, PRIME,

  // Short fixed-length tokens.
  // ... elided ...

  // Literals.
  IDENTIFIER, NUMBER, TRUE, FALSE,

  // Keywords.
  // ... keyword tokens and EOF elided ...
}
</code>

We also adapt the book's ''Token'' class:
  
<code java>
package tla;

class Token {
  // ... fields, constructor, and toString() elided ...
}
</code>
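The elided body of ''Token'' is essentially the book's. A minimal sketch following the book's layout (if you also track token columns, as the challenges below suggest, add a ''column'' field and constructor parameter alongside ''line''):

<code java>
class Token {
  final TokenType type;
  final String lexeme;
  final Object literal;
  final int line;

  Token(TokenType type, String lexeme, Object literal, int line) {
    this.type = type;
    this.lexeme = lexeme;
    this.literal = literal;
    this.line = line;
  }

  @Override
  public String toString() {
    return type + " " + lexeme + " " + literal;
  }
}
</code>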
  
===== Section 4.4: The Scanner Class =====
  
Nothing in section 4.3 requires modification, so we can move on to the very important ''Scanner'' class in [[https://craftinginterpreters.com/scanning.html#the-scanner-class|section 4.4]].
The skeleton of the class looks like this:
  
<code java>
package tla;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import static tla.TokenType.*;

class Scanner {
  // ... fields and methods, filled in over the rest of the chapter ...
}
</code>
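The elided scaffolding is the same as the book's: the source string, the growing token list, the ''start''/''current''/''line'' bookkeeping fields, and the ''scanTokens()'' loop. A sketch (if you also track columns, add a ''column'' field next to ''line''):

<code java>
class Scanner {
  private final String source;
  private final List<Token> tokens = new ArrayList<>();
  private int start = 0;    // index where the current lexeme begins
  private int current = 0;  // index of the character under consideration
  private int line = 1;     // current line, for error reporting

  Scanner(String source) {
    this.source = source;
  }

  List<Token> scanTokens() {
    while (!isAtEnd()) {
      // We are at the beginning of the next lexeme.
      start = current;
      scanToken();
    }

    tokens.add(new Token(EOF, "", null, line));
    return tokens;
  }

  private boolean isAtEnd() {
    return current >= source.length();
  }
}
</code>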
  
===== Section 4.5: Recognizing Lexemes =====
  
[[https://craftinginterpreters.com/scanning.html#recognizing-lexemes|Section 4.5]] finally gets around to actually handling text input!
Our ''scanToken()'' method follows the book's, with cases for the TLA⁺ single-character tokens we defined above:

<code java>
  private void scanToken() {
    char c = advance();
    switch (c) {
      // ... cases for the other single-character tokens ...
      case '+': addToken(PLUS); break;
      case '<': addToken(LESS_THAN); break;
      case '~': addToken(NOT); break;
      case '\'': addToken(PRIME); break;
      default:
        // ... report an error on unexpected characters, as in the book ...
        break;
    }
  }
</code>
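The elided cases handle the other single-character tokens the same way, and use the book's conditional-lookahead helper ''match()'' for the short fixed-length tokens; for example, TLA⁺ uses ''='' for equality and ''=='' for definitions, so a case along the lines of ''case '=': addToken(match('=') ? EQUAL_EQUAL : EQUAL); break;'' fits here (a sketch; the challenges below mention the ''EQUAL''/''EQUAL_EQUAL'' case blocks). The helper itself carries over from the book unchanged:

<code java>
  // Consume the next character only if it is the one we expect.
  private boolean match(char expected) {
    if (isAtEnd()) return false;
    if (source.charAt(current) != expected) return false;

    current++;
    return true;
  }
</code>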
  
===== Section 4.6: Longer Lexemes =====
  
In [[https://craftinginterpreters.com/scanning.html#longer-lexemes|section 4.6]] we learn how to handle lexemes of longer or unlimited length.
Most of this section carries over directly; the notable change is in ''number()'', which parses TLA⁺'s whole-number literals with ''Integer.parseInt()'' rather than the book's ''Double.parseDouble()'':

<code java>
  private void number() {
    while (isDigit(peek())) advance();
    addToken(NUMBER,
        Integer.parseInt(source.substring(start, current)));
  }
</code>
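''number()'' leans on the small helpers the book defines in this section and the previous one; they carry over essentially unchanged. A sketch for reference:

<code java>
  private boolean isDigit(char c) {
    return c >= '0' && c <= '9';
  }

  // Look at the current character without consuming it.
  private char peek() {
    if (isAtEnd()) return '\0';
    return source.charAt(current);
  }

  // Consume and return the current character.
  private char advance() {
    return source.charAt(current++);
  }

  private void addToken(TokenType type) {
    addToken(type, null);
  }

  private void addToken(TokenType type, Object literal) {
    String text = source.substring(start, current);
    tokens.add(new Token(type, text, literal, line));
  }
</code>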
  
===== Section 4.7: Reserved Words and Identifiers =====
  
In [[https://craftinginterpreters.com/scanning.html#reserved-words-and-identifiers|section 4.7]] we learn to lex general identifiers; start by modifying the ''default'' case of the ''scanToken()'' method:
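The modified ''default'' case is a direct adaptation of the book's, with only the class name in the error call differing; a sketch:

<code java>
      default:
        if (isDigit(c)) {
          number();
        } else if (isAlpha(c)) {
          identifier();
        } else {
          TlaPlus.error(line, "Unexpected character.");
        }
        break;
</code>

As in the book, ''identifier()'' consumes the rest of the word and then checks whether it matches a reserved word. TLA⁺ has its own keyword list, so the lookup map changes: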
<code java>
  private static final Map<String, TokenType> keywords;

  static {
    keywords = new HashMap<>();
    keywords.put("ELSE",       ELSE);
    keywords.put("ENABLED",    ENABLED);
    keywords.put("FALSE",      FALSE);
    keywords.put("IF",         IF);
    // ... remaining keywords elided, followed by a second map named symbols
    // that translates LaTeX-style operators; a few of its entries:
    symbols.put("\\forall",   FOR_ALL);
    symbols.put("\\in",       IN);
    symbols.put("\\lnot",     NOT);
    symbols.put("\\neg",      NOT);
    symbols.put("\\lor",      OR);
  }
</code>
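The scanning code for these backslash operators is elided above, but it follows the same shape as keyword lookup: on seeing a backslash, consume the run of letters after it and look the whole lexeme up in the ''symbols'' map. A sketch, where the method name ''symbol()'' and the error message are illustrative rather than this page's exact code:

<code java>
      // In scanToken()'s switch, dispatch on the backslash character:
      // case '\\': symbol(); break;

  private void symbol() {
    // start still points at the backslash; consume the letters after it.
    while (isAlpha(peek())) advance();

    String text = source.substring(start, current);
    TokenType type = symbols.get(text);
    if (type == null) {
      TlaPlus.error(line, "Unexpected symbol '" + text + "'.");
    } else {
      addToken(type);
    }
  }
</code>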
Isn't it amazing how quickly this is coming together?
The simplicity of the required code is one of the great wonders of language implementation.
If you got lost somewhere along the way, you can find a snapshot of the code on this page [[https://github.com/tlaplus/devkit/tree/main/2-scanning|here]].
Next we learn how to collect our tokens into a parse tree!
Continue the tutorial at [[creating:expressions|Parsing Constant TLA⁺ Expressions]].
  
===== Challenges =====
  
Here are some optional challenges to flesh out your TLA⁺ scanner, roughly ranked from simplest to most difficult.
You should save a copy of your code before attempting these.
  - Our error reporting functionality only reports the line on which the error occurs, even though we now also track the column. Modify the error reporting functions to pass through and print out the column where the error occurred.
  - Implement token recognition for the ''---- MODULE Name ----''/''===='' header and footer. The ''----'' and ''===='' tokens must be of length four or greater. It can be tricky to gracefully integrate their logic with the existing ''MINUS'' and ''EQUAL''/''EQUAL_EQUAL'' case blocks.
  - Modify ''number()'' and ''identifier()'' to properly implement TLA⁺ identifiers, which can consist of any string of alphanumeric or underscore characters as long as at least one character is alphabetical. This corresponds to the regex ''[a-zA-Z0-9_]*[a-zA-Z][a-zA-Z0-9_]*''.
  - Add support for nestable block comments like ''(* text (* text *) text *)''. This requires a much deeper modification of the scanner than might first be apparent. Currently, our lexing grammar is //regular//: it does not need to store any unbounded state to lex everything. However, to properly handle block comments you'll need to add another class field like ''int block_comment_nest_level = -1'' and increment/decrement it as you encounter ''(*'' and ''*)'' tokens. In technical terms, this addition makes the lexing grammar context-free instead of regular.
  - Similar to nested block comments, add support for extramodular text & nested modules. TLA⁺ files are properly supposed to ignore all text outside of modules, treating it the same as comments. Lexing TLA⁺ tokens should only start after reading ahead and detecting a ''---- MODULE'' sequence. Then, after detecting termination of the module with ''===='', the scanner should revert to ignoring the text. Supporting nested modules complicates this further, since you'll need to keep track of the module nesting level to know when you can start ignoring text again.
  - Add Unicode support. Java represents Unicode codepoints as ''int''s rather than ''char''s, so you'll be iterating over an array of ''int''s instead of the characters of a string. Character literals can still be directly compared against ''int''s, so our ''case'' statement should be nearly unchanged; see the sketch after this list. Look at the Java 8 [[https://docs.oracle.com/en/java/javase/23/docs/api/java.base/java/lang/CharSequence.html#codePoints()|string.codePoints() method]]. Then add support for Unicode symbol variants like ''≜'', ''∈'', ''∧'', ''∨'', ''∃'', and ''∀''. Our code already reads files as UTF-8, so the decoding side is handled.
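Here's a minimal sketch of what iterating over code points instead of ''char''s looks like for the last challenge; the comparison against a character literal works because ''char'' widens to ''int'':

<code java>
// A sketch of scanning over Unicode code points rather than chars.
int[] codepoints = source.codePoints().toArray();
for (int c : codepoints) {
  if (c == '∧') {
    // Treat the Unicode conjunction symbol the same as the ASCII /\ operator.
  }
}
</code>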

[[creating:start|< Previous Page]] | [[creating:start#table_of_contents|Table of Contents]] | [[creating:expressions|Next Page >]]
  