Differences
This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
creating:jlists [2025/04/21 20:21] – Created page with introduction ahelwer | creating:jlists [2025/04/27 21:57] (current) – Finished first draft ahelwer | ||
---|---|---|---|
Line 14: | Line 14: | ||
/\ B | /\ B | ||
/\ \/ C | /\ \/ C | ||
- | \/ D | + | \/ D |
</ | </ | ||
Here, ''/ | Here, ''/ | ||
- | If a set of conjuncts or disjuncts are vertically aligned (have the same start column) then those juncts are grouped together. | + | If a set of conjuncts or disjuncts |
So, we want to parse the above example as '' | So, we want to parse the above example as '' | ||
This is a very nice language feature, but parsing these lists properly is the greatest challenge we've yet faced. | This is a very nice language feature, but parsing these lists properly is the greatest challenge we've yet faced. | ||
It is only a slight exaggeration to say the purpose of this entire tutorial series is to teach you how to parse these jlists. | It is only a slight exaggeration to say the purpose of this entire tutorial series is to teach you how to parse these jlists. | ||
Let's begin! | Let's begin! | ||
+ | |||
+ | ====== A First Attempt ====== | ||
+ | |||
+ | Before we increase the complexity of our parser, let's convince ourselves it is necessary. | ||
+ | We will attempt to parse jlists using familiar recursive descent techniques, then see where it goes wrong. | ||
+ | At first this seems like it should be possible! | ||
+ | Our jlists resemble the set constructor '' | ||
+ | There isn't really a terminating token but that does not seem to be an obstacle. | ||
+ | So, try adding this to the '' | ||
+ | |||
+ | <code java [highlight_lines_extra=" | ||
+ | return new Expr.Variadic(operator, | ||
+ | } | ||
+ | |||
+ | if (match(AND, OR)) { | ||
+ | Token op = previous(); | ||
+ | List< | ||
+ | do { | ||
+ | juncts.add(expression()); | ||
+ | } while (matchBullet(op.type, | ||
+ | return new Expr.Variadic(op, | ||
+ | } | ||
+ | |||
+ | throw error(peek(), | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Remember our special '' | ||
+ | Now we put it to use in a new helper function for the '' | ||
+ | <code java> | ||
+ | private boolean matchBullet(TokenType op, int column) { | ||
+ | if (peek().type == op && peek().column == column) { | ||
+ | advance(); | ||
+ | return true; | ||
+ | } | ||
+ | |||
+ | return false; | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | '' | ||
+ | In this way our jlist parsing logic will only add another expression if it finds another vertically-aligned ''/ | ||
+ | This all... kind of works? | ||
+ | You can parse a surprisingly broad set of jlists with just our simple logic! | ||
+ | Try it out; here are some jlists that can now be parsed: | ||
+ | <code haskell> | ||
+ | op == | ||
+ | /\ 1 | ||
+ | /\ 2 | ||
+ | </ | ||
+ | <code haskell> | ||
+ | op == | ||
+ | \/ 1 | ||
+ | \/ 2 | ||
+ | </ | ||
+ | <code haskell> | ||
+ | op == | ||
+ | /\ 1 | ||
+ | /\ \/ 2 | ||
+ | \/ 3 | ||
+ | </ | ||
+ | Make them as nested and wild as you like, our code will handle them! | ||
+ | So are we done? | ||
+ | Unfortunately not. | ||
+ | Here's the proverbial wrench: TLA⁺ also allows '' | ||
+ | We skipped defining them in the expressions chapter where we learned how to parse operators of different precedences; | ||
+ | |||
+ | <code java [highlight_lines_extra=" | ||
+ | private static final Operator[] operators = new Operator[] { | ||
+ | new Operator(PREFIX, | ||
+ | new Operator(PREFIX, | ||
+ | new Operator(PREFIX, | ||
+ | new Operator(INFIX, | ||
+ | new Operator(INFIX, | ||
+ | new Operator(INFIX, | ||
+ | </ | ||
+ | |||
+ | Just like that, our jlist parsing code no longer works. | ||
+ | '' | ||
+ | So, a jlist like: | ||
+ | <code haskell> | ||
+ | op == | ||
+ | /\ 1 | ||
+ | /\ 2 | ||
+ | /\ 3 | ||
+ | </ | ||
+ | is parsed as a jlist with a single conjunct, the expression '' | ||
+ | This is awful! | ||
+ | How can we fix it? | ||
+ | |||
+ | ====== Beyond Context-Free ====== | ||
+ | |||
+ | The answer to our parsing problem is deceptively simple: before consuming a ''/ | ||
+ | If so, we prioritize treating the ''/ | ||
+ | |||
+ | For readers who have taken a computer science class in formal languages, alarm bells should be going off - changing the parse behavior depending on the current context is a //big// change! | ||
+ | In theoretical terms, our parser is currently // | ||
+ | The context - or surrounding code being parsed - does not change how '' | ||
+ | However, if we want to parse jlists correctly we will need to break this limitation and move to the more powerful // | ||
+ | |||
+ | What does this change mean in practical terms? | ||
+ | First, it means we cannot easily write the formal TLA⁺ grammar in Backus-Naur Form (BNF) as we learned to do in [[https:// | ||
+ | Although grammar notation [[https:// | ||
+ | Thus formal TLA⁺ grammars exclude jlists and use BNF to define the non-jlist parts of the language, then use plain language to describe how jlists work. | ||
+ | Second - returning to our parser implementation here - it means that our '' | ||
+ | |||
+ | Ultimately the shape of this state is a stack of nested jlists. | ||
+ | Each entry in the stack records the type of jlist (conjunction or disjunction) and the column of its vertical alignment. | ||
+ | When we start parsing a new jlist, we push its details to this stack. | ||
+ | When we finish parsing a jlist, we pop it from the stack. | ||
+ | The top of this stack is the " | ||
+ | |||
+ | Some of the logic here gets a bit finicky, so let's encapsulate it within a class. | ||
+ | Create a new file called '' | ||
+ | <code java> | ||
+ | package tla; | ||
+ | |||
+ | import java.util.ArrayDeque; | ||
+ | import java.util.Deque; | ||
+ | |||
+ | class JListContext { | ||
+ | |||
+ | private record JListInfo(TokenType type, int column) { } | ||
+ | |||
+ | private final Deque< | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Add an instance of '' | ||
+ | |||
+ | <code java [highlight_lines_extra=" | ||
+ | private final List< | ||
+ | private int current = 0; | ||
+ | private final boolean replMode; | ||
+ | private final JListContext jlists = new JListContext(); | ||
+ | </ | ||
+ | |||
+ | In the '' | ||
+ | |||
+ | <code java> | ||
+ | public void startNew(Token op) { | ||
+ | stack.push(new JListInfo(op.type, | ||
+ | } | ||
+ | |||
+ | public void terminateCurrent() { | ||
+ | stack.pop(); | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Back in the '' | ||
+ | <code java [highlight_lines_extra=" | ||
+ | if (match(AND, OR)) { | ||
+ | Token op = previous(); | ||
+ | jlists.startNew(op); | ||
+ | List< | ||
+ | do { | ||
+ | juncts.add(expression()); | ||
+ | } while (matchBullet(op.type, | ||
+ | jlists.terminateCurrent(); | ||
+ | return new Expr.Variadic(op, | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Now we write a critical method. | ||
+ | In '' | ||
+ | <code java> | ||
+ | public boolean isNewBullet(Token op) { | ||
+ | JListInfo current = stack.peekFirst(); | ||
+ | return current != null | ||
+ | && current.type == op.type | ||
+ | && current.column == op.column; | ||
+ | } | ||
+ | </ | ||
+ | '' | ||
+ | If we are not currently parsing a jlist, this element will be '' | ||
+ | Otherwise, it checks the current jlist attributes against the given token. | ||
+ | If the token type & column both match, the method returns true. | ||
+ | |||
+ | Where should call this powerful method? | ||
+ | There are actually a few possibilities, | ||
+ | <code java [highlight_lines_extra=" | ||
+ | private Operator matchOp(Fix fix, int prec) { | ||
+ | if (jlists.isNewBullet(peek())) return null; | ||
+ | for (Operator op : operators) { | ||
+ | if (op.fix == fix && op.lowPrec == prec) { | ||
+ | if (match(op.token)) return op; | ||
+ | } | ||
+ | } | ||
+ | |||
+ | return null; | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | This works! | ||
+ | We can parse jlists again! | ||
+ | It's a bit tricky to figure out what we actually did, so let's take a close look at our infix operator parsing code in '' | ||
+ | <code java> | ||
+ | Expr expr = operatorExpression(prec + 1); | ||
+ | while ((op = matchOp(INFIX, | ||
+ | Token operator = previous(); | ||
+ | Expr right = operatorExpression(op.highPrec + 1); | ||
+ | expr = new Expr.Binary(expr, | ||
+ | if (!op.assoc) return expr; | ||
+ | } | ||
+ | </ | ||
+ | Consider what happens when trying to parse this TLA⁺ snippet: | ||
+ | <code haskell> | ||
+ | op == | ||
+ | /\ 1 | ||
+ | /\ 2 | ||
+ | </ | ||
+ | The parser will: | ||
+ | - look for an expression following '' | ||
+ | - find ''/ | ||
+ | - find '' | ||
+ | - call '' | ||
+ | - in '' | ||
+ | - eventually return '' | ||
+ | - in the jlist loop, call '' | ||
+ | |||
+ | If the call to '' | ||
+ | But we pre-empted it! | ||
+ | So now ''/ | ||
+ | |||
+ | ====== Termination ====== | ||
+ | |||
+ | We're close, but we're not //quite// there yet. | ||
+ | Consider the following TLA⁺ snippet: | ||
+ | <code haskell> | ||
+ | op == | ||
+ | /\ 1 | ||
+ | /\ \/ 2 | ||
+ | /\ 3 | ||
+ | </ | ||
+ | Our code parses this as though it's actually: | ||
+ | <code haskell> | ||
+ | op == | ||
+ | /\ 1 | ||
+ | /\ \/ (2 /\ 3) | ||
+ | </ | ||
+ | We thought we had defeated the tyranny of infix op ambiguity, but we were wrong! | ||
+ | What happened? | ||
+ | Well, our call to '' | ||
+ | The enclosing ''/ | ||
+ | You might be tempted to fix this by modifying '' | ||
+ | |||
+ | Each junct' | ||
+ | Put another way, if you put a token on the next line to the left of the ''/ | ||
+ | That enables you to do neat things like: | ||
+ | <code haskell> | ||
+ | op == | ||
+ | /\ A | ||
+ | /\ B | ||
+ | = C | ||
+ | </ | ||
+ | which should be parsed as '' | ||
+ | The '' | ||
+ | To check this, we need another helper method in '' | ||
+ | <code java> | ||
+ | public boolean isAboveCurrent(Token tok) { | ||
+ | JListInfo current = stack.peekFirst(); | ||
+ | return current == null || current.column < tok.column; | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | We then add a call to '' | ||
+ | <code java [highlight_lines_extra=" | ||
+ | private boolean check(TokenType type) { | ||
+ | if (!jlists.isAboveCurrent(peek())) return false; | ||
+ | if (isAtEnd()) return false; | ||
+ | return peek().type == type; | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | What does this do? | ||
+ | Well, within a given call to '' | ||
+ | If an expression is being parsed within the context of a jlist, and it encounters a token on the next line that is not to the right of the current jlist alignment, control flow is ultimately kicked back up to the jlist loop which then calls '' | ||
+ | Parsing can then resume as expected. | ||
+ | |||
+ | There is one other benefit we've unlocked. | ||
+ | It isn't enough to parse valid jlists, we must also reject invalid ones! | ||
+ | The TLA⁺ language specification requires that the parser reject attempts to defeat vertical alignment encapsulation by abusing delimiters like '' | ||
+ | What this means is that inputs like the following should fail to parse: | ||
+ | <code haskell> | ||
+ | op == | ||
+ | /\ 1 | ||
+ | /\ (2 | ||
+ | ) | ||
+ | /\ 3 | ||
+ | </ | ||
+ | Indeed, our parser will detect an error here. | ||
+ | The parentheses parsing logic will call '' | ||
+ | This gives rise to a parse error. | ||
+ | |||
+ | ====== Error Recovery ====== | ||
+ | |||
+ | Talk of parsing errors nicely segues us onto the topic of error recovery. | ||
+ | Recall that on error, we call '' | ||
+ | Jlists complicate this a bit! | ||
+ | What happens if an error occurs while parsing a jlist and we enter '' | ||
+ | Well, nonsensical things happen. | ||
+ | To fix this we just wipe out our jlist stack at the top of '' | ||
+ | <code java [highlight_lines_extra=" | ||
+ | private void synchronize() { | ||
+ | jlists.dump(); | ||
+ | advance(); | ||
+ | |||
+ | while (!isAtEnd()) { | ||
+ | </ | ||
+ | |||
+ | This calls a new helper we'll define in '' | ||
+ | <code java> | ||
+ | public void dump() { | ||
+ | stack.clear(); | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Done. | ||
+ | You've successfully parsed vertically-aligned conjunction & disjunction lists in TLA⁺! | ||
+ | This puts you in rarified air. | ||
+ | Only a handful of people in the world possess this knowledge, and now you are among them. | ||
+ | If your code got out of sync during this tutorial, you can find its expected state [[https:// | ||
+ | Continue on the [[creating: | ||
+ | |||
+ | ====== Challenges ====== | ||
+ | |||
+ | Here are a number of optional challenges, in roughly increasing levels of difficulty. | ||
+ | |||
+ | - Python uses indentation to determine statement membership in a code block. Does this make Python context-sensitive? | ||
+ | - It's tempting to summarize this chapter as us solving the jlist parsing problem by making jlists have higher precedence than infix operators, but that is not quite the case. Think carefully about what precedence means; is there a difference between what might be called //lexical// precedence - where one interpretation of a token takes higher precedence than another - and parsing precedence? Did we make use of that here? What are some ways that parsers can deal with the problem of the same token having multiple possible meanings? | ||
+ | - Jlists are not the only context-sensitive language construct in TLA⁺. Nested proof steps are another. Take some time to read the [[https:// | ||
+ | - Write unit tests for your jlist parsing code. Think of every weird jlist case you can. Look at [[https:// | ||
+ | - If you are familiar with the [[https:// | ||
+ | - Most courses in formal languages skip directly from context-free grammars to Turing machines, but this misses a number of automata of intermediate power. See whether it is possible to use [[https:// | ||
[[creating: | [[creating: | ||