Defining languages with sets - A horrible yet correct alternative to grammars

When you want to define a language you usually describe its syntax with a grammar. For example:

term:
    'true'
    | 'false'
    | '[+-]?[0-9]' //Integer number
    | 'plusOne' term
    | 'minusOne' term
    | 'isZero' term
    | 'if' term 'then' term 'else' term

This language includes plusOne 0, plusOne plusOne 0, if isZero minusOne 1 then 42 else -5, etc. The thing is, when I see a grammar I tend to focus on the parse trees it would produce with parser generators like ANTLR4. And the convenience of ANTLR4 makes me forget that the grammar doesn't define the parser, it defines the language (though yeah, it also defines the structure the parse trees will have).

            |
  /    /    |  \   \   \
if isZero then 42 else -5
     |
  minusOne
     |
     1

(Now I'm an ASCII artist. I'll turn this tree into a NFT)

In the first post of my series about making a regex engine, I wrote:

A language is just a set of strings, for example the regex a* defines the language $L = \{\varepsilon, a, aa, aaa, aaaa...\}$ which can also be expressed as $L = \{a^n : n >=0 \}$ .

The same way the regex a* defines a set a grammar also defines one. And the same way you use the regex to build recognizer for $L$ , you use the grammar to build a recognizer (a.k.a parser) for the language it defines.

It's no coincidence that regex and grammars are so similar. Both are methods of defining formal languages, but regex are limited to regular languages, which are the lowest tier of languages in Chomsky's hierarchy. You can even use a regular grammar to define a regular language instead of a regex, but they are more inconvenient.

Defining a language with sets

Let's get to the point of this post. Given the above grammar of term, is it possible to define its language by directly defining the set of the language? Yes, though it's a bit harder than with regular expressions since grammar rules can be recursive. But you can still make an inductive definition:

\begin{array}{cccc} S_0 & = & \phantom{\cup} & \{true, false \} \cup \mathbb{Z} \\ S_{i+1} & = & & \{true, false \} \cup \mathbb{Z} \\ & & \cup & \{\textnormal{plusOne} \ t_1, \textnormal{minusOne} \ t_1, \textnormal{isZero} \ t_1 \ | \ t_1 \in S_i\} \\ & & \cup & \{\textnormal{if} \ t_1 \ \textnormal{then} \ t_2 \ \textnormal{else} \ t_3 \ | \ t_1, t_2, t_3 \in S_i\} \end{array}

$S_0$ includes literals (as a small reminder $\mathbb{Z}$ is the set of integers). $S_1$ includes literals and calls to functions with a literal argument, like plusOne 0, I'll call them simple calls. $S_2$ contains again literals, simple calls and one level nested calls, like plusOne plusOne 0, $S_3$ contains literals, simple calls and two level nested calls, like if isZero minusOne 1 then 42 else -5. You get the idea, each level includes the next level of nested calls.

You might have noticed that every $S_i$ contains the literals ( $\{true, false\} \ \cup \ \mathbb{Z}$ ) and therefore simple calls too. This is necessary because otherwise you couldn't use them in $S_i$ where $i > 1$ . Another approach would be doing something like $L = \{true, false, 0\}$ and in every rule: $t_1 \in S_i \cup L$ , but that would be a bit confusing and I'm not sure if it makes sense mathematically.

Of course, each $S_i$ is just a piece of the cake. The complete language is defined by the union of all these sets:

\begin{aligned} S = \bigcup\limits_{i=0}^{\infty} S_{i} \end{aligned}

Depth and tree structure (part 1)

In a instinctive way I assumed that the set definition would list the strings that are contained in a language but, unlike grammars, it wouldn't define the structure of the parse trees. So it struck me when I realized the values contained in $S_i$ are correlated to the depth of the parse trees those strings would generate.

i	Tree depth
0	1
1	1, 2
2	1, 2, 3
3	1, 2, 3, 4
4	1, 2, 3, 4, 5
5	1, 2, 3, 4, 5, 6

For example, above I (horribly) drawed the tree of if isZero minusOne 1 then 42 else -5. It has a depth of 3, so the first $S_i$ where it appears is $S_2$ . Overall $S_i$ contains all the strings with depth lower or equal than $i+1$ .

The reason that each $S_i$ also contains trees of depth $0,1,2,3,... i$ is that each $S_i$ contains literals, which have depth 1. So $S_2$ contains elements of depth 2 because $S_1$ contains elements of depth 1 (literals). And therefore $S_3$ also contain elements of depth 3. You get the idea. I'll make an attempt for a formal proof of this in a collapsible below.

So what's the point of all of this? That even though the complete set $S$ doesn't contain structure, the different values of $S_i$ have traces of it. And it makes sense, since I defined the sets using the grammar as a base. For example, if instead of:

term:
    ...
    | 'if' term 'then' term 'else' term;

I wrote:

term:
    ...
    | 'if' term 'then' term elseClause;

elseClause: 'else' term;

The set $S$ would have a different definition but its value would be the same. The values of $S_i$ would change to reflect the different definition, but the union of all $S_i$ wouldn't change.

But before we can see this, we need to understand how the definition of the set changes when the grammar has multiple rules.

Defining a grammar with multiple rules as a set

The example above, which I ~~stole~~ extracted from Types and Programming Languages, defines the set given a grammar with a single rule. But grammars usually have more than one. I guess that every grammar could be transformed into a single rule grammar, at the cheap cost of the sanity of the developer, so that's a way of generating the set of the language. But that's boring, I prefer to interpret it in a different way: Each rule defines it's own set, and therefore, it's own sublanguage.

Warning: Everything below here is completely made up by me, so it might not be mathematically correct. Feel free to scream at the screen while I destroy every mathematical rule in existence

Let's use a simple grammar with two rules.

expr:
    expr ('*'|'/') expr
    | expr ('+'|'-') expr
    | ID '(' arguments ')'
    | NUMBER, STRING, BOOL;

arguments:
    expr (',' arguments)+;

This is a bit tricky because there is recursion. So this recursion is also be reflected in the definition of the sets. $E$ will be the set of expressions and $A$ the set of arguments:

\begin{array}{cccc} E_{0,0} & = & \phantom{\cup} & \{\textnormal{NUMBER}, \textnormal{STRING}, \textnormal{BOOL}\} \\ E_{i+1,\ j} & = & & \{\textnormal{NUMBER}, \textnormal{STRING}, \textnormal{BOOL}\} \\ & & \cup & \{\ t_1 \textnormal{*} \ t_2, \ t_1 \textnormal{/} \ t_2, \ t_1 \textnormal{+} \ t_2, \ t_1 \textnormal{-} \ t_2, | \ t_1, t_2 \in E_{i,j}\} \\ & & \cup & \{\textnormal{<ID>} ( \ t_1 \ ) | \ t_1 \in A_{i+1, \ j}\} \end{array} \begin{array}{cccc} A_{0,0} & = & \phantom{\cup} & \emptyset \\ A_{i+1, \ j} & = & & \{ t_1 \ \textnormal{'},\textnormal{'} \ t_2 \ | \ t_1 \in E_i, t_2 \in A_{i+1, j-1} \} \\ & & \cup & \{\ t_1 | \ t_1 \in E_{i, \ j}\} \\ \end{array}

\begin{aligned} E = \bigcup\limits_{i=0 \ j=0}^{\infty} E_{i,j} \end{aligned}

There are several really important details:

Here there are two variables, $i$ and $j$ . This is because I want to separate the length of the arguments set from its depth.
- $A_{2,5}$ contains up to 5 arguments in $E_{1, 5}$ .
- $A_{5, 2}$ contains up to 2 arguments in $E_{4,2}$ .
- I could have used a single variable $i$ to denote both width and depth, but that would break the argument that I plan to do on the next section. See the optional section below for more details.
The definition of $E_{i+1, \ j}$ depends on $A_{i+1, \ j}$ , and at the same time $A_{i+1, \ j}$ depends on $E_{i,\ j}$ . This might look like a circular definition, but the arguments set depends on the previous level of the expression sets. So it's more like a spiral definition 😎.
- For it to work, $E_{i, j}$ must be a subset of $E_{i+1, \ j}$ . Otherwise there could be values that are valid for the language but aren't included in the set. I added a "proof" of this at the end of this section.
In the definition of $A_{i+1, \ j}$ , the $\{\ t_1 | \ t_1 \in E_{i,j}\}$ is the same as just saying $\ \cup \ E_{i,j}$ . So $E_{i,j}$ is a subset of $A_{i+1,\ j}$ . I use this syntax to keep it closer to the grammar definition.
I decided to leave $A_0$ as an empty set, though I guess I could just made $A_0 = E_0$ , but that would be confusing since the overall rule is that $E_i \subset A_{i+1}$ but there's no $E_{-1}$

And as before, the value of $i$ is correlated to the depth of the parse tree. So even though my definition of the sets might be wrong, it doesn't look horribly wrong. I hope...😇

             |
   foo ( ArgumentsNode )                                                <----- E4 (& A5) (The value of every j is 2,
        |                                                                                 so I'll ignore it)
            var ( ArgumentsNode )                                       <----- E3 (& A4)
                        |
                     xz ( ArgumentsNode )                               <----- E2 (& A3)
                                |
                             yw ( ArgumentsNode )                       <----- E1 (& A2)
                                      /   \
                                     3     2                            <----- E0 (& A1)

Optional - More details on why I separated i and j

One of the main points I want to make of this post is that, in the same way two grammars can denote the same language but generate parse trees with different structure, two sets can be equal but have different definitions of each $E_i$ change. So even though $E$ doesn't contain any structure, the values of each $E_i$ show traces of the structure, even if that's not enough to make a parse tree.

And the basis of this hipothesis is that $i$ always changes with depth. Previously I had define $A_i$ like this:

\begin{array}{cccc} A_{0} & = & \phantom{\cup} & \emptyset \\ A_{i+1} & = & & \{ t_1 \ \textnormal{'},\textnormal{'} \ t_2 \ | \ t_1 \in E_i, t_2 \in A_{i+1} \} \\ & & \cup & \{\ t_1 | \ t_1 \in E_{i}\} \\ \end{array}

The problem is that the $i$ doesn't only affect the values of the arguments it can use ( $E_i$ ), but also the number of arguments. Therefore a simple call like foo(1,2,3,4,5,6,7,8) would be in $A_8$ even though all of the arguments are in $E_0$ . By having $E_{i,j}$ the previous call is in $E_{1,8}$ and the $i$ remains unaffected by the number of arguments.

In a way the $j$ also denotes the structure of the parse tree. By allowing an arbitrary number of arguments in the same $i$ it's like if I had define the grammar like this:

arguments:
    expr (',' arguments)+;

While correlating the depth and width in the same $i$ would be like defining it like this:

arguments:
    expr ',' arguments | expr;

In this case the value of $i$ is the same as the depth of the tree. That's because unlike the first example, here you can't write a call like foo(3) till $E_{2,1}$ . But in $S_1$ you could already have plusOne 0. In case you were wondering, $E_{1,1}$ does contains arithmetic calls, like 3 + 2. But since those don't call the arguments rule they aren't interesting for this example, so I casually ignored them.

Optional - Are E_i and A_i subsets of E_i+1 and A_i+1?

The current definition of $A_{i+1, \ j}$ only works in the assumption that $E_{i,j} \ \subset \ E_{i+1, \ j}$ . Otherwise, the fact that $A_{i+1}$ uses values of $E_{i,j}$ would mean that there are values that the set wouldn't accept but that are valid in the grammar.

To check this my first thought was: What happens if I grab a really long nested tree like:

a(                                   )    <----- E8 (& A9) (j = 1 at every level)
    |
    b(                               )    <----- E7 (& A8)
        |
        c(                           )    <----- E6 (& A7)
            |
            d(                       )    <----- E5 (& A6)
                |
                e(                   )    <----- E4 (& A5)
                    |
                    f(               )    <----- E3 (& A4)
                        |
                        g(           )    <----- E2 (& A3)
                            |
                            h(       )    <----- E1 (& A2)
                                 |
                                 3        <----- E0 (& A1)

And I try to add another shorter argument i(j(k(4))):

i(             )    <----- E3 (& A4)  (j = 1 at every level)
    |
    j(         )    <----- E2 (& A3)
        |
        k(     )    <----- E1 (& A2)
            |
            4       <----- E0 (& A1)

Since i(j(k(4))) is in $E_{3,1}$ , adding it as a first argument would create a string that is in the language but not in the set: a(i(j(k(4))), b(c(d(e(f(g(h(3)))))))), right? In an unexpected turn of events, no. Even though the first $i$ in which i(j(k(4))) is contained is $E_{3,1}$ , since every $E_{i,j}$ can access the literals, it means that it's also defined in $E_{4,1}$ :

i(             )    <----- E4 (& A4)  (j = 1 at every level)
    |
    j(         )    <----- E3 (& A3)
        |
        k(     )    <----- E2 (& A2) Note that in the previous tree the k() was in E1, not in E2
            |
            4       <----- E0 (& A1 & A2)

And in $E_{5,1}$ :

i(             )    <----- E5 (& A4)   (j = 1 at every level)
    |
    j(         )    <----- E4 (& A3)
        |
        k(     )    <----- E3 (& A2) Now k() is in E3
            |
            4       <----- E0 (& A1 & A2 & A3)

Even though this isn't a mathematical proof it infers that $E_{i,j} \ \subset \ E_{i+1, \ j} \ , \ A_{i,j} \ \subset \ A_{i+1, \ j}$ for every $i > 0$ .

I'll try a probably wrong attempt at a formal proof using the depth of the trees. For simplicity I'll ignore the function calls, and the $j$ with it. Since $j$ is the maximum number of arguments allowed, and the arguments include every possible combination of $E_{i,j}$ I think it's decently obvious that $E_{i,j} \subset E_{i,j+1}$ (also when I wrote this I still hadn't divided $i$ in $i$ and $j$ and I don't want to rewrite half of the post for the hundredth time).

As a shortcut I'll define td as a tree depth operation and $L = \{\textnormal{NUMBER}, \textnormal{STRING}, \textnormal{BOOL}\}$

Fist let's define how depth increases with every $i$ .

\forall t_x \in E_{i} \ \ td(t_x) = \begin{cases} 0 &\text{if } t_x \in L \\ \text{max} (td(t_1), td(t_2)) + 1 \ | \ t_1, t_2 \in E_{i-1} & \text{if } t_x \not \in L \end{cases}

Then, if $E_i$ contains all the possible values with depth $x$ , then $E_{i+1}$ contains all the possible values with depth $x+1$ . And since $E_0$ contains all the elements of depth 1, then $E_1$ contains all the elements of depth 2.

\begin{aligned} & P(X) = \forall t_1, t_2 \in E_i / \ max(td(t_1),td(t_2)) = d \ \exists \ t_z \in E_{i+1} \ / \ t_z \text{ is formed by } t_1, t_2 \\ & \implies \text{ if } E_i \text{ contains all possible elements of depth } d \text{ then } E_i+1 \text{ contains every element of depth } d+1 \\ & \\ & I(X) = \forall t_i \ / \ td(t_i) = 0, t_i \in E_0 \\ & \\ & P(X) \ \& \ I(X) \implies \forall t_x / \ td(t_x) = 2 \ , \ t_x \in E_1 \\ \end{aligned}

We could keep this going and infer that $E_2$ contains all elements of depth 3, etc. But here's the twist. By definition $E_0$ is a subset of every $E_i$ , and therefore every $E_i$ also contains trees of depth 2:

\begin{aligned} P(X) \ \& \ I(X) \ \& \ \forall \ i \ E_0 \subset E_i \implies \forall i > 0, \ \forall \ t_x / \ td(t_x) = 2 , t_x \in E_i \end{aligned}

And by induction, every $E_i$ where $i > 1$ also contains all the elements of depth 3.

\begin{aligned} P(X) \ \& \ \forall \ t_x / \ td(t_x) = 2 , t_x \in E_i \implies \forall i > 1, t_x / \ td(t_x) = 3 , t_x \in E_i \end{aligned}

The overall conclusion is that each $E_i$ contains all the elements from depth 1 to i+1:

\forall i, t_z / \ td(t_z) \in \{0, 1, 2... i+1\}, \ \ t_z \in E_i

And therefore $E_i \subset E_{i+1}$ and $A_i$ won't miss any value.

(Yes, I really need to review math notation... and math in general. I fell like I'm making everything up)

Depth and tree structure (part 2)

With this ~~random made up theory~~, let's go back to how the way you define the set affects the generated parse tree.

term2:
    'true'
    | 'false'
    | '[+-]?[0-9]' //Integer number
    | 'plusOne' term
    | 'minusOne' term
    | 'isZero' term
    | 'if' term 'then' term elseClause;

elseClause:
    term 'else' term;

I'll call T the set defined by term2 and E the set defined by elseClause. The result is similar to the sets created by expr and arguments, but luckily this one doesn't allow an arbitrary number of arguments, so there is no need for a $j$ .

\begin{array}{cccc} T_0 & = & \phantom{\cup} & \{true, false \} \cup \mathbb{Z} \\ T_{i+1} & = & & \{true, false \} \cup \mathbb{Z} \\ & & \cup & \{\textnormal{plusOne} \ t_1, \textnormal{minusOne} \ t_1, \textnormal{isZero} \ t_1 \ | \ t_1 \in T_i\} \\ & & \cup & \{\textnormal{if} \ t_1 \ \textnormal{then} \ t_2 \ \ t_3 \ | \ t_1, t_2\in T_i \ , t_3 \in E_{i+1}\} \end{array} \begin{array}{cccc} E_0 & = & \phantom{\cup} & \emptyset \\ E_{i+1} & = & & \{\ \textnormal{else} \ t_1 | \ t_1 \in T_i\} \end{array}

\begin{aligned} T = \bigcup\limits_{i=0}^{\infty} T_{i} \end{aligned}

For the input if isZero 1 then 42 else plusOne -5 the generated tree would be:

             |
   if isZero then 42 elseClause             <----- T2
        |                |
        1               else plusOne        <----- E2 (because plusOne -5 is in T1)
                                |
                                -5          <----- T0

With this new grammar the parse tree has a depth of 3, but with the first one it was 2. And accordingly, this string is in $T_2$ but in $S_1$ .

I guess it makes sense since the induction rule (the definition of $S_{i+1} \$ / $\ T_{i+1}$ ) is almost identical to the grammar. It's just a different notation.

Still, even though $S_i$ is correlated to the depth of the tree, it doesn't define the specific structure. For example, it can't differenciate between these two:

         |                    |
      /  |  \              /  |  \
      3  *   +            *   +   1
           /   \        /  \
           2   1        3   2

Both would be in the same $E_i$ but that tells you nothing about which is the correct one.

So after all this long post, it turns out the set definition is not as powerful as the grammar? Not exactly, even though the value of $i$ tells you nothing about the structure, you could still take an arbitrary decision based on how the set is defined:

$\{\ t_1 \textnormal{*} \ t_2, \ t_1 \textnormal{/} \ t_2, \ t_1 \textnormal{+} \ t_2, \ t_1 \textnormal{-} \ t_2, | \ t_1, t_2 \in E_i\}$

The same way ANTLR4 prioritizes the subrules by their order, I could decide that since I wrote the multipication and division first the * has priority. Still, at this point I think I'm rambling too much.

The end

After all this experiments I've realized that defining languages as a set isn't that different as using a grammar. In fact I guess you could write a parser generator that takes these definitions as inputs. Though grammars are way more convenient. And of course there are probably a million things I haven't taken into account.

I guess my next project will be a parser generator from set defintion! Relax, I'm just joking... or am I?