Let’s Build A Simple Interpreter. Part 14: Nested Scopes and a Source-to-Source Compiler.

Date

Only dead fish go with the flow.

As I promised in the last article, today we’re finally going to do a deep dive into the topic of scopes.

This is what we’re going to learn today:

  • We’re going to learn about scopes, why they are useful, and how to implement them in code with symbol tables.
  • We’re going to learn about nested scopes and how chained scoped symbol tables are used to implement nested scopes.
  • We’re going to learn how to parse procedure declarations with formal parameters and how to represent a procedure symbol in code.
  • We’re going to learn how to extend our semantic analyzer to do semantic checks in the presence of nested scopes.
  • We’re going to learn more about name resolution and how the semantic analyzer resolves names to their declarations when a program has nested scopes.
  • We’re going to learn how to build a scope tree.
  • We’re also going to learn how to write our very own source-to-source compiler today! We will see later in the article how relevant it is to our discussion of scopes.

Let’s get started! Or should I say, let’s dive in!



Scopes and scoped symbol tables

What is a scope? A scope is a textual region of a program where a name can be used. Let’s take a look at the following sample program, for example:

program Main;
   var x, y: integer;
begin
   x := x + y;
end.


In Pascal, the PROGRAM keyword (case insensitive, by the way) introduces a new scope which is commonly called a global scope, so the program above has one global scope and the declared variables x and y are visible and accessible in the whole program. In the case above, the textual region starts with the keyword program and ends with the keyword end and a dot. In that textual region both names x and y can be used, so the scope of those variables (variable declarations) is the whole program:

When you look at the source code above and specifically at the expression x := x + y, you intuitively know that it should compile (or get interpreted) without a problem, because the scope of the variables x and y in the expression is the global scope and the variable references x and y in the expression x := x + y resolve to the declared integer variables x and y. If you’ve programmed before in any mainstream programming language, there shouldn’t be any surprises here.

When we talk about the scope of a variable, we actually talk about the scope of its declaration:

In the picture above, the vertical lines show the scope of the declared variables, the textual region where the declared names x and y can be used, that is, the text area where they are visible. And as you can see, the scope of x and y is the whole program, as shown by the vertical lines.

Pascal programs are said to be lexically scoped (or statically scoped) because you can look at the source code, and without even executing the program, determine purely based on the textual rules which names (references) resolve or refer to which declarations. In Pascal, for example, lexical keywords like program and end demarcate the textual boundaries of a scope:

Why are scopes useful?

  • Every scope creates an isolated name space, which means that variables declared in a scope cannot be accessed from outside of it.
  • You can re-use the same name in different scopes and know exactly, just by looking at the program source code, what declaration the name refers to at every point in the program.
  • In a nested scope you can re-declare a variable with the same name as in the outer scope, thus effectively hiding the outer declaration, which gives you control over access to different variables from the outer scope.

In addition to the global scope, Pascal supports nested procedures, and every procedure declaration introduces a new scope, which means that Pascal supports nested scopes.

When we talk about nested scopes, it’s convenient to talk about scope levels to show their nesting relationships. It’s also convenient to refer to scopes by name. We’ll use both scope levels and scope names when we start our discussion of nested scopes.


Let’s take a look at the following sample program and subscript every name in the program to make it clear:

  1. At what level each variable (symbol) is declared
  2. To which declaration and at what level a variable name refers to:

From the picture above we can see several things:

  • We have a single scope, the global scope, introduced by the PROGRAM keyword
  • Global scope is at level 1
  • Variables (symbols) x and y are declared at level 1 (the global scope).
  • integer built-in type is also declared at level 1
  • The program name Main has a subscript 0. Why is the program’s name at level zero, you might wonder? This is to make it clear that the program’s name is not in the global scope and it’s in some other outer scope, that has level zero.
  • The scope of the variables x and y is the whole program, as shown by the vertical lines
  • The scope information table shows for every level in the program the corresponding scope level, scope name, and names declared in the scope. The purpose of the table is to summarize and visually show different information about scopes in a program.

How do we implement the concept of a scope in code? To represent a scope in code, we’ll need a scoped symbol table. We already know about symbol tables, but what is a scoped symbol table? A scoped symbol table is basically a symbol table with a few modifications, as you’ll see shortly.

From now on, we’ll use the word scope both to mean the concept of a scope as well as to refer to the scoped symbol table, which is an implementation of the scope in code.

Even though in our code a scope is represented by an instance of the ScopedSymbolTable class, we’ll use the variable named scope throughout the code for convenience. So when you see a variable scope in the code of our interpreter, you should know that it actually refers to a scoped symbol table.

Okay, let’s enhance our SymbolTable class by renaming it to ScopedSymbolTable class, adding two new fields scope_level and scope_name, and updating the scoped symbol table’s constructor. And at the same time, let’s update the __str__ method to print additional information, namely the scope_level and scope_name. Here is a new version of the symbol table, the ScopedSymbolTable:

class ScopedSymbolTable(object):
    def __init__(self, scope_name, scope_level):
        self._symbols = OrderedDict()
        self.scope_name = scope_name
        self.scope_level = scope_level
        self._init_builtins()

    def _init_builtins(self):
        self.insert(BuiltinTypeSymbol('INTEGER'))
        self.insert(BuiltinTypeSymbol('REAL'))

    def __str__(self):
        h1 = 'SCOPE (SCOPED SYMBOL TABLE)'
        lines = ['\n', h1, '=' * len(h1)]
        for header_name, header_value in (
            ('Scope name', self.scope_name),
            ('Scope level', self.scope_level),
        ):
            lines.append('%-15s: %s' % (header_name, header_value))
        h2 = 'Scope (Scoped symbol table) contents'
        lines.extend([h2, '-' * len(h2)])
        lines.extend(
            ('%7s: %r' % (key, value))
            for key, value in self._symbols.items()
        )
        lines.append('\n')
        s = '\n'.join(lines)
        return s

    __repr__ = __str__

    def insert(self, symbol):
        print('Insert: %s' % symbol.name)
        self._symbols[symbol.name] = symbol

    def lookup(self, name):
        print('Lookup: %s' % name)
        symbol = self._symbols.get(name)
        # 'symbol' is either an instance of the Symbol class or None
        return symbol


Let’s also update the semantic analyzer’s code to use the variable scope instead of symtab, and remove the semantic check that was checking source programs for duplicate identifiers from the visit_VarDecl method to reduce the noise in the program output.

Here is a piece of code that shows how our semantic analyzer instantiates the ScopedSymbolTable class:

class SemanticAnalyzer(NodeVisitor):
    def __init__(self):
        self.scope = ScopedSymbolTable(scope_name='global', scope_level=1)

    ...

You can find all the changes in the file scope01.py. Download the file, run it on the command line, and inspect the output. Here is what I got:

$ python scope01.py
Insert: INTEGER
Insert: REAL
Lookup: INTEGER
Insert: x
Lookup: INTEGER
Insert: y
Lookup: x
Lookup: y
Lookup: x


SCOPE (SCOPED SYMBOL TABLE)
===========================
Scope name     : global
Scope level    : 1
Scope (Scoped symbol table) contents
------------------------------------
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      x: <VarSymbol(name='x', type='INTEGER')>
      y: <VarSymbol(name='y', type='INTEGER')>

Most of the output should look very familiar to you.

Now that you know about the concept of scope and how to implement the scope in code by using a scoped symbol table, it’s time we talked about nested scopes and more dramatic modifications to the scoped symbol table than just adding two simple fields.


Procedure declarations with formal parameters

Let’s take a look at a sample program in the file nestedscopes02.pas that contains a procedure declaration:

program Main;
   var x, y: real;

   procedure Alpha(a : integer);
      var y : integer;
   begin
      x := a + x + y;
   end;

begin { Main }

end.  { Main }

The first thing that we notice here is that we have a procedure with a parameter, and we haven’t learned how to handle that yet. Let’s fill that gap by making a quick detour and learning how to handle formal procedure parameters before continuing with scopes.*

*ASIDE: Formal parameters are parameters that show up in the declaration of a procedure. Arguments (also called actual parameters) are different variables and expressions passed to a procedure in a particular procedure call.

Here is a list of changes we need to make to support procedure declarations with parameters:

  1. Add the Param AST node

    class Param(AST):
        def __init__(self, var_node, type_node):
            self.var_node = var_node
            self.type_node = type_node
    
  2. Update the ProcedureDecl node’s constructor to take an additional argument: params

    class ProcedureDecl(AST):
        def __init__(self, proc_name, params, block_node):
            self.proc_name = proc_name
            self.params = params  # a list of Param nodes
            self.block_node = block_node
    
  3. Update the declarations rule to reflect changes in the procedure declaration sub-rule

    def declarations(self):
        """declarations : (VAR (variable_declaration SEMI)+)*
                        | (PROCEDURE ID (LPAREN formal_parameter_list RPAREN)? SEMI block SEMI)*
                        | empty
        """
    
  4. Add the formal_parameter_list rule and method

    def formal_parameter_list(self):
        """ formal_parameter_list : formal_parameters
                                  | formal_parameters SEMI formal_parameter_list
        """
    
  5. Add the formal_parameters rule and method

    def formal_parameters(self):
        """ formal_parameters : ID (COMMA ID)* COLON type_spec """
        param_nodes = []
    

With the addition of the above methods and rules our parser will be able to parse procedure declarations like these (I’m not showing the body of declared procedures for brevity):

procedure Foo;

procedure Foo(a : INTEGER);

procedure Foo(a, b : INTEGER);

procedure Foo(a, b : INTEGER; c : REAL);

Let’s generate an AST for our sample program. Download genastdot.py and run the following command on the command line:

$ python genastdot.py nestedscopes02.pas > ast.dot && dot -Tpng -o ast.png ast.dot

Here is a picture of the generated AST:

You can see now that the ProcedureDecl node in the picture has the Param node as its child.

You can find the complete changes in the spi.py file. Spend some time and study the changes. You’ve done similar changes before; they should be pretty easy to understand and you should be able to implement them by yourself.

Procedure symbols

While we’re on the topic of procedure declarations, let’s also talk about procedure symbols.

As with variable declarations, and built-in type declarations, there is a separate category of symbols for procedures. Let’s create a separate symbol class for procedure symbols:

class ProcedureSymbol(Symbol):
    def __init__(self, name, params=None):
        super(ProcedureSymbol, self).__init__(name)
        # a list of formal parameters
        self.params = params if params is not None else []

    def __str__(self):
        return '<{class_name}(name={name}, parameters={params})>'.format(
            class_name=self.__class__.__name__,
            name=self.name,
            params=self.params,
        )

    __repr__ = __str__

Procedure symbols have a name (it’s a procedure’s name), their category is procedure (it’s encoded in the class name), and the type is None because in Pascal procedures don’t return anything.

Procedure symbols also carry additional information about procedure declarations, namely they contain information about the procedure’s formal parameters as you can see in the code above.

With the addition of procedure symbols, our new symbol hierarchy looks like this:


Nested scopes

After that quick detour let’s get back to our program and the discussion of nested scopes:

program Main;
   var x, y: real;

   procedure Alpha(a : integer);
      var y : integer;
   begin
      x := a + x + y;
   end;

begin { Main }

end.  { Main }

Things are actually getting more interesting here. By declaring a new procedure, we introduce a new scope, and this scope is nested within the global scope introduced by the PROGRAM statement, so this is a case where we have nested scopes in a Pascal program.

The scope of a procedure is the whole body of the procedure. The beginning of the procedure scope is marked by the PROCEDURE keyword and the end is marked by the END keyword and a semicolon.

Let’s subscript names in the program and show some additional information:

Some observations from the picture above:

  • This Pascal program has two scope levels: level 1 and level 2
  • The nesting relationships diagram visually shows that the scope Alpha is nested within the global scope, hence there are two levels: the global scope at level 1, and the Alpha scope at level 2.
  • The scope level of the procedure declaration Alpha is one less than the level of the variables declared inside the procedure Alpha. You can see that the scope level of the procedure declaration Alpha is 1 and the scope level of the variables a and y inside the procedure is 2.
  • The variable declaration of y inside Alpha hides the declaration of y in the global scope. You can see the hole in the vertical bar for y1 (by the way, 1 is a subscript, it’s not part of the variable name, the variable name is just y) and you can see that the scope of the y2 variable declaration is the Alpha procedure’s whole body.
  • The scope information table, as you are already aware, shows scope levels, scope names for those levels, and respective names declared in those scopes (at those levels).
  • In the picture, you can also see that I omitted showing the scope of the integer and real types (except in the scope information table) because they are always declared at scope level 1, the global scope, so I won’t be subscripting the integer and real types anymore to save visual space, but you will see the types again and again in the contents of the scoped symbol table representing the global scope.

The next step is to discuss implementation details.

First, let’s focus on variable and procedure declarations. Then, we’ll discuss variable references and how name resolution works in the presence of nested scopes.

For our discussion, we’ll use a stripped down version of the program. The following version does not have variable references: it only has variable and procedure declarations:

program Main;
   var x, y: real;

   procedure Alpha(a : integer);
      var y : integer;
   begin

   end;

begin { Main }

end.  { Main }

You already know how to represent a scope in code with a scoped symbol table. Now we have two scopes: the global scope and the scope introduced by the procedure Alpha. Following our approach we should now have two scoped symbol tables: one for the global scope and one for the Alpha scope. How do we implement that in code? We’ll extend the semantic analyzer to create a separate scoped symbol table for every scope instead of just for the global scope. The scope construction will happen, as usual, when walking the AST.

First, we need to decide where in the semantic analyzer we’re going to create our scoped symbol tables. Recall that PROGRAM and PROCEDURE keywords introduce new scope. In AST, the corresponding nodes are Program and ProcedureDecl. So we’re going to update our visit_Program method and add the visit_ProcedureDecl method to create scoped symbol tables. Let’s start with the visit_Program method:

def visit_Program(self, node):
    print('ENTER scope: global')
    global_scope = ScopedSymbolTable(
        scope_name='global',
        scope_level=1,
    )
    self.current_scope = global_scope

    # visit subtree
    self.visit(node.block)

    print(global_scope)
    print('LEAVE scope: global')

The method has quite a few changes:

  1. When visiting the node in AST, we first print what scope we’re entering, in this case global.
  2. We create a separate scoped symbol table to represent the global scope. When we construct an instance of ScopedSymbolTable, we explicitly pass the scope name and scope level arguments to the class constructor.
  3. We assign the newly created scope to the instance variable current_scope. Other visitor methods that insert and look up symbols in scoped symbol tables will use the current_scope.
  4. We visit a subtree (block). This is the old part.
  5. Before leaving the global scope we print the contents of the global scope (scoped symbol table)
  6. We also print the message that we’re leaving the global scope

Now let’s add the visit_ProcedureDecl method. Here is the complete source code for it:

def visit_ProcedureDecl(self, node):
    proc_name = node.proc_name
    proc_symbol = ProcedureSymbol(proc_name)
    self.current_scope.insert(proc_symbol)

    print('ENTER scope: %s' %  proc_name)
    # Scope for parameters and local variables
    procedure_scope = ScopedSymbolTable(
        scope_name=proc_name,
        scope_level=2,
    )
    self.current_scope = procedure_scope

    # Insert parameters into the procedure scope
    for param in node.params:
        param_type = self.current_scope.lookup(param.type_node.value)
        param_name = param.var_node.value
        var_symbol = VarSymbol(param_name, param_type)
        self.current_scope.insert(var_symbol)
        proc_symbol.params.append(var_symbol)

    self.visit(node.block_node)

    print(procedure_scope)
    print('LEAVE scope: %s' %  proc_name)

Let’s go over the contents of the method:

  1. The first thing that the method does is create a procedure symbol and insert it into the current scope, which is the global scope for our sample program.
  2. Then the method prints the message about entering the procedure scope.
  3. Then we create a new scope for the procedure’s parameters and variable declarations.
  4. We assign the procedure scope to the self.current_scope variable indicating that this is our current scope and all symbol operations (insert and lookup) will use the current scope.
  5. Then we handle procedure formal parameters by inserting them into the current scope and adding them to the procedure symbol.
  6. Then we visit the rest of the AST subtree - the body of the procedure.
  7. And, finally, we print the message about leaving the scope before leaving the node and moving to another AST node, if any.

Now, what we need to do is update other semantic analyzer visitor methods to use self.current_scope when inserting and looking up symbols. Let’s do that:

def visit_VarDecl(self, node):
    type_name = node.type_node.value
    type_symbol = self.current_scope.lookup(type_name)

    # We have all the information we need to create a variable symbol.
    # Create the symbol and insert it into the symbol table.
    var_name = node.var_node.value
    var_symbol = VarSymbol(var_name, type_symbol)

    self.current_scope.insert(var_symbol)

def visit_Var(self, node):
    var_name = node.value
    var_symbol = self.current_scope.lookup(var_name)
    if var_symbol is None:
        raise Exception(
            "Error: Symbol(identifier) not found '%s'" % var_name
        )

Both the visit_VarDecl and visit_Var will now use the current_scope to insert and/or look up symbols. Specifically, for our sample program, the current_scope can point either to the global scope or the Alpha scope.

We also need to update the semantic analyzer and set the current_scope to None in the constructor:

class SemanticAnalyzer(NodeVisitor):
    def __init__(self):
        self.current_scope = None

Clone the GitHub repository for the article, run scope02.py (it has all the changes we just discussed), inspect the output, and make sure you understand why every line is generated:

$ python scope02.py
ENTER scope: global
Insert: INTEGER
Insert: REAL
Lookup: REAL
Insert: x
Lookup: REAL
Insert: y
Insert: Alpha
ENTER scope: Alpha
Insert: INTEGER
Insert: REAL
Lookup: INTEGER
Insert: a
Lookup: INTEGER
Insert: y


SCOPE (SCOPED SYMBOL TABLE)
===========================
Scope name     : Alpha
Scope level    : 2
Scope (Scoped symbol table) contents
------------------------------------
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      a: <VarSymbol(name='a', type='INTEGER')>
      y: <VarSymbol(name='y', type='INTEGER')>


LEAVE scope: Alpha


SCOPE (SCOPED SYMBOL TABLE)
===========================
Scope name     : global
Scope level    : 1
Scope (Scoped symbol table) contents
------------------------------------
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      x: <VarSymbol(name='x', type='REAL')>
      y: <VarSymbol(name='y', type='REAL')>
  Alpha: <ProcedureSymbol(name=Alpha, parameters=[<VarSymbol(name='a', type='INTEGER')>])>


LEAVE scope: global

Some things about the output above that I think are worth mentioning:

  1. You can see that the two lines Insert: INTEGER and Insert: REAL are repeated twice in the output and the keys INTEGER and REAL are present in both scopes (scoped symbol tables): global and Alpha. The reason is that we create a separate scoped symbol table for every scope and the table initializes the built-in type symbols every time we create its instance. We’ll change it later when we discuss nesting relationships and how they are expressed in code.
  2. See how the line Insert: Alpha is printed before the line ENTER scope: Alpha. This is just a reminder that a name of a procedure is declared at a level that is one less than the level of the variables declared in the procedure itself.
  3. You can see by inspecting the printed contents of the scoped symbol tables above what declarations they contain. See, for example, that global scope has the Alpha symbol in it.
  4. From the contents of the global scope you can also see that the procedure symbol for the Alpha procedure also contains the procedure’s formal parameters.

After we run the program, our scopes in memory would look something like this, just two separate scoped symbol tables:


Scope tree: Chaining scoped symbol tables

Okay, now every scope is represented by a separate scoped symbol table, but how do we represent the nesting relationship between the global scope and the scope Alpha as we showed in the nesting relationship diagram before? In other words, how do we express in code that the scope Alpha is nested within the global scope? The answer is chaining the tables together.

We’ll chain the scoped symbol tables together by creating a link between them. In a way it’ll be like a tree (we’ll call it a scope tree), just an unusual one, because in this tree a child will be pointing to a parent, and not the other way around. Let’s take a look the following scope tree:

In the scope tree above you can see that the scope Alpha is linked to the global scope by pointing to it. To put it differently, the scope Alpha is pointing to its enclosing scope, which is the global scope. It all means that the scope Alpha is nested within the global scope.

How do we implement scope chaining/linking? There are two steps:

  1. We need to update the ScopedSymbolTable class and add a variable enclosing_scope that will hold a pointer to the scope’s enclosing scope. This will be the link between scopes in the picture above.
  2. We need to update the visit_Program and visit_ProcedureDecl methods to create an actual link to the scope’s enclosing scope using the updated version of the ScopedSymbolTable class.

Let’s start with updating the ScopedSymbolTable class and adding the enclosing_scope field. Let’s also update the __init__ and __str__ methods. The __init__ method will be modified to accept a new parameter, enclosing_scope, with the default value set to None. The __str__ method will be updated to output the name of the enclosing scope. Here is the complete source code of the updated ScopedSymbolTable class:

class ScopedSymbolTable(object):
    def __init__(self, scope_name, scope_level, enclosing_scope=None):
        self._symbols = OrderedDict()
        self.scope_name = scope_name
        self.scope_level = scope_level
        self.enclosing_scope = enclosing_scope
        self._init_builtins()

    def _init_builtins(self):
        self.insert(BuiltinTypeSymbol('INTEGER'))
        self.insert(BuiltinTypeSymbol('REAL'))

    def __str__(self):
        h1 = 'SCOPE (SCOPED SYMBOL TABLE)'
        lines = ['\n', h1, '=' * len(h1)]
        for header_name, header_value in (
            ('Scope name', self.scope_name),
            ('Scope level', self.scope_level),
            ('Enclosing scope',
             self.enclosing_scope.scope_name if self.enclosing_scope else None
            )
        ):
            lines.append('%-15s: %s' % (header_name, header_value))
        h2 = 'Scope (Scoped symbol table) contents'
        lines.extend([h2, '-' * len(h2)])
        lines.extend(
            ('%7s: %r' % (key, value))
            for key, value in self._symbols.items()
        )
        lines.append('\n')
        s = '\n'.join(lines)
        return s

    __repr__ = __str__

    def insert(self, symbol):
        print('Insert: %s' % symbol.name)
        self._symbols[symbol.name] = symbol

    def lookup(self, name):
        print('Lookup: %s' % name)
        symbol = self._symbols.get(name)
        # 'symbol' is either an instance of the Symbol class or None
        return symbol

Now let’s switch our attention to the visit_Program method:

def visit_Program(self, node):
    print('ENTER scope: global')
    global_scope = ScopedSymbolTable(
        scope_name='global',
        scope_level=1,
        enclosing_scope=self.current_scope, # None
    )
    self.current_scope = global_scope

    # visit subtree
    self.visit(node.block)

    print(global_scope)

    self.current_scope = self.current_scope.enclosing_scope
    print('LEAVE scope: global')

There are a couple of things here worth mentioning and repeating:

  1. We explicitly pass the self.current_scope as the enclosing_scope argument when creating a scope
  2. We assign the newly created global scope to the variable self.current_scope
  3. We restore the variable self.current_scope to its previous value right before leaving the Program node. It’s important to restore the value of the current_scope after we’ve finished processing the node, otherwise the scope tree construction will be broken when we have more than two scopes in our program. We’ll see why shortly.

And, finally, let’s update the visit_ProcedureDecl method:

def visit_ProcedureDecl(self, node):
    proc_name = node.proc_name
    proc_symbol = ProcedureSymbol(proc_name)
    self.current_scope.insert(proc_symbol)

    print('ENTER scope: %s' %  proc_name)
    # Scope for parameters and local variables
    procedure_scope = ScopedSymbolTable(
        scope_name=proc_name,
        scope_level=self.current_scope.scope_level + 1,
        enclosing_scope=self.current_scope
    )
    self.current_scope = procedure_scope

    # Insert parameters into the procedure scope
    for param in node.params:
        param_type = self.current_scope.lookup(param.type_node.value)
        param_name = param.var_node.value
        var_symbol = VarSymbol(param_name, param_type)
        self.current_scope.insert(var_symbol)
        proc_symbol.params.append(var_symbol)

    self.visit(node.block_node)

    print(procedure_scope)

    self.current_scope = self.current_scope.enclosing_scope
    print('LEAVE scope: %s' %  proc_name)

Again, the main changes compared to the version in scope02.py are:

  1. We explicitly pass the self.current_scope as an enclosing_scope argument when creating a scope.
  2. We no longer hard code the scope level of a procedure declaration because we can calculate the level automatically based on the scope level of the procedure’s enclosing scope: it’s the enclosing scope’s level plus one.
  3. We restore the value of the self.current_scope to its previous value (for our sample program the previous value would be the global scope) right before leaving the ProcedureDecl node.

Okay, let’s see what the contents of the scoped symbol tables look like with the above changes. You can find all the changes in scope03a.py. Our sample program is:

program Main;
   var x, y: real;

   procedure Alpha(a : integer);
      var y : integer;
   begin

   end;

begin { Main }

end.  { Main }

Run scope03a.py on the command line and inspect the output:

$ python scope03a.py
ENTER scope: global
Insert: INTEGER
Insert: REAL
Lookup: REAL
Insert: x
Lookup: REAL
Insert: y
Insert: Alpha
ENTER scope: Alpha
Insert: INTEGER
Insert: REAL
Lookup: INTEGER
Insert: a
Lookup: INTEGER
Insert: y


SCOPE (SCOPED SYMBOL TABLE)
===========================
Scope name     : Alpha
Scope level    : 2
Enclosing scope: global
Scope (Scoped symbol table) contents
------------------------------------
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      a: <VarSymbol(name='a', type='INTEGER')>
      y: <VarSymbol(name='y', type='INTEGER')>


LEAVE scope: Alpha


SCOPE (SCOPED SYMBOL TABLE)
===========================
Scope name     : global
Scope level    : 1
Enclosing scope: None
Scope (Scoped symbol table) contents
------------------------------------
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      x: <VarSymbol(name='x', type='REAL')>
      y: <VarSymbol(name='y', type='REAL')>
  Alpha: <ProcedureSymbol(name=Alpha, parameters=[<VarSymbol(name='a', type='INTEGER')>])>


LEAVE scope: global

You can see in the output above that the global scope doesn’t have an enclosing scope and, the Alpha‘s enclosing scope is the global scope, which is what we would expect, because the Alpha scope is nested within the global scope.


Now, as promised, let’s consider why it is important to set and restore the value of the self.current_scope variable. Let’s take a look at the following program, where we have two procedure declarations in the global scope:

program Main;
   var x, y : real;

   procedure AlphaA(a : integer);
      var y : integer;
   begin { AlphaA }

   end;  { AlphaA }

   procedure AlphaB(a : integer);
      var b : integer;
   begin { AlphaB }

   end;  { AlphaB }

begin { Main }

end.  { Main }

The nesting relationship diagram for the sample program looks like this:

An AST for the program (I left only the nodes that are relevant to this example) is something like this:

If we don’t restore the current scope when we leave the Program and ProcedureDecl nodes what is going to happen? Let’s see.

The way our semantic analyzer walks the tree is depth first, left-to-right, so it will traverse the ProcedureDecl node for AlphaA first and then it will visit the ProcedureDecl node for AlphaB. The problem here is that if we don’t restore the self.current_scope before leaving AlphaA the self.current_scope will be left pointing to AlphaA instead of the global scope and, as a result, the semantic analyzer will create the scope AlphaB at level 3, as if it was nested within the scope AlphaA, which is, of course, incorrect.

To see the broken behavior when the current scope is not being restored before leaving Program and/or ProcedureDecl nodes, download and run the scope03b.py on the command line:

$ python scope03b.py
ENTER scope: global
Insert: INTEGER
Insert: REAL
Lookup: REAL
Insert: x
Lookup: REAL
Insert: y
Insert: AlphaA
ENTER scope: AlphaA
Insert: INTEGER
Insert: REAL
Lookup: INTEGER
Insert: a
Lookup: INTEGER
Insert: y


SCOPE (SCOPED SYMBOL TABLE)
===========================
Scope name     : AlphaA
Scope level    : 2
Enclosing scope: global
Scope (Scoped symbol table) contents
------------------------------------
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      a: <VarSymbol(name='a', type='INTEGER')>
      y: <VarSymbol(name='y', type='INTEGER')>


LEAVE scope: AlphaA
Insert: AlphaB
ENTER scope: AlphaB
Insert: INTEGER
Insert: REAL
Lookup: INTEGER
Insert: a
Lookup: INTEGER
Insert: b


SCOPE (SCOPED SYMBOL TABLE)
===========================
Scope name     : AlphaB
Scope level    : 3
Enclosing scope: AlphaA
Scope (Scoped symbol table) contents
------------------------------------
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      a: <VarSymbol(name='a', type='INTEGER')>
      b: <VarSymbol(name='b', type='INTEGER')>


LEAVE scope: AlphaB


SCOPE (SCOPED SYMBOL TABLE)
===========================
Scope name     : global
Scope level    : 1
Enclosing scope: None
Scope (Scoped symbol table) contents
------------------------------------
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      x: <VarSymbol(name='x', type='REAL')>
      y: <VarSymbol(name='y', type='REAL')>
 AlphaA: <ProcedureSymbol(name=AlphaA, parameters=[<VarSymbol(name='a', type='INTEGER')>])>


LEAVE scope: global

As you can see, scope tree construction in our semantic analyzer is completely broken in the presence of more than two scopes:

  1. Instead of two scope levels as shown in the nesting relationships diagram, we have three levels
  2. The global scope contents doesn’t have AlphaB in it, only AlphaA.


To construct a scope tree correctly, we need to follow a really simple procedure:

  1. When we ENTER a Program or ProcedureDecl node, we create a new scope and assign it to the self.current_scope.
  2. When we are about to LEAVE the Program or ProcedureDecl node, we restore the value of the self.current_scope.

You can think of the self.current_scope as a stack pointer and a scope tree as a collection of stacks:

  1. When you visit a Program or ProcedureDecl node, you push a new scope on the stack and adjust the stack pointer self.current_scope to point to the top of stack, which is now the most recently pushed scope.
  2. When you are about to leave the node, you pop the scope off the stack and you also adjust the stack pointer to point to the previous scope on the stack, which is now the new top of stack.

To see the correct behavior in the presence of multiple scopes, download and run scope03c.py on the command line. Study the output. Make sure you understand what is going on:

$ python scope03c.py
ENTER scope: global
Insert: INTEGER
Insert: REAL
Lookup: REAL
Insert: x
Lookup: REAL
Insert: y
Insert: AlphaA
ENTER scope: AlphaA
Insert: INTEGER
Insert: REAL
Lookup: INTEGER
Insert: a
Lookup: INTEGER
Insert: y


SCOPE (SCOPED SYMBOL TABLE)
===========================
Scope name     : AlphaA
Scope level    : 2
Enclosing scope: global
Scope (Scoped symbol table) contents
------------------------------------
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      a: <VarSymbol(name='a', type='INTEGER')>
      y: <VarSymbol(name='y', type='INTEGER')>


LEAVE scope: AlphaA
Insert: AlphaB
ENTER scope: AlphaB
Insert: INTEGER
Insert: REAL
Lookup: INTEGER
Insert: a
Lookup: INTEGER
Insert: b


SCOPE (SCOPED SYMBOL TABLE)
===========================
Scope name     : AlphaB
Scope level    : 2
Enclosing scope: global
Scope (Scoped symbol table) contents
------------------------------------
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      a: <VarSymbol(name='a', type='INTEGER')>
      b: <VarSymbol(name='b', type='INTEGER')>


LEAVE scope: AlphaB


SCOPE (SCOPED SYMBOL TABLE)
===========================
Scope name     : global
Scope level    : 1
Enclosing scope: None
Scope (Scoped symbol table) contents
------------------------------------
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      x: <VarSymbol(name='x', type='REAL')>
      y: <VarSymbol(name='y', type='REAL')>
 AlphaA: <ProcedureSymbol(name=AlphaA, parameters=[<VarSymbol(name='a', type='INTEGER')>])>
 AlphaB: <ProcedureSymbol(name=AlphaB, parameters=[<VarSymbol(name='a', type='INTEGER')>])>


LEAVE scope: global

This is how our scoped symbol tables look like after we’ve run scope03c.py and correctly constructed the scope tree:

Again, as I’ve mentioned above, you can think of the scope tree above as a collection of scope stacks.

Now let’s continue and talk about how name resolution works when we have nested scopes.


Nested scopes and name resolution

Our focus before was on variable and procedure declarations. Let’s add variable references to the mix.

Here is a sample program with some variable references in it:

program Main;
   var x, y: real;

   procedure Alpha(a : integer);
      var y : integer;
   begin
      x := a + x + y;
   end;

begin { Main }

end.  { Main }

Or visually with some additional information:


Let’s turn our attention to the assignment statement x := a + x + y; Here it is with subscripts:

We see that x resolves to a declaration at level 1, a resolves to a declaration at level 2 and y also resolves to a declaration at level 2. How does that resolution work? Let’s see how.

Lexically (statically) scoped languages like Pascal follow the most closely nested scope rule when it comes to name resolution. It means that, in every scope, a name refers to its lexically closest declaration. For our assignment statement, let’s go over every variable reference and see how the rule works in practice:

  1. Because our semantic analyzer visits the right-hand side of the assignment first, we’ll start with the variable reference a from the arithmetic expression a + x + y. We begin our search for a‘s declaration in the lexically closest scope, which is the Alpha scope. The Alpha scope contains variable declarations in the Alpha procedure including the procedure’s formal parameters. We find the declaration of a in the Alpha scope: it’s the formal parameter a of the Alpha procedure - a variable symbol that has type integer. We usually do the search by scanning the source code with our eyes when resolving names (remember, a2 is not the name of a variable, 2 is the subscript here, the variable name is a):

  2. Now onto the variable reference x from the arithmetic expression a + x + y. Again, first we search for the declaration of x in the lexically closest scope. The lexically closest scope is the Alpha scope at level 2. The scope contains declarations in the Alpha procedure including the procedure’s formal parameters. We don’t find x at this scope level (in the Alpha scope), so we go up the chain to the global scope and continue our search there. Our search succeeds because the global scope has a variable symbol with the name x in it:

  3. Now, let’s look at the variable reference y from the arithmetic expression a + x + y. We find its declaration in the lexically closest scope, which is the Alpha scope. In the Alpha scope the variable y has type integer (if there weren’t a declaration for y in the Alpha scope we would scan the text and find y in the outer/global scope and it would have real type in that case):

  4. And, finally, the variable x from the left hand side of the assignment statement x := a + x + y; It resolves to the same declaration as the variable reference x in the arithmetic expression on the right-hand side:

How do we implement that behavior of looking in the current scope, and then looking in the enclosing scope, and so on until we either find the symbol we’re looking for or we’ve reached the top of the scope tree and there are no more scopes left? We simply need to extend the lookup method in the ScopedSymbolTable class to continue its search up the chain in the scope tree:

def lookup(self, name):
    print('Lookup: %s. (Scope name: %s)' % (name, self.scope_name))
    # 'symbol' is either an instance of the Symbol class or None
    symbol = self._symbols.get(name)

    if symbol is not None:
        return symbol

    # recursively go up the chain and lookup the name
    if self.enclosing_scope is not None:
        return self.enclosing_scope.lookup(name)

The way the updated lookup method works:

  1. Search for a symbol by name in the current scope. If the symbol is found, then return it.
  2. If the symbol is not found, recursively traverse the tree and search for the symbol in the scopes up the chain. You don’t have to do the lookup recursively, you can rewrite it into an iterative form; the important part is to follow the link from a nested scope to its enclosing scope and search for the symbol there and up the tree until either the symbol is found or there are no more scopes left because you’ve reached the top of the scope tree.
  3. The lookup method also prints the scope name, in parenthesis, where the lookup happens to make it clearer that lookup goes up the chain to search for a symbol, if it can’t find it in the current scope.

Let’s see what our semantic analyzer outputs for our sample program now that we’ve modified the way the lookup searches the scope tree for a symbol. Download scope04a.py and run it on the command line:

$ python scope04a.py
ENTER scope: global
Insert: INTEGER
Insert: REAL
Lookup: REAL. (Scope name: global)
Insert: x
Lookup: REAL. (Scope name: global)
Insert: y
Insert: Alpha
ENTER scope: Alpha
Lookup: INTEGER. (Scope name: Alpha)
Lookup: INTEGER. (Scope name: global)
Insert: a
Lookup: INTEGER. (Scope name: Alpha)
Lookup: INTEGER. (Scope name: global)
Insert: y
Lookup: a. (Scope name: Alpha)
Lookup: x. (Scope name: Alpha)
Lookup: x. (Scope name: global)
Lookup: y. (Scope name: Alpha)
Lookup: x. (Scope name: Alpha)
Lookup: x. (Scope name: global)


SCOPE (SCOPED SYMBOL TABLE)
===========================
Scope name     : Alpha
Scope level    : 2
Enclosing scope: global
Scope (Scoped symbol table) contents
------------------------------------
      a: <VarSymbol(name='a', type='INTEGER')>
      y: <VarSymbol(name='y', type='INTEGER')>


LEAVE scope: Alpha


SCOPE (SCOPED SYMBOL TABLE)
===========================
Scope name     : global
Scope level    : 1
Enclosing scope: None
Scope (Scoped symbol table) contents
------------------------------------
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      x: <VarSymbol(name='x', type='REAL')>
      y: <VarSymbol(name='y', type='REAL')>
  Alpha: <ProcedureSymbol(name=Alpha, parameters=[<VarSymbol(name='a', type='INTEGER')>])>


LEAVE scope: global

Inspect the output above and pay attention to the ENTER and Lookup messages. A couple of things worth mentioning here:

  1. Notice how the semantic analyzer looks up the INTEGER built-in type symbol before inserting the variable symbol a. It searches INTEGER first in the current scope, Alpha, doesn’t find it, then goes up the tree all the way to the global scope, and finds the symbol there:

    ENTER scope: Alpha
    Lookup: INTEGER. (Scope name: Alpha)
    Lookup: INTEGER. (Scope name: global)
    Insert: a
    
  2. Notice also how the analyzer resolves variable references from the assignment statement x := a + x + y:

    Lookup: a. (Scope name: Alpha)
    Lookup: x. (Scope name: Alpha)
    Lookup: x. (Scope name: global)
    Lookup: y. (Scope name: Alpha)
    Lookup: x. (Scope name: Alpha)
    Lookup: x. (Scope name: global)
    

    The analyzer starts its search in the current scope and then goes up the tree all the way to the global scope.

Let’s also see what happens when a Pascal program has a variable reference that doesn’t resolve to a variable declaration as in the sample program below:

program Main;
   var x, y: real;

   procedure Alpha(a : integer);
      var y : integer;
   begin
      x := b + x + y; { ERROR here! }
   end;

begin { Main }

end.  { Main }

Download scope04b.py and run it on the command line:

$ python scope04b.py
ENTER scope: global
Insert: INTEGER
Insert: REAL
Lookup: REAL. (Scope name: global)
Insert: x
Lookup: REAL. (Scope name: global)
Insert: y
Insert: Alpha
ENTER scope: Alpha
Lookup: INTEGER. (Scope name: Alpha)
Lookup: INTEGER. (Scope name: global)
Insert: a
Lookup: INTEGER. (Scope name: Alpha)
Lookup: INTEGER. (Scope name: global)
Insert: y
Lookup: b. (Scope name: Alpha)
Lookup: b. (Scope name: global)
Error: Symbol(identifier) not found 'b'

As you can see, the analyzer tried to resolve the variable reference b and searched for it in the Alpha scope first, then the global scope, and, not being able to find a symbol with the name b, it threw the semantic error.

Okay great, now we know how to write a semantic analyzer that can analyze a program for semantic errors when the program has nested scopes.


Source-to-source compiler

Now, onto something completely different. Let’s write a source-to-source compiler! Why would we do it? Aren’t we talking about interpreters and nested scopes? Yes, we are, but let me explain why I think it might be a good idea to learn how to write a source-to-source compiler right now.

First, let’s talk about definitions. What is a source-to-source compiler? For the purpose of this article, let’s define a source-to-source compiler as a compiler that translates a program in some source language into a program in the same (or almost the same) source language.

So, if you write a translator that takes as an input a Pascal program and outputs a Pascal program, possibly modified, or enhanced, the translator in this case is called a source-to-source compiler.

A good example of a source-to-source compiler for us to study would be a compiler that takes a Pascal program as an input and outputs a Pascal-like program where every name is subscripted with a corresponding scope level, and, in addition to that, every variable reference also has a type indicator. So we want a source-to-source compiler that would take the following Pascal program:

program Main;
   var x, y: real;

   procedure Alpha(a : integer);
      var y : integer;
   begin
      x := a + x + y;
   end;

begin { Main }

end.  { Main }

and turn it into the following Pascal-like program:

program Main0;
   var x1 : REAL;
   var y1 : REAL;
   procedure Alpha1(a2 : INTEGER);
      var y2 : INTEGER;

   begin
      <x1:REAL> := <a2:INTEGER> + <x1:REAL> + <y2:INTEGER>;
   end; {END OF Alpha}

begin

end. {END OF Main}

Here is the list of modifications our source-to-source compiler should make to an input Pascal program:

  1. Every declaration should be printed on a separate line, so if we have multiple declarations in the input Pascal program, the compiled output should have each declaration on a separate line. We can see in the text above, for example, how the line var x, y : real; gets converted into multiple lines.
  2. Every name should get subscripted with a number corresponding to the scope level of the respective declaration.
  3. Every variable reference, in addition to being subscripted, should also be printed in the following form: <var_name_with_subscript:type>
  4. The compiler should also add a comment at the end of every block in the form {END OF … }, where the ellipses will get substituted either with a program name or procedure name. That will help us identify the textual boundaries of procedures faster.

As you can see from the generated output above, this source-to-source compiler could be a useful tool for understanding how name resolution works, especially when a program has nested scopes, because the output generated by the compiler would allow us to quickly see to what declaration and in what scope a certain variable reference resolves to. This is good help when learning about symbols, nested scopes, and name resolution.

How can we implement a source-to-source compiler like that? We have actually covered all the necessary parts to do it. All we need to do now is extend our semantic analyzer a bit to generate the enhanced output. You can see the full source code of the compiler here. It is basically a semantic analyzer on drugs, modified to generate and return strings for certain AST nodes.

Download src2srccompiler.py, study it, and experiment with it by passing it different Pascal programs as an input.

For the following program, for example:

program Main;
   var x, y : real;
   var z : integer;

   procedure AlphaA(a : integer);
      var y : integer;
   begin { AlphaA }
      x := a + x + y;
   end;  { AlphaA }

   procedure AlphaB(a : integer);
      var b : integer;
   begin { AlphaB }
   end;  { AlphaB }

begin { Main }
end.  { Main }

The compiler generates the following output:

$ python src2srccompiler.py nestedscopes03.pas
program Main0;
   var x1 : REAL;
   var y1 : REAL;
   var z1 : INTEGER;
   procedure AlphaA1(a2 : INTEGER);
      var y2 : INTEGER;

   begin
      <x1:REAL> := <a2:INTEGER> + <x1:REAL> + <y2:INTEGER>;
   end; {END OF AlphaA}
   procedure AlphaB1(a2 : INTEGER);
      var b2 : INTEGER;

   begin

   end; {END OF AlphaB}

begin

end. {END OF Main}

Cool beans and congratulations, now you know how to write a basic source-to-source compiler!

Use it to further your understanding of nested scopes, name resolution, and what you can do when you have an AST and some extra information about the program in the form of symbol tables.


Now that we have a useful tool to subscript our programs for us, let’s take a look at a bigger example of nested scopes that you can find in nestedscopes04.pas:

program Main;
   var b, x, y : real;
   var z : integer;

   procedure AlphaA(a : integer);
      var b : integer;

      procedure Beta(c : integer);
         var y : integer;

         procedure Gamma(c : integer);
            var x : integer;
         begin { Gamma }
            x := a + b + c + x + y + z;
         end;  { Gamma }

      begin { Beta }

      end;  { Beta }

   begin { AlphaA }

   end;  { AlphaA }

   procedure AlphaB(a : integer);
      var c : real;
   begin { AlphaB }
      c := a + b;
   end;  { AlphaB }

begin { Main }
end.  { Main }

Below you can see the declarations’ scopes, nesting relationships diagram, and scope information table:

Let’s run our source-to-source compiler and inspect the output. The subscripts should match the ones in the scope information table in the picture above:

$ python src2srccompiler.py nestedscopes04.pas
program Main0;
   var b1 : REAL;
   var x1 : REAL;
   var y1 : REAL;
   var z1 : INTEGER;
   procedure AlphaA1(a2 : INTEGER);
      var b2 : INTEGER;
      procedure Beta2(c3 : INTEGER);
         var y3 : INTEGER;
         procedure Gamma3(c4 : INTEGER);
            var x4 : INTEGER;

         begin
            <x4:INTEGER> := <a2:INTEGER> + <b2:INTEGER> + <c4:INTEGER> + <x4:INTEGER> + <y3:INTEGER> + <z1:INTEGER>;
         end; {END OF Gamma}

      begin

      end; {END OF Beta}

   begin

   end; {END OF AlphaA}
   procedure AlphaB1(a2 : INTEGER);
      var c2 : REAL;

   begin
      <c2:REAL> := <a2:INTEGER> + <b1:REAL>;
   end; {END OF AlphaB}

begin

end. {END OF Main}

Spend some time studying both the pictures and the output of the source-to-source compiler. Make sure you understand the following main points:

  • The way the vertical lines are drawn to show the scope of the declarations.
  • That a hole in a scope indicates that a variable is re-declared in a nested scope.
  • That AlphaA and AlphaB are declared in the global scope.
  • That AlphaA and AlphaB declarations introduce new scopes.
  • How scopes are nested within each other, and their nesting relationships.
  • Why different names, including variable references in assignment statements, are subscripted the way they are. In other words, how name resolution and specifically the lookup method of chained scoped symbol tables works.

Also run the following program:

$ python scope05.py nestedscopes04.pas

and inspect the contents of the chained scoped symbol tables and compare it with what you see in the scope information table in the picture above. And don’t forget about the genastdot.py, which you can use to generate a visual diagram of an AST to see how procedures are nested within each other in the tree.


Before we wrap up our discussion of nested scopes for today, recall that earlier we removed the semantic check that was checking source programs for duplicate identifiers. Let’s put it back. For the check to work in the presence of nested scopes and the new behavior of the lookup method, though, we need to make some changes. First, we need to update the lookup method and add an extra parameter that will allow us to limit our search to the current scope only:

def lookup(self, name, current_scope_only=False):
    print('Lookup: %s. (Scope name: %s)' % (name, self.scope_name))
    # 'symbol' is either an instance of the Symbol class or None
    symbol = self._symbols.get(name)

    if symbol is not None:
        return symbol

    if current_scope_only:
        return None

    # recursively go up the chain and lookup the name
    if self.enclosing_scope is not None:
        return self.enclosing_scope.lookup(name)

And second, we need to modify the visit_VarDecl method and add the check using our new current_scope_only parameter in the lookup method:

def visit_VarDecl(self, node):
    type_name = node.type_node.value
    type_symbol = self.current_scope.lookup(type_name)

    # We have all the information we need to create a variable symbol.
    # Create the symbol and insert it into the symbol table.
    var_name = node.var_node.value
    var_symbol = VarSymbol(var_name, type_symbol)

    # Signal an error if the table alrady has a symbol
    # with the same name
    if self.current_scope.lookup(var_name, current_scope_only=True):
        raise Exception(
            "Error: Duplicate identifier '%s' found" % var_name
        )

    self.current_scope.insert(var_symbol)

If we don’t limit the search for a duplicate identifier to the current scope, the lookup might find a variable symbol with the same name in an outer scope and, as a result, would throw an error, while in reality there was no semantic error to begin with.

Here is the output from running scope05.py with a program that doesn’t have duplicate identifier errors. You can notice below that the output has more lines in it, due to our duplicate identifier check that looks up for a duplicate name before inserting a new symbol:

$ python scope05.py nestedscopes02.pas
ENTER scope: global
Insert: INTEGER
Insert: REAL
Lookup: REAL. (Scope name: global)
Lookup: x. (Scope name: global)
Insert: x
Lookup: REAL. (Scope name: global)
Lookup: y. (Scope name: global)
Insert: y
Insert: Alpha
ENTER scope: Alpha
Lookup: INTEGER. (Scope name: Alpha)
Lookup: INTEGER. (Scope name: global)
Insert: a
Lookup: INTEGER. (Scope name: Alpha)
Lookup: INTEGER. (Scope name: global)
Lookup: y. (Scope name: Alpha)
Insert: y
Lookup: a. (Scope name: Alpha)
Lookup: x. (Scope name: Alpha)
Lookup: x. (Scope name: global)
Lookup: y. (Scope name: Alpha)
Lookup: x. (Scope name: Alpha)
Lookup: x. (Scope name: global)


SCOPE (SCOPED SYMBOL TABLE)
===========================
Scope name     : Alpha
Scope level    : 2
Enclosing scope: global
Scope (Scoped symbol table) contents
------------------------------------
      a: <VarSymbol(name='a', type='INTEGER')>
      y: <VarSymbol(name='y', type='INTEGER')>


LEAVE scope: Alpha


SCOPE (SCOPED SYMBOL TABLE)
===========================
Scope name     : global
Scope level    : 1
Enclosing scope: None
Scope (Scoped symbol table) contents
------------------------------------
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      x: <VarSymbol(name='x', type='REAL')>
      y: <VarSymbol(name='y', type='REAL')>
  Alpha: <ProcedureSymbol(name=Alpha, parameters=[<VarSymbol(name='a', type='INTEGER')>])>


LEAVE scope: global

Now, let’s take scope05.py for another test drive and see how it catches a duplicate identifier semantic error.

For example, for the following erroneous program with a duplicate declaration of a in the Alpha scope:

program Main;
   var x, y: real;

   procedure Alpha(a : integer);
      var y : integer;
      var a : real;  { ERROR here! }
   begin
      x := a + x + y;
   end;

begin { Main }

end.  { Main }

the program generates the following output:

$ python scope05.py dupiderror.pas
ENTER scope: global
Insert: INTEGER
Insert: REAL
Lookup: REAL. (Scope name: global)
Lookup: x. (Scope name: global)
Insert: x
Lookup: REAL. (Scope name: global)
Lookup: y. (Scope name: global)
Insert: y
Insert: Alpha
ENTER scope: Alpha
Lookup: INTEGER. (Scope name: Alpha)
Lookup: INTEGER. (Scope name: global)
Insert: a
Lookup: INTEGER. (Scope name: Alpha)
Lookup: INTEGER. (Scope name: global)
Lookup: y. (Scope name: Alpha)
Insert: y
Lookup: REAL. (Scope name: Alpha)
Lookup: REAL. (Scope name: global)
Lookup: a. (Scope name: Alpha)
Error: Duplicate identifier 'a' found

It caught the error as expected.

On this positive note, let’s wrap up our discussion of scopes, scoped symbol tables, and nested scopes for today.


Summary

We’ve covered a lot of ground. Let’s quickly recap what we learned in this article:

  • We learned about scopes, why they are useful, and how to implement them in code.
  • We learned about nested scopes and how chained scoped symbol tables are used to implement nested scopes.
  • We learned how to code a semantic analyzer that walks an AST, builds scoped symbols tables, chains them together, and does various semantic checks.
  • We learned about name resolution and how the semantic analyzer resolves names to their declarations using chained scoped symbol tables (scopes) and how the lookup method recursively goes up the chain in a scope tree to find a declaration corresponding to a certain name.
  • We learned that building a scope tree in the semantic analyzer involves walking an AST, “pushing” a new scope on top of a scoped symbol table stack when ENTERing a certain AST node and “popping” the scope off the stack when LEAVing the node, making a scope tree look like a collection of scoped symbol table stacks.
  • We learned how to write a source-to-source compiler, which can be a useful tool when learning about nested scopes, scope levels, and name resolution.


Exercises

Time for exercises, oh yeah!

  1. You’ve seen in the pictures throughout the article that the Main name in a program statement had subscript zero. I also mentioned that the program’s name is not in the global scope and it’s in some other outer scope that has level zero. Extend spi.py and create a builtins scope, a new scope at level 0, and move the built-in types INTEGER and REAL into that scope. For fun and practice, you can also update the code to put the program name into that scope as well.

  2. For the source program in nestedscopes04.pas do the following:

    1. Write down the source Pascal program on a piece of paper
    2. Subscript every name in the program indicating the scope level of the declaration the name resolves to.
    3. Draw vertical lines for every name declaration (variable and procedure) to visually show its scope. Don’t forget about scope holes and their meaning when drawing.
    4. Write a source-to-source compiler for the program without looking at the example source-to-source compiler in this article.
    5. Use the original src2srccompiler.py program to verify the output from your compiler and whether you subscripted the names correctly in the exercise (2.2).
  3. Modify the source-to-source compiler to add subscripts to the built-in types INTEGER and REAL

  4. Uncomment the following block in the spi.py

    # interpreter = Interpreter(tree)
    # result = interpreter.interpret()
    # print('')
    # print('Run-time GLOBAL_MEMORY contents:')
    # for k, v in sorted(interpreter.GLOBAL_MEMORY.items()):
    #     print('%s = %s' % (k, v))
    

    Run the interpreter with the part10.pas file as an input:

    $ python spi.py part10.pas
    

    Spot the problems and add the missing methods to the semantic analyzer.


That’s it for today. In the next article we’ll learn about runtime, call stack, implement procedure calls, and write our first version of a recursive factorial function. Stay tuned and see you soon!


If you’re interested, here is a list of books (affiliate links) I referred to most when preparing the article:

  1. Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages (Pragmatic Programmers)

  2. Engineering a Compiler, Second Edition

  3. Programming Language Pragmatics, Fourth Edition

  4. Compilers: Principles, Techniques, and Tools (2nd Edition)

  5. Writing Compilers and Interpreters: A Software Engineering Approach


All articles in this series:


Let’s Build A Simple Interpreter. Part 13: Semantic Analysis.

Date

Anything worth doing is worth overdoing.

Before doing a deep dive into the topic of scopes, I’d like to make a “quick” detour and talk in more detail about symbols, symbol tables, and semantic analysis. In the spirit of “Anything worth doing is worth overdoing”, I hope you’ll find the material useful for building a more solid foundation before tackling nested scopes. Today we will continue to increase our knowledge of how to write interpreters and compilers. You will see that some of the material covered in this article has parts that are much more extended versions of what you saw in Part 11, where we discussed symbols and symbol tables.


In the final four articles of the series we’ll discuss the remaining bits and pieces. Below you can see the major topics we will cover and the timeline:

Okay, let’s get started!


Introduction to semantic analysis

While our Pascal program can be grammatically correct and the parser can successfully build an abstract syntax tree, the program still can contain some pretty serious errors. To catch those errors we need to use the abstract syntax tree and the information from the symbol table.

Why can’t we check for those errors during parsing, that is, during syntax analysis? Why do we have to build an AST and something called the symbol table to do that?

In a nutshell, for convenience and the separation of concerns. By moving those extra checks into a separate phase, we can focus on one task at a time without making our parser and interpreter do more work than they are supposed to do.

When the parser has finished building the AST, we know that the program is grammatically correct; that is, that its syntax is correct according to our grammar rules and now we can separately focus on checking for errors that require additional context and information that the parser did not have at the time of building the AST. To make it more concrete, let’s take a look at the following Pascal assignment statement:

x := x + y;


The parser will handle it all right because, grammatically, the statement is correct (according to our previously defined grammar rules for assignment statements and expressions). But that’s not the end of the story yet, because Pascal has a requirement that variables must be declared with their corresponding types before they are used. How does the parser know whether x and y have been declared yet?

Well, it doesn’t and that’s why we need a separate semantic analysis phase to answer the question (among many others) of whether the variables have been declared prior to their use.

What is semantic analysis? Basically, it’s just a process to help us determine whether a program makes sense, and that it has meaning, according to a language definition.

What does it even mean for a program to make sense? It depends in large part on a language definition and language requirements.

Pascal language and, specifically, Free Pascal’s compiler, has certain requirements that, if not followed in a program, would lead to an error from the fpc compiler indicating that the program doesn’t “make sense”, that it is incorrect, even though the syntax might look okay. Here are some of those requirements:

  • The variables must be declared before they are used
  • The variables must have matching types when used in arithmetic expressions (this is a big part of semantic analysis called type checking that we’ll cover separately)
  • There should be no duplicate declarations (Pascal prohibits, for example, having a local variable in a procedure with the same name as one of the procedure’s formal parameters)
  • A name reference in a call to a procedure must refer to the actual declared procedure (It doesn’t make sense in Pascal if, in the procedure call foo(), the name foo refers to a variable foo of a primitive type INTEGER)
  • A procedure call must have the correct number of arguments and the arguments’ types must match those of formal parameters in the procedure declaration

It is much easier to enforce the above requirements when we have enough context about the program, namely, an intermediate representation in the form of an AST that we can walk and the symbol table with information about different program entities like variables, procedures, and functions.

After we implement the semantic analysis phase, the structure of our Pascal interpreter will look something like this:

From the picture above you can see that our lexer will get source code as an input, transform that into tokens that the parser will consume and use to verify that the program is grammatically correct, and then it will generate an abstract syntax tree that our new semantic analysis phase will use to enforce different Pascal language requirements. During the semantic analysis phase, the semantic analyzer will also build and use the symbol table. After the semantic analysis, our interpreter will take the AST, evaluate the program by walking the AST, and produce the program output.

Let’s get into the details of the semantic analysis phase.

Symbols and symbol tables

In the following section, we’re going to discuss how to implement some of the semantic checks and how to build the symbol table: in other words, we are going to discuss how to perform a semantic analysis of our Pascal programs. Keep in mind that even though semantic analysis sounds fancy and deep, it’s just another step after parsing our program and creating an AST to check the source program for some additional errors that the parser couldn’t catch due to a lack of additional information (context).

Today we’re going to focus on the following two static semantic checks*:

  1. That variables are declared before they are used
  2. That there are no duplicate variable declarations

*ASIDE: Static semantic checks are the checks that we can make before interpreting (evaluating) the program, that is, before calling the interpret method on an instance of the Interpreter class. All the Pascal requirements mentioned before can be enforced with static semantic checks by walking an AST and using information from the symbol table.

Dynamic semantic checks, on the other hand, would require checks to be performed during the interpretation (evaluation) of the program. For example, a check that there is no division by zero, and that an array index is not out of bounds would be a dynamic semantic check. Our focus today is on static semantic checks.

Let’s start with our first check and make sure that in our Pascal programs variables are declared before they are used. Take a look at the following syntactically correct but semantically incorrect program (ugh… too many hard to pronounce words in one sentence. :)

program Main;
   var x : integer;

begin
    x := y;
end.

The program above has one variable declaration and two variable references. You can see that in the picture below:

Let’s actually verify that our program is syntactically correct and that our parser doesn’t throw an error when parsing it. As they say, trust but verify. :) Download spi.py, fire off a Python shell, and see for yourself:

>>> from spi import Lexer, Parser
>>> text = """
program Main;
   var x : integer;

begin
    x := y;
end.
"""
>>>
>>> lexer = Lexer(text)
>>> parser = Parser(lexer)
>>> tree = parser.parse()
>>>

You see? No errors. We can even generate an AST diagram for that program using genastdot.py. First, save the source code into a file, let’s say semanticerror01.pas, and run the following commands:

$ python genastdot.py semanticerror01.pas > semanticerror01.dot
$ dot -Tpng -o ast.png semanticerror01.dot

Here is the AST diagram:

So, it is a grammatically (syntactically) correct program, but the program doesn’t make sense because we don’t even know what type the variable y has (that’s why we need declarations) and if it will make sense to assign y to x. What if y is a string, does it make sense to assign a string to an integer? It does not, at least not in Pascal.

So the program above has a semantic error because the variable y is not declared and we don’t know its type. In order for us to be able to catch errors like that, we need to learn how to check that variables are declared before they are used. So let’s learn how to do it.

Let’s take a closer look at the following syntactically and semantically correct sample program:

program Main;
   var x, y : integer;

begin
    x := x + y;
end.
  • It has two variable declarations: x and y
  • It also has three variable references (x, another x, and y) in the assignment statement x := x + y;

The program is grammatically correct, all the variables are declared, and we can see that adding two integers and assigning the result to an integer makes perfect sense. That’s great, but how do we programmatically check that the variables (variable references) x and y in the assignment statement x := x +y; have been declared?

We can do this in several steps by implementing the following algorithm:

  1. Go over all variable declarations
  2. For every variable declaration you encounter, collect all necessary information about the declared variable
  3. Store the collected information in some stash for future reference by using the variable’s name as a key
  4. When you see a variable reference, such as in the assignment statement x := x + y, search the stash by the variable’s name to see if the stash has any information about the variable. If it does, the variable has been declared. If it doesn’t, the variable hasn’t been declared yet, which is a semantic error.

This is what a flowchart of our algorithm could look like:

Before we can implement the algorithm, we need to answer several questions:

  • A. What information about variables do we need to collect?
  • B. Where and how should we store the collected information?
  • C. How do we implement the “go over all variable declarations” step?

Our plan of attack will be the following:

  1. Figure out answers to the questions A, B, and C above.
  2. Use the answers to A, B, and C to implement the steps in the algorithm for our first static semantic check: a check that variables are declared before they are used.

Okay, let’s get started.

Let’s find an answer to the question “What information about variables do we need to collect?”

So, what necessary information do we need to collect about a variable? Here are the important parts:

  • Name (we need to know the name of a declared variable because later we will be looking up variables by their names)
  • Category (we need to know what kind of an identifier it is: variable, type, procedure, and so on)
  • Type (we’ll need this information for type checking)

Symbols will hold that information (name, category, and type) about our variables. What’s a symbol? A symbol is an identifier of some program entity like a variable, subroutine, or built-in type.

In the following sample program we have two variable declarations that we will use to create two variable symbols: x, and y.

In the code, we’ll represent symbols with a class called Symbol that has fields name and type :

class Symbol(object):
    def __init__(self, name, type=None):
        self.name = name
        self.type = type

As you can see, the class takes the name parameter and an optional type parameter (not all symbols have type information associated with them, as we’ll see shortly).

What about the category? We will encode category into the class name. Alternatively, we could store the category of a symbol in the dedicated category field of the Symbol class as in:

class Symbol(object):
    def __init__(self, name, type=None):
        self.name = name
        self.type = type
        self.category = category

However, it’s more explicit to create a hierarchy of classes where the name of the class indicates its category.

Up until now I’ve sort of skirted around one topic, that of built-in types. If you look at our sample program again:

program Main;
   var x, y : integer;

begin
    x := x + y;
end.

You can see that variables x and y are declared as integers. What is the integer type? The integer type is another kind of symbol, a built-in type symbol. It’s called built-in because it doesn’t have to be declared explicitly in a Pascal program. It’s our interpreter’s responsibility to declare that type symbol and make it available to programmers:

We are going to make a separate class for built-in types called BuiltinTypeSymbol. Here is the class definition for our built-in types:

class BuiltinTypeSymbol(Symbol):
    def __init__(self, name):
        super(BuiltinTypeSymbol, self).__init__(name)

    def __str__(self):
        return self.name

    def __repr__(self):
        return "<{class_name}(name='{name}')>".format(
            class_name=self.__class__.__name__,
            name=self.name,
        )

The class BuiltinTypeSymbol inherits from the Symbol class, and its constructor requires only the name of the type, like integer or real. The ‘builtin type’ category is encoded in the class name, as we discussed earlier, and the type parameter from the base class is automatically set to None when we create a new instance of the BuiltinTypeSymbol class.

ASIDE

The double underscore or dunder (as in “Double UNDERscore”) methods __str__ and __repr__ are special Python methods. We’ve defined them to have a nice formatted message when we print a symbol object to standard output.

By the way, built-in types are the reason why the type parameter in the Scope class constructor is an optional parameter.

Here is our symbol class hierarchy so far:

Let’s play with the builtin types in a Python shell. Download the interpreter file and save it as spi.py; launch a python shell from the same directory where you saved the spi.py file, and play with the class we’ve just defined interactively:

$ python
>>> from spi import BuiltinTypeSymbol
>>> int_type = BuiltinTypeSymbol('integer')
>>> int_type
<BuiltinTypeSymbol(name='integer')>
>>>
>>> real_type = BuiltinTypeSymbol('real')
>>> real_type
<BuiltinTypeSymbol(name='real')>

That’s all there is to built-in type symbols for now. Now back to our variable symbols.

How can we represent them in code? Let’s create a VarSymbol class:

class VarSymbol(Symbol):
    def __init__(self, name, type):
        super(VarSymbol, self).__init__(name, type)

    def __str__(self):
        return "<{class_name}(name='{name}', type='{type}')>".format(
            class_name=self.__class__.__name__,
            name=self.name,
            type=self.type,
        )

    __repr__ = __str__

In this class, we made both the name and the type parameters required and the class name VarSymbol clearly indicates that an instance of the class will identify a variable symbol (the category is variable). The type parameter is an instance of the BuiltinTypeSymbol class.

Let’s go back to the interactive Python shell to see how we can manually construct instances of our variable symbols now that we know how to construct BuiltinTypeSymbol class instances:

$ python
>>> from spi import BuiltinTypeSymbol, VarSymbol
>>> int_type = BuiltinTypeSymbol('integer')
>>> real_type = BuiltinTypeSymbol('real')
>>>
>>> var_x_symbol = VarSymbol('x', int_type)
>>> var_x_symbol
<VarSymbol(name='x', type='integer')>
>>>
>>> var_y_symbol = VarSymbol('y', real_type)
>>> var_y_symbol
<VarSymbol(name='y', type='real')>
>>>

As you can see, we first create an instance of the built-in type symbol and then pass it as a second parameter to VarSymbol‘s constructor: variable symbols must have both a name and type associated with them as you’ve seen in various variable declarations like var x : integer;

And here is the complete hierarchy of symbols we’ve defined so far, in visual form:


Okay, now onto answering the question “Where and how should we store the collected information?”

Now that we have all the symbols representing all our variable declarations, where should we store those symbols so that we can search for them later when we encounter variable references (names)?

The answer is, as you probably already know, in the symbol table.

What is a symbol table? A symbol table is an abstract data type for tracking various symbols in source code. Think of it as a dictionary where the key is the symbol’s name and the value is an instance of the symbol class (or one of its subclasses). To represent the symbol table in code we’ll use a dedicated class for it aptly named SymbolTable. :) To store symbols in the symbol table we’ll add the insert method to our symbol table class. The method insert will take a symbol as a parameter and store it internally in the _symbols ordered dictionary using the symbol’s name as a key and the symbol instance as a value:

class SymbolTable(object):
    def __init__(self):
        self._symbols = OrderedDict()

    def __str__(self):
        symtab_header = 'Symbol table contents'
        lines = ['\n', symtab_header, '_' * len(symtab_header)]
        lines.extend(
            ('%7s: %r' % (key, value))
            for key, value in self._symbols.items()
        )
        lines.append('\n')
        s = '\n'.join(lines)
        return s

    __repr__ = __str__

    def insert(self, symbol):
        print('Insert: %s' % symbol.name)
        self._symbols[symbol.name] = symbol

Let’s manually populate our symbol table for the following sample program. Because we don’t know how to search our symbol table yet, our program won’t contain any variable references, only variable declarations:

program SymTab1;
   var x, y : integer;

begin

end.

Download symtab01.py, which contains our new SymbolTable class and run it on the command line. This is what the output looks like for our program above:

$ python symtab01.py
Insert: INTEGER
Insert: x
Insert: y


Symbol table contents
_____________________
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
      x: <VarSymbol(name='x', type='INTEGER')>
      y: <VarSymbol(name='y', type='INTEGER')>


And now let’s build and populate the symbol table manually in a Python shell:

$ python
>>> from symtab01 import SymbolTable, BuiltinTypeSymbol, VarSymbol
>>> symtab = SymbolTable()
>>> int_type = BuiltinTypeSymbol('INTEGER')
>>> # now let's store the built-in type symbol in the symbol table
...
>>> symtab.insert(int_type)
Insert: INTEGER
>>>
>>> symtab


Symbol table contents
_____________________
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>


>>> var_x_symbol = VarSymbol('x', int_type)
>>> symtab.insert(var_x_symbol)
Insert: x
>>> symtab


Symbol table contents
_____________________
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
      x: <VarSymbol(name='x', type='INTEGER')>


>>> var_y_symbol = VarSymbol('y', int_type)
>>> symtab.insert(var_y_symbol)
Insert: y
>>> symtab


Symbol table contents
_____________________
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
      x: <VarSymbol(name='x', type='INTEGER')>
      y: <VarSymbol(name='y', type='INTEGER')>


>>>


At this point we have answers to two questions that we asked earlier:

  • A. What information about variables do we need to collect?

    Name, category, and type. And we use symbols to hold that information.

  • B. Where and how should we store the collected information?

    We store collected symbols in the symbol table by using its insert method.


Now let’s find the answer to our third question: ‘How do we implement the “go over all variable declarations” step?’

This is a really easy one. Because we already have an AST built by our parser, we just need to create a new AST visitor class that will be responsible for walking over the tree and doing different actions when visiting VarDecl AST nodes!

Now we have answers to all three questions:

  • A. What information about variables do we need to collect?

    Name, category, and type. And we use symbols to hold that information.

  • B. Where and how should we store the collected information?

    We store collected symbols in the symbol table by using its insert method.

  • C. How do we implement the “go over all variable declarations” step?

    We will create a new AST visitor that will do some actions on visiting VarDecl AST nodes.


Let’s create a new tree visitor class and give it the name SemanticAnalyzer. Take a look the following sample program, for example:

program SymTab2;
   var x, y : integer;

begin

end.

To be able to analyze the program above, we don’t need to implement all visit_xxx methods, just a subset of them. Below is the skeleton for the SemanticAnalyzer class with enough visit_xxx methods to be able to successfully walk the AST of the sample program above:

class SemanticAnalyzer(NodeVisitor):
    def __init__(self):
        self.symtab = SymbolTable()

    def visit_Block(self, node):
        for declaration in node.declarations:
            self.visit(declaration)
        self.visit(node.compound_statement)

    def visit_Program(self, node):
        self.visit(node.block)

    def visit_Compound(self, node):
        for child in node.children:
            self.visit(child)

    def visit_NoOp(self, node):
        pass

    def visit_VarDecl(self, node):
        #  Actions go here
        pass

Now, we have all the pieces to implement the first three steps of our algorithm for our first static semantic check, the check that verifies that variables are declared before they are used.

Here are the steps of the algorithm again:

  1. Go over all variable declarations
  2. For every variable declaration you encounter, collect all necessary information about the declared variable
  3. Store the collected information in some stash for future references by using the variable’s name as a key
  4. When you see a variable reference such as in the assignment statement x := x + y, search the stash by the variable’s name to see if the stash has any information about the variable. If it does, the variable has been declared. If it doesn’t, the variable hasn’t been declared yet, which is a semantic error.


Let’s implement those steps. Actually, the only thing that we need to do is fill in the visit_VarDecl method of the SemanticAnalyzer class. Here it is, filled in:

def visit_VarDecl(self, node):
    # For now, manually create a symbol for the INTEGER built-in type
    # and insert the type symbol in the symbol table.
    type_symbol = BuiltinTypeSymbol('INTEGER')
    self.symtab.insert(type_symbol)

    # We have all the information we need to create a variable symbol.
    # Create the symbol and insert it into the symbol table.
    var_name = node.var_node.value
    var_symbol = VarSymbol(var_name, type_symbol)
    self.symtab.insert(var_symbol)

If you look at the contents of the method, you can see that it actually incorporates all three steps:

  1. The method will be called for every variable declaration once we’ve invoked the visit method of the SemanticAnalyzer instance. That covers Step 1 of the algorithm: “Go over all variable declarations”
  2. For every variable declaration, the method visit_VarDecl will collect the necessary information and create a variable symbol instance. That covers Step 2 of the algorithm: “For every variable declaration you encounter, collect all necessary information about the declared variable”
  3. The method visit_VarDecl will store the collected information about the variable declaration in the symbol table using the symbol table’s insert method. This covers Step 3 of the algorithm: “Store the collected information in some stash for future references by using the variable’s name as a key”

To see all of those steps in action, download file symtab02.py and study its source code first. Then run it on the command line and inspect the output:

$ python symtab02.py
Insert: INTEGER
Insert: x
Insert: INTEGER
Insert: y


Symbol table contents
_____________________
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
      x: <VarSymbol(name='x', type='INTEGER')>
      y: <VarSymbol(name='y', type='INTEGER')>

You might have noticed that there are two lines that say Insert: INTEGER. We will fix this situation in the following section where we’ll discuss the implementation of the final step (Step 4) of the semantic check algorithm.


Okay, let’s implement Step 4 of our algorithm. Here is an updated version of Step 4 to reflect the introduction of symbols and the symbol table: When you see a variable reference (name) such as in the assignment statement x := x + y, search the symbol table by the variable’s name to see if the table has a variable symbol associated with the name. If it does, the variable has been declared. If it doesn’t, the variable hasn’t been declared yet, which is a semantic error.

To implement Step 4, we need to make some changes to the symbol table and semantic analyzer:

  1. We need to add a method to our symbol table that will be able to look up a symbol by name.
  2. We need to update our semantic analyzer to look up a name in the symbol table every time it encounters a variable reference.

First, let’s update our SymbolTable class by adding the lookup method that will be responsible for searching for a symbol by name. In other words, the lookup method will be responsible for resolving a variable name (a variable reference) to its declaration. The process of mapping a variable reference to its declaration is called name resolution. And here is our lookup method that does just that, name resolution:

def lookup(self, name):
    print('Lookup: %s' % name)
    symbol = self._symbols.get(name)
    # 'symbol' is either an instance of the Symbol class or None
    return symbol

The method takes a symbol name as a parameter and returns a symbol if it finds it or None if it doesn’t. As simple as that.

While we’re at it, let’s also update our SymbolTable class to initialize built-in types. We’ll do that by adding a method _init_builtins and calling it in the SymbolTable‘s constructor. The _init_builtins method will insert a type symbol for integer and a type symbol for real into the symbol table.

Here is the full code for our updated SymbolTable class:

class SymbolTable(object):
    def __init__(self):
        self._symbols = OrderedDict()
        self._init_builtins()

    def _init_builtins(self):
        self.insert(BuiltinTypeSymbol('INTEGER'))
        self.insert(BuiltinTypeSymbol('REAL'))

    def __str__(self):
        symtab_header = 'Symbol table contents'
        lines = ['\n', symtab_header, '_' * len(symtab_header)]
        lines.extend(
            ('%7s: %r' % (key, value))
            for key, value in self._symbols.items()
        )
        lines.append('\n')
        s = '\n'.join(lines)
        return s

    __repr__ = __str__

    def insert(self, symbol):
        print('Insert: %s' % symbol.name)
        self._symbols[symbol.name] = symbol

    def lookup(self, name):
        print('Lookup: %s' % name)
        symbol = self._symbols.get(name)
        # 'symbol' is either an instance of the Symbol class or None
        return symbol


Now that we have built-in type symbols and the lookup method to search our symbol table when we encounter variable names (and other names like type names), let’s update the SemanticAnalyzer‘s visit_VarDecl method and replace the two lines where we were manually creating the INTEGER built-in type symbol and manually inserting it into the symbol table with code to look up the INTEGER type symbol.

The change will also fix the issue with that double output of the Insert: INTEGER line we’ve seen before.

Here is the visit_VarDecl method before the change:

def visit_VarDecl(self, node):
    # For now, manually create a symbol for the INTEGER built-in type
    # and insert the type symbol in the symbol table.
    type_symbol = BuiltinTypeSymbol('INTEGER')
    self.symtab.insert(type_symbol)

    # We have all the information we need to create a variable symbol.
    # Create the symbol and insert it into the symbol table.
    var_name = node.var_node.value
    var_symbol = VarSymbol(var_name, type_symbol)
    self.symtab.insert(var_symbol)

and after the change:

def visit_VarDecl(self, node):
    type_name = node.type_node.value
    type_symbol = self.symtab.lookup(type_name)

    # We have all the information we need to create a variable symbol.
    # Create the symbol and insert it into the symbol table.
    var_name = node.var_node.value
    var_symbol = VarSymbol(var_name, type_symbol)
    self.symtab.insert(var_symbol)

Let’s apply the changes to the familiar Pascal program that has only variable declarations:

program SymTab3;
   var x, y : integer;

begin

end.

Download the symtab03.py file that has all the changes we’ve just discussed, run it on the command line, and see that there is no longer a duplicate Insert: INTEGER line in the program output any more:

$ python symtab03.py
Insert: INTEGER
Insert: REAL
Lookup: INTEGER
Insert: x
Lookup: INTEGER
Insert: y


Symbol table contents
_____________________
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      x: <VarSymbol(name='x', type='INTEGER')>
      y: <VarSymbol(name='y', type='INTEGER')>

You can also see in the output above that our semantic analyzer looks up the INTEGER built-in type twice: first for the declaration of the variable x, and the second time for the declaration of the variable y.


Now let’s switch our attention to variable references (names) and how we can resolve a variable name, let’s say in an arithmetic expression, to its variable declaration (variable symbol). Let’s take a look at the following sample program, for example, that has an assignment statement x := x + y; with three variable references: x, another x, and y:

program SymTab4;
    var x, y : integer;

begin
    x := x + y;
end.

We already have the lookup method in our symbol table implementation. What we need to do now is extend our semantic analyzer so that every time it encounters a variable reference it would search the symbol table by the variable reference name using the symbol table’s lookup name. What method of the SemanticAnalyzer gets called every time a variable reference is encountered when the analyzer walks the AST? It’s the method visit_Var. Let’s add it to our class. It’s very simple: all it does is look up the variable symbol by name:

def visit_Var(self, node):
    var_name = node.value
    var_symbol = self.symtab.lookup(var_name)


Because our sample program SymTab4 has an assignment statement with arithmetic addition in its right hand side, we need to add two more methods to our SemanticAnalyzer so that it could actually walk the AST of the SymTab4 program and call the visit_Var method for all Var nodes. The methods we need to add are visit_Assign and visit_BinOp. They are nothing new: you’ve seen these methods before. Here they are:

def visit_Assign(self, node):
    # right-hand side
    self.visit(node.right)
    # left-hand side
    self.visit(node.left)

def visit_BinOp(self, node):
    self.visit(node.left)
    self.visit(node.right)


You can find the full source code with the changes we’ve just discussed in the file symtab04.py. Download the file, run it on the command line, and inspect the output produced for our sample program SymTab4 with an assignment statement.

Here is the output on my laptop:

$ python symtab04.py
Insert: INTEGER
Insert: REAL
Lookup: INTEGER
Insert: x
Lookup: INTEGER
Insert: y
Lookup: x
Lookup: y
Lookup: x


Symbol table contents
_____________________
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      x: <VarSymbol(name='x', type='INTEGER')>
      y: <VarSymbol(name='y', type='INTEGER')>

Spend some time analyzing the output and making sure you understand how and why the output is generated in that order.

At this point, we have implemented all of the steps of our algorithm for a static semantic check that verifies that all variables in the program are declared before they are used!

Semantic errors

So far we’ve looked at the programs that had their variables declared, but what if our program has a variable reference that doesn’t resolve to any declaration; that is, it’s not declared? That’s a semantic error and we need to extend our semantic analyzer to signal that error.

Take a look at the following semantically incorrect program, where the variable y is not declared but used in the assignment statement:

program SymTab5;
    var x : integer;

begin
    x := y;
end.

To signal the error, we need to modify our SemanticAnalyzer‘s visit_Var method to throw an exception if the lookup method cannot resolve a name to a symbol and returns None. Here is the updated code for visit_Var:

def visit_Var(self, node):
    var_name = node.value
    var_symbol = self.symtab.lookup(var_name)
    if var_symbol is None:
        raise Exception(
            "Error: Symbol(identifier) not found '%s'" % var_name
        )

Download symtab05.py, run it on the command line, and see what happens:

$ python symtab05.py
Insert: INTEGER
Insert: REAL
Lookup: INTEGER
Insert: x
Lookup: y
Error: Symbol(identifier) not found 'y'


Symbol table contents
_____________________
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      x: <VarSymbol(name='x', type='INTEGER')>

You can see the error message Error: Symbol(identifier) not found ‘y’ and the contents of the symbol table.

Congratulations on finishing the current version of our semantic analyzer that can statically check if variables in a program are declared before they are used, and if they are not, throws an exception indicating a semantic error!

Let’s pause for a second and celebrate this important milestone. Okay, the second is over and we need to move on to another static semantic check. For fun and profit let’s extend our semantic analyzer to check for duplicate identifiers in declarations.

Let’s take a look at the following program, SymTab6:

program SymTab6;
   var x, y : integer;
   var y : real;
begin
   x := x + y;
end.

Variable y has been declared twice: the first time as integer and the second time as real.

To catch that semantic error we need to modify our visit_VarDecl method to check whether the symbol table already has a symbol with the same name before inserting a new symbol. Here is our new version of the method:

def visit_VarDecl(self, node):
    type_name = node.type_node.value
    type_symbol = self.symtab.lookup(type_name)

    # We have all the information we need to create a variable symbol.
    # Create the symbol and insert it into the symbol table.
    var_name = node.var_node.value
    var_symbol = VarSymbol(var_name, type_symbol)

    # Signal an error if the table alrady has a symbol
    # with the same name
    if self.symtab.lookup(var_name) is not None:
        raise Exception(
            "Error: Duplicate identifier '%s' found" % var_name
        )

    self.symtab.insert(var_symbol)


File symtab06.py has all the changes. Download it and run it on the command line:

$ python symtab06.py
Insert: INTEGER
Insert: REAL
Lookup: INTEGER
Lookup: x
Insert: x
Lookup: INTEGER
Lookup: y
Insert: y
Lookup: REAL
Lookup: y
Error: Duplicate identifier 'y' found


Symbol table contents
_____________________
INTEGER: <BuiltinTypeSymbol(name='INTEGER')>
   REAL: <BuiltinTypeSymbol(name='REAL')>
      x: <VarSymbol(name='x', type='INTEGER')>
      y: <VarSymbol(name='y', type='INTEGER')>

Study the output and the contents of the symbol table. Make sure you understand what’s going on.


Summary

Let’s quickly recap what we learned today:

  • We learned more about symbols, symbol tables, and semantic analysis in general
  • We learned about name resolution and how the semantic analyzer resolves names to their declarations
  • We learned how to code a semantic analyzer that walks an AST, builds the symbol table, and does basic semantic checks


And, as a reminder, the structure of our interpreter now looks like this:


We’re done with semantic checks for today and we’re finally ready to tackle the topic of scopes, how they relate to symbol tables, and the topic of semantic checks in the presence of nested scopes. Those will be central topics of the next article. Stay tuned and see you soon!


All articles in this series:


Let’s Build A Simple Interpreter. Part 12.

Date

Be not afraid of going slowly; be afraid only of standing still.” - Chinese proverb.

Hello, and welcome back!

Today we are going to take a few more baby steps and learn how to parse Pascal procedure declarations.

What is a procedure declaration? A procedure declaration is a language construct that defines an identifier (a procedure name) and associates it with a block of Pascal code.

Before we dive in, a few words about Pascal procedures and their declarations:

  • Pascal procedures don’t have return statements. They exit when they reach the end of their corresponding block.
  • Pascal procedures can be nested within each other.
  • For simplicity reasons, procedure declarations in this article won’t have any formal parameters. But, don’t worry, we’ll cover that later in the series.

This is our test program for today:

PROGRAM Part12;
VAR
   a : INTEGER;

PROCEDURE P1;
VAR
   a : REAL;
   k : INTEGER;

   PROCEDURE P2;
   VAR
      a, z : INTEGER;
   BEGIN {P2}
      z := 777;
   END;  {P2}

BEGIN {P1}

END;  {P1}

BEGIN {Part12}
   a := 10;
END.  {Part12}

As you can see above, we have defined two procedures (P1 and P2) and P2 is nested within P1. In the code above, I used comments with a procedure’s name to clearly indicate where the body of every procedure begins and where it ends.

Our objective for today is pretty clear: learn how to parse a code like that.


First, we need to make some changes to our grammar to add procedure declarations. Well, let’s just do that!

Here is the updated declarations grammar rule:

The procedure declaration sub-rule consists of the reserved keyword PROCEDURE followed by an identifier (a procedure name), followed by a semicolon, which in turn is followed by a block rule, which is terminated by a semicolon. Whoa! This is a case where I think the picture is actually worth however many words I just put in the previous sentence! :)

Here is the updated syntax diagram for the declarations rule:

From the grammar and the diagram above you can see that you can have as many procedure declarations on the same level as you want. For example, in the code snippet below we define two procedure declarations, P1 and P1A, on the same level:

PROGRAM Test;
VAR
   a : INTEGER;

PROCEDURE P1;
BEGIN {P1}

END;  {P1}

PROCEDURE P1A;
BEGIN {P1A}

END;  {P1A}

BEGIN {Test}
   a := 10;
END.  {Test}

The diagram and the grammar rule above also indicate that procedure declarations can be nested because the procedure declaration sub-rule references the block rule which contains the declarations rule, which in turn contains the procedure declaration sub-rule. As a reminder, here is the syntax diagram and the grammar for the block rule from Part10:


Okay, now let’s focus on the interpreter components that need to be updated to support procedure declarations:

Updating the Lexer

All we need to do is add a new token named PROCEDURE:

PROCEDURE = 'PROCEDURE'

And add PROCEDURE to the reserved keywords. Here is the complete mapping of reserved keywords to tokens:

RESERVED_KEYWORDS = {
    'PROGRAM': Token('PROGRAM', 'PROGRAM'),
    'VAR': Token('VAR', 'VAR'),
    'DIV': Token('INTEGER_DIV', 'DIV'),
    'INTEGER': Token('INTEGER', 'INTEGER'),
    'REAL': Token('REAL', 'REAL'),
    'BEGIN': Token('BEGIN', 'BEGIN'),
    'END': Token('END', 'END'),
    'PROCEDURE': Token('PROCEDURE', 'PROCEDURE'),
}


Updating the Parser

Here is a summary of the parser changes:

  1. New ProcedureDecl AST node
  2. Update to the parser’s declarations method to support procedure declarations

Let’s go over the changes.

  1. The ProcedureDecl AST node represents a procedure declaration. The class constructor takes as parameters the name of the procedure and the AST node of the block of code that the procedure’s name refers to.

    class ProcedureDecl(AST):
        def __init__(self, proc_name, block_node):
            self.proc_name = proc_name
            self.block_node = block_node
    
  2. Here is the updated declarations method of the Parser class

    def declarations(self):
        """declarations : VAR (variable_declaration SEMI)+
                        | (PROCEDURE ID SEMI block SEMI)*
                        | empty
        """
        declarations = []
    
        if self.current_token.type == VAR:
            self.eat(VAR)
            while self.current_token.type == ID:
                var_decl = self.variable_declaration()
                declarations.extend(var_decl)
                self.eat(SEMI)
    
        while self.current_token.type == PROCEDURE:
            self.eat(PROCEDURE)
            proc_name = self.current_token.value
            self.eat(ID)
            self.eat(SEMI)
            block_node = self.block()
            proc_decl = ProcedureDecl(proc_name, block_node)
            declarations.append(proc_decl)
            self.eat(SEMI)
    
        return declarations
    

    Hopefully, the code above is pretty self-explanatory. It follows the grammar/syntax diagram for procedure declarations that you’ve seen earlier in the article.


Updating the SymbolTable builder

Because we’re not ready yet to handle nested procedure scopes, we’ll simply add an empty visit_ProcedureDecl method to the SymbolTreeBuilder AST visitor class. We’ll fill it out in the next article.

def visit_ProcedureDecl(self, node):
    pass


Updating the Interpreter

We also need to add an empty visit_ProcedureDecl method to the Interpreter class, which will cause our interpreter to silently ignore all our procedure declarations.

So far, so good.


Now that we’ve made all the necessary changes, let’s see what the Abstract Syntax Tree looks like with the new ProcedureDecl nodes.

Here is our Pascal program again (you can download it directly from GitHub):

PROGRAM Part12;
VAR
   a : INTEGER;

PROCEDURE P1;
VAR
   a : REAL;
   k : INTEGER;

   PROCEDURE P2;
   VAR
      a, z : INTEGER;
   BEGIN {P2}
      z := 777;
   END;  {P2}

BEGIN {P1}

END;  {P1}

BEGIN {Part12}
   a := 10;
END.  {Part12}


Let’s generate an AST and visualize it with the genastdot.py utility:

$ python genastdot.py part12.pas > ast.dot && dot -Tpng -o ast.png ast.dot

In the picture above you can see two ProcedureDecl nodes: ProcDecl:P1 and ProcDecl:P2 that correspond to procedures P1 and P2. Mission accomplished. :)

As a last item for today, let’s quickly check that our updated interpreter works as before when a Pascal program has procedure declarations in it. Download the interpreter and the test program if you haven’t done so yet, and run it on the command line. Your output should look similar to this:

$ python spi.py part12.pas
Define: INTEGER
Define: REAL
Lookup: INTEGER
Define: <a:INTEGER>
Lookup: a

Symbol Table contents:
Symbols: [INTEGER, REAL, <a:INTEGER>]

Run-time GLOBAL_MEMORY contents:
a = 10


Okay, with all that knowledge and experience under our belt, we’re ready to tackle the topic of nested scopes that we need to understand in order to be able to analyze nested procedures and prepare ourselves to handle procedure and function calls. And that’s exactly what we are going to do in the next article: dive deep into nested scopes. So don’t forget to bring your swimming gear next time! Stay tuned and see you soon!


All articles in this series: