In this post we'll cover building a type parser and integrating it with our AST. We'll also introduce the type checker and its role, and finally, we'll explore how symbol tables help us keep track of everything.
To illustrate this process, let's look at a simple example: we'll define a variable a
of type int
and assign it the value 100
.
a: int = 100;
The parseType
function is a Pratt parser, similar to the expression
parser we built in the last post. It takes a ParserApi
object and a SymbolTable
as arguments and returns the parsed type as an AST node. Our parser for now only needs to worry about the ident token.
export function parseType(
api: ParserApi<ScannerToken>,
symbolTable: SymbolTable,
) {
/**
* This helper function retrieves a symbol from the symbol table based on its name.
* It throws an error if the symbol is not found and adds a reference to the symbol.
*/
function expectSymbol(name: string, tk: Token<'ident'>) {
const symbol = symbolTable.get(name);
if (!symbol) throw api.error('Type not defined', tk);
const node: NodeMap['ident'] = { ...tk, symbol };
(symbol.references ||= []).push(node);
return node;
}
const parser = parserTable<NodeMap, ScannerToken>(
({ expect, expression, expectNode }) => ({
ident: {
prefix(n) {
return expectSymbol(text(n), n);
},
},
}),
);
return parser(api);
}
Our AST nodes follow a simple design principle: if a node can contain other nodes, it must have a children
property, which serves as an array holding all nodes parsed for the parent node. If a node contains more than two nodes, it should include dedicated properties for each of them. For instance, consider the def
node:
// node.ts
...
def: {
children: [NodeMap['ident'], Node] | [NodeMap['ident'], Node, Node];
left: NodeMap['ident'];
right: Node;
type?: Node;
};
...
The def
node represents a variable or function definition. In the example, a: int = 100;
, the left
property corresponds to the identifier a
, the right
property represents the initial value 100
, and the type
property holds the type declaration int
. The children
property will contain these three nodes: left
(identifier), type
, and right
(value), in that specific order.
We use the NodeMap['ident']
type to ensure that the left
operand is an identifier, allowing TypeScript to verify the correctness of our parser's node generation.
In the next section, we'll see how our parser generates the def
node with its left
, right
, and type
properties.
The definition
function lets us handle variable and function definitions, like a: int = 100
. These definitions are considered top-level statements within a function and can't be tucked inside other expressions.
/**
* Parses a definition statement. A definition statement can be in the form of:
* - `identifier = expression`
* - `var identifier = expression`
* - `identifier : type = expression`
* - `var identifier : type = expression`
*
* If the statement does not match a definition pattern, it returns undefined.
* This allows the caller to fallback to parsing a general expression.
*/
function definition(): NodeMap['def'] | undefined {
let tk = current();
let flags = 0;
if (tk.kind !== 'ident' && tk.kind !== 'var') return;
api.next();
if (tk.kind === 'var') {
flags = Flags.Variable;
tk = expect('ident');
}
const next = optional(':') || optional('=');
if (!next) {
if (flags) throw api.error('Expected definition', tk);
api.backtrack(tk);
return;
}
// Check if a symbol with this identifier already exists in the symbol table.
// If it does, and we are not explicitly defining a variable (with 'var'),
// this is likely an assignment expression, not a definition.
if (next.kind === '=' && !flags && symbolTable.get(text(tk))) {
api.backtrack(tk);
return;
}
const left = define(tk, flags);
let type;
if (next.kind === ':') {
type = expectNode(typeParser(), 'Expected type');
expect('=');
}
const right = expectNode(exprParser(), 'Expected expression');
return {
kind: 'def',
children: type ? [left, type, right] : [left, right],
left,
right,
type,
start: left.start,
end: right.end,
source: left.source,
line: left.line,
};
}
/**
* This is the entry point for parsing a statement.
* It first attempts to parse a definition (e.g., `var x = 5` or `x: number = 5`).
* If a definition is not found, it falls back to parsing a general expression.
*/
function statement() {
return definition() || exprParser();
}
This parser takes our a: int = 100
example and transforms it into an AST node. This node represents the variable definition, complete with its type and assigned value. The parser checks for the presence of the ':' token. If it's there, the parser calls our parseType
function to handle the type declaration and assigns the result to the type
property. When a type is present, the children
property of the resulting node will contain three child nodes: the identifier, the type, and the value.
The definition
function needs to be smart about what it considers a definition. For example, in an assignment like x = 10
, x
might already exist. We don't want to treat this as a new definition, but rather as an assignment to an existing variable.
To handle this, we added backtracking to our parser. If the definition
function encounters an identifier that already exists in the symbol table and the var
keyword isn't used, it backtracks. This allows the parser to re-evaluate the ident
token as part of an expression.
The backtrack
function in the scanner rewinds the input to a previous position, essentially "undoing" the scan. In the ParserApi
, the backtrack
function resets both the scanner and the current token.
// Scanner
function backtrack(pos: Position) {
index = pos.end;
endLine = line = pos.line;
}
// ParserApi (sdk/index.ts)
function backtrack(pos: Node) {
scan.backtrack(pos);
token = pos;
}
The symbol table keeps track of information about the program's symbols, including their types, scope, and where they're referenced. We'll use two separate symbol tables: one specifically for types and another for general symbols.
Our definition
parser uses the symbol table to create a new symbol for the variable a
. However, it doesn't assign a type to this symbol just yet, that's a job for our type checker. The parseType
function relies on the type symbol table. It checks to make sure that the type specified by the ident
node (in our example, int
) actually exists.
Our implementation uses a SymbolTable
function which can be initialized with a set of global symbols. We then define two specific symbol table instances: ProgramSymbolTable
and TypesSymbolTable
. The ProgramSymbolTable
will hold symbols for things like built-in constants and functions, while the TypesSymbolTable
will store information about available data types, like int
, bool
, etc.
export function SymbolTable(globals?: Symbol[]) {
const st = BaseSymbolTable<Symbol>();
if (globals) st.setSymbols(...globals);
return {
...st,
getRef(id: string, node: Node) {
const symbol = st.get(id);
if (symbol) {
(symbol.references ||= []).push(node);
}
return symbol;
},
};
}
export function ProgramSymbolTable() {
return SymbolTable([
// ...
{ name: 'true', kind: 'literal', flags: 0, type: BooleanType },
// ...
]);
}
export function TypesSymbolTable() {
return SymbolTable([
// ...
{ name: 'int', kind: 'type', flags: 0, type: IntegerType },
// ...
]);
}
With the AST in hand, we shift our focus to semantic analysis. This is where the type checker steps in. The type checker is responsible for ensuring the code's correctness by verifying type compatibility and identifying potential errors. We'll implement it as a separate phase, distinct from the parser, allowing us to reuse the parser in different contexts, such as a syntax highlighter or code formatter.
Our checker
function traverses the AST, resolving and assigning types to nodes based on their context and declarations.
export function checker({ root }: { root: Node; errors: CompilerError[] }) {
function resolveType(node: Node) {
if (node.kind === 'ident') {
const name = text(node);
const native = nativeTypeMap[name];
if (native) return native;
}
}
function getNodeType(node: CheckedNode) {
if (node[typeSymbol]) return node[typeSymbol];
if (node.kind === 'def') {
const type = node.type && resolveType(node.type);
if (type) node.left.symbol.type = type;
}
}
function checkEach(node: Node[]) {
node.forEach(check);
}
function check(node: Node) {
switch (node.kind) {
case 'root':
case 'main':
return checkEach(node.children);
case 'def':
getNodeType(node);
}
}
return {
run: () => check(root),
getNodeType,
};
}
When the checker
function hits our def
node, it calls getNodeType
. This function is designed to figure out and assign the correct type to a node. In our case, it sees the int
type declaration and uses the resolveType
function to grab the actual int
type object. Finally, it attaches this type information to the symbol representing our variable a
. This way, we know a
is supposed to be an integer.
This post covers parsing type annotations within variable definitions, integrating the parser with an AST, managing statements and backtracking, introducing the type checker, and exploring the role of symbol tables.
In future posts we'll explore more complex type constructs, such as function types, array types, and custom types. Stay tuned.