Writing a Compiler: Parsing Types - Part 1 Writing a Compiler: Parsing Types - Part 1compiler parsing type-checker

In this post we'll cover building a type parser and integrating it with our AST. We'll also introduce the type checker and its role, and finally, we'll explore how symbol tables help us keep track of everything.

To illustrate this process, let's look at a simple example: we'll define a variable a of type int and assign it the value 100.

a: int = 100;

Implementing the Type Parser

The parseType function is a Pratt parser, similar to the expression parser we built in the last post. It takes a ParserApi object and a SymbolTable as arguments and returns the parsed type as an AST node. Our parser for now only needs to worry about the ident token.

export function parseType(
	api: ParserApi<ScannerToken>,
	symbolTable: SymbolTable,
) {
	/**
	 * This helper function retrieves a symbol from the symbol table based on its name.
	 * It throws an error if the symbol is not found and adds a reference to the symbol.
	 */
	function expectSymbol(name: string, tk: Token<'ident'>) {
		const symbol = symbolTable.get(name);
		if (!symbol) throw api.error('Type not defined', tk);
		const node: NodeMap['ident'] = { ...tk, symbol };
		(symbol.references ||= []).push(node);
		return node;
	}

	const parser = parserTable<NodeMap, ScannerToken>(
		({ expect, expression, expectNode }) => ({
			ident: {
				prefix(n) {
					return expectSymbol(text(n), n);
				},
			},
		}),
	);
	return parser(api);
}

Generating the AST

Our AST nodes follow a simple design principle: if a node can contain other nodes, it must have a children property, which serves as an array holding all nodes parsed for the parent node. If a node contains more than two nodes, it should include dedicated properties for each of them. For instance, consider the def node:

// node.ts
...
def: {
	children: [NodeMap['ident'], Node] | [NodeMap['ident'], Node, Node];
	left: NodeMap['ident'];
	right: Node;
	type?: Node;
};
...

The def node represents a variable or function definition. In the example, a: int = 100;, the left property corresponds to the identifier a, the right property represents the initial value 100, and the type property holds the type declaration int. The children property will contain these three nodes: left (identifier), type, and right (value), in that specific order.

We use the NodeMap['ident'] type to ensure that the left operand is an identifier, allowing TypeScript to verify the correctness of our parser's node generation.

In the next section, we'll see how our parser generates the def node with its left, right, and type properties.

Parsing Definition Statements

The definition function lets us handle variable and function definitions, like a: int = 100. These definitions are considered top-level statements within a function and can't be tucked inside other expressions.

/**
 * Parses a definition statement. A definition statement can be in the form of:
 *  - `identifier = expression`
 *  - `var identifier = expression`
 *  - `identifier : type = expression`
 *  - `var identifier : type = expression`
 *
 * If the statement does not match a definition pattern, it returns undefined.
 * This allows the caller to fallback to parsing a general expression.
 */
function definition(): NodeMap['def'] | undefined {
	let tk = current();
	let flags = 0;
	if (tk.kind !== 'ident' && tk.kind !== 'var') return;
	api.next();
	if (tk.kind === 'var') {
		flags = Flags.Variable;
		tk = expect('ident');
	}
	const next = optional(':') || optional('=');
	if (!next) {
		if (flags) throw api.error('Expected definition', tk);
		api.backtrack(tk);
		return;
	}

	// Check if a symbol with this identifier already exists in the symbol table.
	// If it does, and we are not explicitly defining a variable (with 'var'),
	// this is likely an assignment expression, not a definition.
	if (next.kind === '=' && !flags && symbolTable.get(text(tk))) {
		api.backtrack(tk);
		return;
	}

	const left = define(tk, flags);
	let type;
	if (next.kind === ':') {
		type = expectNode(typeParser(), 'Expected type');
		expect('=');
	}
	const right = expectNode(exprParser(), 'Expected expression');
	return {
		kind: 'def',
		children: type ? [left, type, right] : [left, right],
		left,
		right,
		type,
		start: left.start,
		end: right.end,
		source: left.source,
		line: left.line,
	};
}

/**
 * This is the entry point for parsing a statement.
 * It first attempts to parse a definition (e.g., `var x = 5` or `x: number = 5`).
 * If a definition is not found, it falls back to parsing a general expression.
 */
function statement() {
	return definition() || exprParser();
}

This parser takes our a: int = 100 example and transforms it into an AST node. This node represents the variable definition, complete with its type and assigned value. The parser checks for the presence of the ':' token. If it's there, the parser calls our parseType function to handle the type declaration and assigns the result to the type property. When a type is present, the children property of the resulting node will contain three child nodes: the identifier, the type, and the value.

Backtracking

The definition function needs to be smart about what it considers a definition. For example, in an assignment like x = 10, x might already exist. We don't want to treat this as a new definition, but rather as an assignment to an existing variable.

To handle this, we added backtracking to our parser. If the definition function encounters an identifier that already exists in the symbol table and the var keyword isn't used, it backtracks. This allows the parser to re-evaluate the ident token as part of an expression.

The backtrack function in the scanner rewinds the input to a previous position, essentially "undoing" the scan. In the ParserApi, the backtrack function resets both the scanner and the current token.

// Scanner
function backtrack(pos: Position) {
	index = pos.end;
	endLine = line = pos.line;
}

// ParserApi (sdk/index.ts)
function backtrack(pos: Node) {
	scan.backtrack(pos);
	token = pos;
}

Symbol Table

The symbol table keeps track of information about the program's symbols, including their types, scope, and where they're referenced. We'll use two separate symbol tables: one specifically for types and another for general symbols.

Our definition parser uses the symbol table to create a new symbol for the variable a. However, it doesn't assign a type to this symbol just yet, that's a job for our type checker. The parseType function relies on the type symbol table. It checks to make sure that the type specified by the ident node (in our example, int) actually exists.

Our implementation uses a SymbolTable function which can be initialized with a set of global symbols. We then define two specific symbol table instances: ProgramSymbolTable and TypesSymbolTable. The ProgramSymbolTable will hold symbols for things like built-in constants and functions, while the TypesSymbolTable will store information about available data types, like int, bool, etc.

export function SymbolTable(globals?: Symbol[]) {
	const st = BaseSymbolTable<Symbol>();

	if (globals) st.setSymbols(...globals);

	return {
		...st,
		getRef(id: string, node: Node) {
			const symbol = st.get(id);
			if (symbol) {
				(symbol.references ||= []).push(node);
			}
			return symbol;
		},
	};
}

export function ProgramSymbolTable() {
	return SymbolTable([
		// ...
		{ name: 'true', kind: 'literal', flags: 0, type: BooleanType },
		// ...
	]);
}

export function TypesSymbolTable() {
	return SymbolTable([
		// ...
		{ name: 'int', kind: 'type', flags: 0, type: IntegerType },
		// ...
	]);
}

Type Checker

With the AST in hand, we shift our focus to semantic analysis. This is where the type checker steps in. The type checker is responsible for ensuring the code's correctness by verifying type compatibility and identifying potential errors. We'll implement it as a separate phase, distinct from the parser, allowing us to reuse the parser in different contexts, such as a syntax highlighter or code formatter.

Our checker function traverses the AST, resolving and assigning types to nodes based on their context and declarations.

export function checker({ root }: { root: Node; errors: CompilerError[] }) {
	function resolveType(node: Node) {
		if (node.kind === 'ident') {
			const name = text(node);
			const native = nativeTypeMap[name];
			if (native) return native;
		}
	}

	function getNodeType(node: CheckedNode) {
		if (node[typeSymbol]) return node[typeSymbol];

		if (node.kind === 'def') {
			const type = node.type && resolveType(node.type);
			if (type) node.left.symbol.type = type;
		}
	}

	function checkEach(node: Node[]) {
		node.forEach(check);
	}

	function check(node: Node) {
		switch (node.kind) {
			case 'root':
			case 'main':
				return checkEach(node.children);
			case 'def':
				getNodeType(node);
		}
	}

	return {
		run: () => check(root),
		getNodeType,
	};
}

When the checker function hits our def node, it calls getNodeType. This function is designed to figure out and assign the correct type to a node. In our case, it sees the int type declaration and uses the resolveType function to grab the actual int type object. Finally, it attaches this type information to the symbol representing our variable a. This way, we know a is supposed to be an integer.

Conclusion

This post covers parsing type annotations within variable definitions, integrating the parser with an AST, managing statements and backtracking, introducing the type checker, and exploring the role of symbol tables.

In future posts we'll explore more complex type constructs, such as function types, array types, and custom types. Stay tuned.

Back to Main Page