In this post we'll explore how a scanner breaks down source code into tokens, identifying keywords, numbers, and operators.
If you're curious about how compilers work, or exploring tools like interpreters and linters, understanding this step is critical.
The scanner, also known as the lexical analyzer, is responsible for taking your source code and breaking it down into smaller, more manageable pieces called tokens. These tokens represent elements such as identifiers, parentheses, and operators.
The source code for this version of the scanner is available here.
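For example, given the short program below, the scanner we build in this post would produce the following sequence of token kinds (whitespace is skipped):

// Input:
//   fn main() { var x = 42 }
//
// Token kinds produced, in order:
//   fn  main  (  )  {  var  ident  =  number  }  eof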
A Token will be represented by the following interface:
export interface Position {
  start: number;
  end: number;
  line: number;
  source: string;
}

export interface Token<Kind> extends Position {
  kind: Kind;
}
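For instance, scanning the input 42 might yield a token shaped like the following. This is an illustrative sketch: the source field holds the complete input text (as the error-handling code later in this post shows), and whether line starts at 0 or 1 depends on the scanner's initialization, which isn't included in the excerpts.

const token: Token<'number'> = {
  kind: 'number',
  start: 0, // offset of the token's first character
  end: 2, // offset just past the token's last character
  line: 1, // assuming line counting starts at 1
  source: '42', // the complete input text
};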
The next function consumes the input stream, one character at a time, and produces a sequence of tokens.
Thanks to TypeScript's type inference, the return type of our next function is a union of all possible tokens: Token<"eof"> | Token<"=="> | Token<"="> | ... | Token<...>
function next() {
  skipWhitespace();
  line = endLine;
  if (index >= length) return tk('eof', 0);
  const ch = source[index];
  const la = source[index + 1];
  switch (ch) {
    ...
  }
}
The function starts by skipping any leading whitespace. This preprocessing step simplifies token identification by avoiding unnecessary checks for spaces or line breaks.
If the function has reached the very end of the input, it returns a special eof token, indicating that tokenization is complete.
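The tk helper itself isn't included in the excerpts. A minimal sketch of what it might look like, assuming it builds a Token from the scanner's current position and then advances the index (the actual implementation may differ):

// Hypothetical sketch: create a token of the given kind spanning
// `consumed` characters from the current position, then advance past them.
function tk<Kind extends string>(kind: Kind, consumed: number): Token<Kind> {
  const token = { kind, start: index, end: index + consumed, line, source };
  index += consumed;
  return token;
}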
We use the current character and a lookahead (the next character) to recognize these operators. The scanner checks for two-character operators first, so the longest possible match always wins.
...
// 2-char operators
case '=':
  return la === '=' ? tk('==', 2) : tk('=', 1);
case '|':
  return la === '|' ? tk('||', 2) : tk('|', 1);
case '&':
  return la === '&' ? tk('&&', 2) : tk('&', 1);
case '>':
  return la === '='
    ? tk('>=', 2)
    : la === '>'
      ? tk('>>', 2)
      : tk('>', 1);
case '<':
  return la === '=' ? tk('<=', 2) : tk('<', 1);
// 1-char operators
case '{':
case '}':
case '.':
case ',':
case '?':
case ':':
case '*':
case '/':
case '~':
case '!':
case '(':
case ')':
case '+':
case '-':
  return tk(ch, 1);
There are two approaches for matching keywords: using regular expressions or employing a Trie structure.
The regular expression approach is straightforward and works by defining a pattern to match specific keywords, as shown below. It ensures that the match is followed by a non-word character or the end of the input.
const keywordRegex = /^(fn|var|main)(?:[^\w_]|$)/;
This method is simple to implement but may become less efficient as the list of keywords grows.
The function below uses a regular expression to match a keyword at the scanner's current position.
export type MapToToken<T extends string> = T extends infer U ? Token<U> : never;

function matchRegex<T extends string>(
  regex: RegExp,
): MapToToken<T> | undefined {
  const m = regex.exec(source.slice(index));
  const token = m && (m[1] ?? m[0]);
  return token ? (tk(token, token.length) as MapToToken<T>) : undefined;
}
We use the matchRegex function in the scanner's default case to determine whether the current text matches any of the keywords. This happens only after all earlier checks (like digits or operators) fail to identify a valid token. If a match is found, the keyword's token is returned.
...
default:
  const keywordToken = matchRegex<'fn' | 'var' | 'main'>(keywordRegex);
  if (keywordToken) return keywordToken;
...
The type parameter of matchRegex ensures that the returned value is strongly typed, corresponding to one of the token kinds. In the example, keywordToken is inferred to be the union type Token<'fn'> | Token<'var'> | Token<'main'> | undefined.
A Trie is a tree-like structure where each node represents a character in a keyword. By traversing the Trie one input character at a time, common prefixes among keywords are shared, so each character is examined at most once. The Trie can also stop matching as soon as the next character doesn't continue any valid keyword, which keeps lookups fast.
To construct a Trie, we use the function below:
export type TrieNode = { [K in string]: TrieNode } & { [TrieMatch]?: string };
const TrieMatch = Symbol('TrieMatch');

export function createTrie<T extends string>(...map: T[]) {
  const trie: TrieNode = {};
  for (const token of map) {
    let current: TrieNode = trie;
    for (const char of token) current = current[char] ??= {};
    current[TrieMatch] = token;
  }
  return trie;
}
This function takes an array of keywords and builds a Trie. Each character of a keyword maps to a nested object, creating a path within the Trie. When the final character of a keyword is inserted, a special marker (TrieMatch) is added to signify that the path forms a complete keyword.
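For example, building a Trie from the keywords used in this post produces a structure roughly like this:

const trie = createTrie('fn', 'var', 'main');
// {
//   f: { n: { [TrieMatch]: 'fn' } },
//   v: { a: { r: { [TrieMatch]: 'var' } } },
//   m: { a: { i: { n: { [TrieMatch]: 'main' } } } },
// }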
The scanner uses this Trie to match tokens at the current position. The following function creates a matcher that checks the input against the Trie and returns a token if it matches a keyword. It consumes characters one by one, verifying that the input forms a valid keyword and that the character after it is not part of a larger identifier. If a match is found, the matcher creates and returns the corresponding token; otherwise it returns undefined.
function createTrieMatcher<T extends string>(map: readonly T[], end: MatchFn) {
  const trie = createTrie(...map);
  return (): MapToToken<T> | undefined => {
    let ch = source[index];
    let consumed = 0;
    let node = trie;
    while ((node = node[ch])) {
      consumed++;
      ch = source[index + consumed];
      if (node[TrieMatch] && end(ch))
        return tk(node[TrieMatch], consumed) as MapToToken<T>;
    }
  };
}
Once we have the matcher, we integrate it into the scanning process. By placing the matcher in the default case of the scanner's switch block, we check for keywords only after all other token checks have failed.
export const keywords = ['fn', 'var', 'main'] as const;
const keywordMatcher = createTrieMatcher(keywords, notIdent);
...
default:
  const keywordToken = keywordMatcher();
  if (keywordToken) return keywordToken;
The following code snippet scans string literals enclosed in single quotes, handling escape sequences and line breaks, and creates the corresponding token.
...
case "'": {
  let n = 1;
  while (
    index + n < length &&
    (source[index + n] !== "'" || source[index + n - 1] === '\\')
  ) {
    if (source[index + n] === '\n') endLine++;
    n++;
  }
  if (source[index + n] !== "'") throw error('Unterminated string', n);
  return tk('string', n + 1);
}
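A few examples of inputs this case handles:

// 'hello'  → a 'string' token spanning 7 characters (quotes included)
// 'it\'s'  → the escaped quote does not terminate the string
// 'oops    → throws 'Unterminated string'
// Literal line breaks inside a string are allowed; endLine is incremented
// so subsequent tokens report the correct line.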
The following code handles comment tokenization. When the scanner encounters a '#', it loops until a newline character, capturing the entire comment content, and creates a comment token whose length is the total number of consumed characters.
...
case '#': {
  let n = 1;
  while (index + n < length && source[index + n] !== '\n') n++;
  return tk('comment', n);
}
Our scanner recognizes decimal integers, hexadecimal numbers (prefixed with 0x), and binary numbers (prefixed with 0b). It also supports decimal numbers, where a decimal point must be followed by at least one digit, and allows underscores as digit separators.
const digit = (ch: string) => (ch >= '0' && ch <= '9') || ch === '_';
const hexDigit = (ch: string) =>
  digit(ch) || (ch >= 'a' && ch <= 'f') || (ch >= 'A' && ch <= 'F');
const binaryDigit = (ch: string) => ch === '0' || ch === '1' || ch === '_';

function matchWhile(match: MatchFn, consumed = 0) {
  while (index + consumed < length && match(source[index + consumed]))
    consumed++;
  return consumed;
}
...
case '0':
  if (la === 'x') {
    const consumed = matchWhile(hexDigit, 2);
    if (consumed === 2 || ident(current(consumed)))
      throw error('Expected hexadecimal digit', consumed + 1);
    return tk('number', consumed);
  }
  if (la === 'b') {
    const consumed = matchWhile(binaryDigit, 2);
    if (consumed === 2 || ident(current(consumed)))
      throw error('Expected binary digit', consumed + 1);
    return tk('number', consumed);
  }
// Fall through: a plain '0' is scanned like any other digit.
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9': {
  let consumed = matchWhile(digit, 1);
  if (consumed && current(consumed) === '.') {
    const decimals = matchWhile(digit, ++consumed);
    if (decimals === consumed) throw error('Expected digit', consumed);
    consumed = decimals;
  }
  if (ident(current(consumed))) throw error('Expected digit', consumed + 1);
  return tk('number', consumed);
}
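To make the rules concrete, here are a few literals this logic accepts and rejects:

// Accepted: each produces a single 'number' token.
//   42        plain integer
//   1_000     underscores as digit separators
//   3.14      decimal point followed by digits
//   0x1F      hexadecimal
//   0b10_01   binary
//
// Rejected: each throws a CompilerError.
//   0x        'Expected hexadecimal digit'
//   0b2       'Expected binary digit'
//   3.        'Expected digit' (a decimal point must be followed by digits)
//   42abc     'Expected digit' (identifier character right after a number)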
The following code matches identifiers. It first checks whether the current character ch is a valid starting character for an identifier, as defined by the identFirst predicate. If it is, it uses the matchWhile function to consume the subsequent characters that satisfy the ident predicate (letters, digits, and underscores).
const identFirst = (ch: string) => ch === '_' || alpha(ch);
const ident = (ch: string) => ch === '_' || _ident.test(ch);
...
default:
  // Identifiers
  if (identFirst(ch)) return tk('ident', matchWhile(ident, 1));
  throw error(`Invalid character "${ch}"`, 1);
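The alpha helper and the _ident regular expression are not part of the excerpts; plausible definitions, offered here only as an assumption, could look like this:

// Assumed helpers (not shown in the original excerpts):
const alpha = (ch: string) =>
  (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
const _ident = /[a-zA-Z0-9]/; // '_' is checked separately by ident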
export class CompilerError {
  constructor(
    public message: string,
    public position: Position,
  ) {}
}
...
function error(message: string, consumed = 0, start = index) {
  index += consumed;
  return new CompilerError(message, { start, end: index, line, source });
}
...
When an error is encountered, the scanner calls the error function to create a new CompilerError object. This object encapsulates details about the error, such as:

- message: a human-readable description of what went wrong.
- position: where the error occurred, including the start and end offsets, the line number, and the source text, as defined by the Position interface.

The error function is invoked in the following error scenarios:

- An unterminated string literal.
- A missing digit after a 0x or 0b prefix.
- A missing digit after a decimal point, or an identifier character immediately following a number.
- A character that does not start any valid token.
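A sketch of how calling code might surface these errors, assuming a scan entry point like the one used at the end of this post:

// Hypothetical usage: scanning invalid input throws a CompilerError.
try {
  scan('0x').next(); // no hexadecimal digits after the 0x prefix
} catch (e) {
  if (e instanceof CompilerError)
    console.error(`${e.message} at line ${e.position.line}`);
}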
The each function converts our scanner into an iterable, so it can be used with for..of loops or with any other construct that supports the iteration protocol.
type ScanFn<Node extends Token<string>> = () => Node;

export function each<Node extends Token<string>>(scan: ScanFn<Node>) {
  return {
    [Symbol.iterator]() {
      return {
        next() {
          const value = scan();
          return value.kind === 'eof' ? { done: true, value } : { value };
        },
      };
    },
  };
}
for (const token of each(scan('Text').next)) console.log(token);
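For the input 'Text', the loop prints a single identifier token and stops at eof. The output would look roughly like this (exact field values depend on how line is initialized):

// { kind: 'ident', start: 0, end: 4, line: 1, source: 'Text' }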