In this post we'll explore how a scanner breaks down source code into tokens, identifying keywords, numbers, and operators.
If you're curious about how compilers work, or exploring tools like interpreters and linters, understanding this step is critical.
The scanner, also known as the lexical analyzer, is responsible for taking your source code and breaking it down into smaller, more manageable pieces called tokens. These tokens represent elements such as identifiers, parentheses, and operators.
The source code for this version of the scanner is available here.
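For example, given the short program below, the scanner we build in this post would produce the following sequence of token kinds (whitespace is skipped):

// Input:
//   fn main() { var x = 42 }
//
// Token kinds produced, in order:
//   fn  main  (  )  {  var  ident  =  number  }  eof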
A Token will be represented by the following interface:
export interface Position {
  start: number;
  end: number;
  line: number;
  source: string;
}

export interface Token<Kind> extends Position {
  kind: Kind;
}
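For instance, scanning the input 42 might yield a token shaped like the following. This is an illustrative sketch: the source field holds the complete input text (as the error-handling code later in this post shows), and whether line starts at 0 or 1 depends on the scanner's initialization, which isn't included in the excerpts.

const token: Token<'number'> = {
  kind: 'number',
  start: 0, // offset of the token's first character
  end: 2, // offset just past the token's last character
  line: 1, // assuming line counting starts at 1
  source: '42', // the complete input text
};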
The next function consumes the input stream, one character at a time, and produces a sequence of tokens.
Thanks to TypeScript's type inference, the return type of our next function is a union of all possible tokens: Token<"eof"> | Token<"=="> | Token<"="> | ... | Token<...>
function next() {
  skipWhitespace();
  line = endLine;
  if (index >= length) return tk('eof', 0);
  const ch = source[index];
  const la = source[index + 1];
  switch (ch) {
    ...
  }
}
The function starts by skipping any leading whitespace. This preprocessing step simplifies token identification by avoiding unnecessary checks for spaces or line breaks.
If the function has reached the very end of the input, it returns a special eof token, indicating that tokenization is complete.
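The tk helper itself isn't included in the excerpts. A minimal sketch of what it might look like, assuming it builds a Token from the scanner's current position and then advances the index (the actual implementation may differ):

// Hypothetical sketch: create a token of the given kind spanning
// `consumed` characters from the current position, then advance past them.
function tk<Kind extends string>(kind: Kind, consumed: number): Token<Kind> {
  const token = { kind, start: index, end: index + consumed, line, source };
  index += consumed;
  return token;
}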
We use the current character and a lookahead (the next character) to recognize these operators. The scanner checks for two-character operators first, so the longest possible match always wins.
...
// 2-char operators
case '=':
  return la === '=' ? tk('==', 2) : tk('=', 1);
case '|':
  return la === '|' ? tk('||', 2) : tk('|', 1);
case '&':
  return la === '&' ? tk('&&', 2) : tk('&', 1);
case '>':
  return la === '='
    ? tk('>=', 2)
    : la === '>'
      ? tk('>>', 2)
      : tk('>', 1);
case '<':
  return la === '=' ? tk('<=', 2) : tk('<', 1);
// 1-char operators
case '{':
case '}':
case '.':
case ',':
case '?':
case ':':
case '*':
case '/':
case '~':
case '!':
case '(':
case ')':
case '+':
case '-':
  return tk(ch, 1);
There are two approaches for matching keywords: using regular expressions or employing a Trie structure.
The regular expression approach is straightforward and works by defining a pattern to match specific keywords, as shown below. It ensures that the match is followed by a non-word character or the end of the input.
const keywordRegex = /^(fn|var|main)(?:[^\w_]|$)/;
This method is simple to implement but may become less efficient as the list of keywords grows.
The function below uses a regular expression to match a keyword at the scanner's current position.
export type MapToToken<T extends string> = T extends infer U ? Token<U> : never;

function matchRegex<T extends string>(
  regex: RegExp,
): MapToToken<T> | undefined {
  const m = regex.exec(source.slice(index));
  const token = m && (m[1] ?? m[0]);
  return token ? (tk(token, token.length) as MapToToken<T>) : undefined;
}
We use the matchRegex function in the scanner's default case to determine whether the current text matches any of the keywords. This happens only after all earlier checks (like digits or operators) fail to identify a valid token. If a match is found, the keyword's token is returned.
...
default:
  const keywordToken = matchRegex<'fn' | 'var' | 'main'>(keywordRegex);
  if (keywordToken) return keywordToken;
...
The type parameter of matchRegex ensures that the returned value is strongly typed, corresponding to one of the token kinds. In the example, keywordToken is inferred to be the union type Token<'fn'> | Token<'var'> | Token<'main'> | undefined.
A Trie is a tree-like structure where each node represents a character in a keyword. By traversing the Trie one input character at a time, common prefixes among keywords are shared, so each character is examined at most once. The Trie can also stop matching as soon as the next character doesn't continue any valid keyword, which keeps lookups fast.
To construct a Trie, we use the function below:
export type TrieNode = { [K in string]: TrieNode } & { [TrieMatch]?: string };
const TrieMatch = Symbol('TrieMatch');

export function createTrie<T extends string>(...map: T[]) {
  const trie: TrieNode = {};
  for (const token of map) {
    let current: TrieNode = trie;
    for (const char of token) current = current[char] ??= {};
    current[TrieMatch] = token;
  }
  return trie;
}
This function takes an array of keywords and builds a Trie. Each character of a keyword maps to a nested object, creating a path within the Trie. When the final character of a keyword is inserted, a special marker (TrieMatch) is added to signify that the path forms a complete keyword.
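For example, building a Trie from the keywords used in this post produces a structure roughly like this:

const trie = createTrie('fn', 'var', 'main');
// {
//   f: { n: { [TrieMatch]: 'fn' } },
//   v: { a: { r: { [TrieMatch]: 'var' } } },
//   m: { a: { i: { n: { [TrieMatch]: 'main' } } } },
// }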
The scanner uses this Trie to match tokens at the current position. The following function creates a matcher that checks the input against the Trie and returns a token if it matches a keyword. It consumes characters one by one, verifying that the input forms a valid keyword and that the character after it is not part of a larger identifier. If a match is found, the matcher creates and returns the corresponding token; otherwise it returns undefined.
function createTrieMatcher<T extends string>(map: readonly T[], end: MatchFn) {
  const trie = createTrie(...map);
  return (): MapToToken<T> | undefined => {
    let ch = source[index];
    let consumed = 0;
    let node = trie;
    while ((node = node[ch])) {
      consumed++;
      ch = source[index + consumed];
      if (node[TrieMatch] && end(ch))
        return tk(node[TrieMatch], consumed) as MapToToken<T>;
    }
  };
}
Once we have the matcher, we integrate it into the scanning process. By placing the matcher in the default case of the scanner's switch block, we check for keywords only after all other token checks have failed.
export const keywords = ['fn', 'var', 'main'] as const;
const keywordMatcher = createTrieMatcher(keywords, notIdent);
...
default:
  const keywordToken = keywordMatcher();
  if (keywordToken) return keywordToken;
The following code snippet scans string literals enclosed in single quotes, handling escape sequences and line breaks, and creates the corresponding token.
...
case "'": {
  let n = 1;
  while (
    index + n < length &&
    (source[index + n] !== "'" || source[index + n - 1] === '\\')
  ) {
    if (source[index + n] === '\n') endLine++;
    n++;
  }
  if (source[index + n] !== "'") throw error('Unterminated string', n);
  return tk('string', n + 1);
}
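A few examples of inputs this case handles:

// 'hello'  → a 'string' token spanning 7 characters (quotes included)
// 'it\'s'  → the escaped quote does not terminate the string
// 'oops    → throws 'Unterminated string'
// Literal line breaks inside a string are allowed; endLine is incremented
// so subsequent tokens report the correct line.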
The following code handles comment tokenization. When the scanner encounters a '#', it loops until a newline character, capturing the entire comment content, and creates a comment token whose length is the total number of consumed characters.
...
case '#': {
  let n = 1;
  while (index + n < length && source[index + n] !== '\n') n++;
  return tk('comment', n);
}
Our scanner recognizes decimal integers, hexadecimal numbers (prefixed with 0x), and binary numbers (prefixed with 0b). It also supports decimal numbers, where a decimal point must be followed by at least one digit, and allows underscores as digit separators.
const digit = (ch: string) => (ch >= '0' && ch <= '9') || ch === '_';
const hexDigit = (ch: string) =>
  digit(ch) || (ch >= 'a' && ch <= 'f') || (ch >= 'A' && ch <= 'F');
const binaryDigit = (ch: string) => ch === '0' || ch === '1' || ch === '_';

function matchWhile(match: MatchFn, consumed = 0) {
  while (index + consumed < length && match(source[index + consumed]))
    consumed++;
  return consumed;
}
...
case '0':
  if (la === 'x') {
    const consumed = matchWhile(hexDigit, 2);
    if (consumed === 2 || ident(current(consumed)))
      throw error('Expected hexadecimal digit', consumed + 1);
    return tk('number', consumed);
  }
  if (la === 'b') {
    const consumed = matchWhile(binaryDigit, 2);
    if (consumed === 2 || ident(current(consumed)))
      throw error('Expected binary digit', consumed + 1);
    return tk('number', consumed);
  }
// Fall through: a plain '0' is scanned like any other digit.
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9': {
  let consumed = matchWhile(digit, 1);
  if (consumed && current(consumed) === '.') {
    const decimals = matchWhile(digit, ++consumed);
    if (decimals === consumed) throw error('Expected digit', consumed);
    consumed = decimals;
  }
  if (ident(current(consumed))) throw error('Expected digit', consumed + 1);
  return tk('number', consumed);
}
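To make the rules concrete, here are a few literals this logic accepts and rejects:

// Accepted: each produces a single 'number' token.
//   42        plain integer
//   1_000     underscores as digit separators
//   3.14      decimal point followed by digits
//   0x1F      hexadecimal
//   0b10_01   binary
//
// Rejected: each throws a CompilerError.
//   0x        'Expected hexadecimal digit'
//   0b2       'Expected binary digit'
//   3.        'Expected digit' (a decimal point must be followed by digits)
//   42abc     'Expected digit' (identifier character right after a number)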
The following code matches identifiers. It first checks whether the current character ch is a valid starting character for an identifier, as defined by the identFirst predicate. If it is, it uses the matchWhile function to consume the subsequent characters that satisfy the ident predicate (letters, digits, and underscores).
const identFirst = (ch: string) => ch === '_' || alpha(ch);
const ident = (ch: string) => ch === '_' || _ident.test(ch);
...
default:
  // Identifiers
  if (identFirst(ch)) return tk('ident', matchWhile(ident, 1));
  throw error(`Invalid character "${ch}"`, 1);
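The alpha helper and the _ident regular expression are not part of the excerpts; plausible definitions, offered here only as an assumption, could look like this:

// Assumed helpers (not shown in the original excerpts):
const alpha = (ch: string) =>
  (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
const _ident = /[a-zA-Z0-9]/; // '_' is checked separately by ident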
export class CompilerError {
  constructor(
    public message: string,
    public position: Position,
  ) {}
}
...
function error(message: string, consumed = 0, start = index) {
  index += consumed;
  return new CompilerError(message, { start, end: index, line, source });
}
...
When an error is encountered, the scanner calls the error function to create a new CompilerError object. This object encapsulates details about the error, such as:

- message: a human-readable description of what went wrong.
- position: where the error occurred, including the start and end offsets, the line number, and the source text, as defined by the Position interface.

The error function is invoked in the following error scenarios:

- An unterminated string literal.
- A missing digit after a 0x or 0b prefix.
- A missing digit after a decimal point, or an identifier character immediately following a number.
- A character that does not start any valid token.
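A sketch of how calling code might surface these errors, assuming a scan entry point like the one used at the end of this post:

// Hypothetical usage: scanning invalid input throws a CompilerError.
try {
  scan('0x').next(); // no hexadecimal digits after the 0x prefix
} catch (e) {
  if (e instanceof CompilerError)
    console.error(`${e.message} at line ${e.position.line}`);
}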
The each function converts our scanner into an iterable, so it can be used with for..of loops or with any other construct that supports the iteration protocol.
type ScanFn<Node extends Token<string>> = () => Node;

export function each<Node extends Token<string>>(scan: ScanFn<Node>) {
  return {
    [Symbol.iterator]() {
      return {
        next() {
          const value = scan();
          return value.kind === 'eof' ? { done: true, value } : { value };
        },
      };
    },
  };
}
for (const token of each(scan('Text').next)) console.log(token);
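For the input 'Text', the loop prints a single identifier token and stops at eof. The output would look roughly like this (exact field values depend on how line is initialized):

// { kind: 'ident', start: 0, end: 4, line: 1, source: 'Text' }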