Grammar-Lexer-Parser Pipeline

Here is a high-level view of a compiler frontend pipeline


grammar, lexer, parser

Every language needs a (formal) grammar to describe its syntax and semantics. Once a program adheres to the rules of the grammar in Source Code (for example as input string or file format), it is tokenized and then lexer adds some metadata to each token for example, where each token starts and finishes in the original source code. Lastly, parsing (reshaping or restructuring) of the lexed outputs to Abstract Syntax Tree.

Grammar

While there are varieties of ways to define the grammar, in this book we will use the Parsing Expression Grammar (PEG).

Here is how our simple calculator language Calc (supporting addition and subtraction) grammar looks like in PEG

Program = _{ SOI ~ Expr ~ EOF }

Expr = { UnaryExpr | BinaryExpr | Term }

Term = _{Int | "(" ~ Expr ~ ")" }

UnaryExpr = { Operator ~ Term }

BinaryExpr = { Term ~ (Operator ~ Term)+ }

Operator = { "+" | "-" }

Int = @{ Operator? ~ ASCII_DIGIT+ }

WHITESPACE = _{ " " | "\t" }

EOF = _{ EOI | ";" }

Filename: calculator/src/grammar.pest

This grammar basically defines the syntax and semantics where

  • each Program consists of expressions (Expr)
  • expressions are either unary (-1) or binary (1 + 2)
  • unary or binary expressions are made of Term and Operator ("+" and "-")
  • the only atom is integer Int

Given our grammar, we will use pest which is a powerful parser generator of PEG grammars. (For more details on pest, checkout the pest book

pest derives the parser CalcParser::parse from our grammar

#[derive(pest_derive::Parser)]
#[grammar = "grammar.pest"]
struct CalcParser;

Filename: calculator/src/parser.rs

and does all the steps of the frontend pipeline that we mentioned so that we can start parsing any Calc source code (source: &str) via the Rules of our grammar

CalcParser::parse(Rule::Program, source)

Before doing that, we need to define our Abstract Syntax Tree (AST) in the next section.