Grammar-Lexer-Parser Pipeline
Here is a high-level view of a compiler frontend pipeline:
Every language needs a (formal) grammar to describe its syntax and semantics. Once a program adheres to the rules of the grammar in source code (for example, as an input string or file), it is tokenized, and the lexer adds some metadata to each token, for example, where each token starts and ends in the original source code. Lastly, the parser reshapes (restructures) the lexed output into an Abstract Syntax Tree.
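To make the lexing step concrete, here is a minimal sketch in plain Rust. This is illustrative only: the `Token` type, the `lex` function, and the byte-span representation are our own names for this example, not part of pest, which handles all of this for us later.

```rust
// A minimal, illustrative lexer for Calc: it produces tokens together with
// their byte spans in the original source (the metadata mentioned above).
#[derive(Debug, PartialEq)]
enum Token {
    Int(i32),
    Plus,
    Minus,
    LParen,
    RParen,
}

fn lex(source: &str) -> Vec<(Token, std::ops::Range<usize>)> {
    let mut tokens = Vec::new();
    let bytes = source.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        match bytes[i] {
            // Skipped silently, like the WHITESPACE rule in our grammar.
            b' ' | b'\t' => i += 1,
            b'+' => { i += 1; tokens.push((Token::Plus, start..i)); }
            b'-' => { i += 1; tokens.push((Token::Minus, start..i)); }
            b'(' => { i += 1; tokens.push((Token::LParen, start..i)); }
            b')' => { i += 1; tokens.push((Token::RParen, start..i)); }
            b'0'..=b'9' => {
                while i < bytes.len() && bytes[i].is_ascii_digit() { i += 1; }
                let value: i32 = source[start..i].parse().unwrap();
                tokens.push((Token::Int(value), start..i));
            }
            other => panic!("unexpected byte {other:?} at offset {start}"),
        }
    }
    tokens
}

fn main() {
    // "1 + 2" lexes to: Int(1) at 0..1, Plus at 2..3, Int(2) at 4..5
    for (token, span) in lex("1 + 2") {
        println!("{token:?} at {span:?}");
    }
}
```

Each token carries its span so that later stages (for example, error reporting) can point back at the exact place in the source.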
Grammar
While there are a variety of ways to define a grammar, in this book we will use a Parsing Expression Grammar (PEG).
Here is what the grammar of our simple calculator language Calc (supporting addition and subtraction) looks like in PEG:
Program = _{ SOI ~ Expr ~ EOF }
Expr = { UnaryExpr | BinaryExpr | Term }
Term = _{Int | "(" ~ Expr ~ ")" }
UnaryExpr = { Operator ~ Term }
BinaryExpr = { Term ~ (Operator ~ Term)+ }
Operator = { "+" | "-" }
Int = @{ Operator? ~ ASCII_DIGIT+ }
WHITESPACE = _{ " " | "\t" }
EOF = _{ EOI | ";" }
Filename: calculator/src/grammar.pest
This grammar basically defines the syntax and semantics where

- each `Program` consists of expressions (`Expr`)
- expressions are either unary (`-1`) or binary (`1 + 2`)
- unary and binary expressions are made of `Term` and `Operator` (`"+"` and `"-"`)
- the only atom is the integer `Int`
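To see what these rules mean operationally, here is an illustrative recursive-descent sketch in plain Rust that mirrors them. This is not how pest works internally (pest generates the parser for us); the `Calc` struct and its methods are hypothetical names for this example, and it evaluates as it parses rather than building a tree.

```rust
// An illustrative recursive-descent evaluator mirroring the grammar:
// Expr matches a Term followed by zero or more (Operator Term) pairs,
// and Term is either an Int or a parenthesized Expr.
struct Calc<'a> {
    src: &'a [u8],
    pos: usize,
}

impl<'a> Calc<'a> {
    fn new(src: &'a str) -> Self {
        Calc { src: src.as_bytes(), pos: 0 }
    }

    // WHITESPACE = _{ " " | "\t" } — skipped silently between tokens.
    fn skip_ws(&mut self) {
        while matches!(self.src.get(self.pos), Some(b' ' | b'\t')) {
            self.pos += 1;
        }
    }

    // Expr = UnaryExpr | BinaryExpr | Term — all three reduce to a
    // Term followed by zero or more (Operator Term) pairs.
    fn expr(&mut self) -> i64 {
        let mut value = self.term();
        loop {
            self.skip_ws();
            match self.src.get(self.pos) {
                Some(b'+') => { self.pos += 1; value += self.term(); }
                Some(b'-') => { self.pos += 1; value -= self.term(); }
                _ => return value,
            }
        }
    }

    // Term = Int | "(" ~ Expr ~ ")"
    fn term(&mut self) -> i64 {
        self.skip_ws();
        match self.src.get(self.pos) {
            Some(b'(') => {
                self.pos += 1;
                let value = self.expr();
                self.skip_ws();
                assert_eq!(self.src.get(self.pos), Some(&b')'), "expected ')'");
                self.pos += 1;
                value
            }
            // A leading "-" before a Term, as in UnaryExpr.
            Some(b'-') => {
                self.pos += 1;
                -self.term()
            }
            _ => self.int(),
        }
    }

    // Int = ASCII_DIGIT+
    fn int(&mut self) -> i64 {
        let start = self.pos;
        while matches!(self.src.get(self.pos), Some(b'0'..=b'9')) {
            self.pos += 1;
        }
        std::str::from_utf8(&self.src[start..self.pos])
            .unwrap()
            .parse()
            .expect("expected an integer")
    }
}

fn main() {
    assert_eq!(Calc::new("1 + 2 - 3").expr(), 0);
    assert_eq!(Calc::new("-(1 + 2)").expr(), -3);
}
```

Writing this by hand for every language quickly becomes tedious and error-prone, which is exactly why we hand the grammar to a parser generator instead.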
Given our grammar, we will use pest, which is a powerful parser generator for PEG grammars. (For more details on pest, check out the pest book.)
pest derives the parser `CalcParser::parse` from our grammar:
#[derive(pest_derive::Parser)]
#[grammar = "grammar.pest"]
struct CalcParser;
Filename: calculator/src/parser.rs
and does all the steps of the frontend pipeline that we mentioned, so that we can start parsing any Calc source code (`source: &str`) via the `Rule`s of our grammar:
CalcParser::parse(Rule::Program, source)
Before doing that, we need to define our Abstract Syntax Tree (AST) in the next section.