Grammar-Lexer-Parser Pipeline
Here is a high-level view of a compiler frontend pipeline:
Every language needs a formal grammar to describe its syntax and semantics. Once a program conforms to the rules of the grammar in its source code (for example, as an input string or file), it is tokenized; the lexer then adds metadata to each token, for example where each token starts and finishes in the original source code. Lastly, the parser reshapes (restructures) the lexed output into an Abstract Syntax Tree (AST).
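To make the tokenize and lex steps concrete, here is a minimal hand-rolled sketch (not the lexer that pest will generate for us) that splits an input into number and operator tokens and records each token's start and end offsets. The `Token` type and `lex` function are illustrative names, and signed integers are deliberately ignored for brevity:

```rust
// A token plus the metadata the lexer attaches: the byte offsets where
// the token starts and finishes in the original source.
#[derive(Debug, PartialEq)]
struct Token {
    text: String,
    start: usize, // byte offset where the token begins
    end: usize,   // byte offset one past the token's last byte
}

fn lex(source: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = source.char_indices().peekable();
    while let Some(&(start, c)) = chars.peek() {
        if c.is_whitespace() {
            chars.next(); // skip whitespace, it produces no token
        } else if c.is_ascii_digit() {
            // consume a maximal run of digits as one integer token
            let mut end = start;
            while let Some(&(i, d)) = chars.peek() {
                if d.is_ascii_digit() {
                    end = i + d.len_utf8();
                    chars.next();
                } else {
                    break;
                }
            }
            tokens.push(Token { text: source[start..end].to_string(), start, end });
        } else {
            // anything else (operator, parenthesis) is a one-character token
            chars.next();
            let end = start + c.len_utf8();
            tokens.push(Token { text: source[start..end].to_string(), start, end });
        }
    }
    tokens
}

fn main() {
    for t in lex("12 + 3") {
        println!("{:?} at {}..{}", t.text, t.start, t.end);
    }
}
```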
Grammar
While there are a variety of ways to define a grammar, in this book we will use the Parsing Expression Grammar (PEG).
Here is what the grammar of our simple calculator language Calc (supporting addition and subtraction) looks like in PEG:
```pest
Program = _{ SOI ~ Expr ~ EOF }
Expr = { UnaryExpr | BinaryExpr | Term }
Term = _{ Int | "(" ~ Expr ~ ")" }
UnaryExpr = { Operator ~ Term }
BinaryExpr = { Term ~ (Operator ~ Term)+ }
Operator = { "+" | "-" }
Int = @{ Operator? ~ ASCII_DIGIT+ }
WHITESPACE = _{ " " | "\t" }
EOF = _{ EOI | ";" }
```
Filename: calculator/src/grammar.pest
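To see how these PEG rules recognize input, here is a hand-written sketch of the grammar as recursive functions (for illustration only; pest generates the real parser from the grammar file). Each function tries to match its rule at the start of the input and returns the remaining input on success: `~` (sequence) becomes chained calls, `|` (ordered choice) becomes `or_else`, and all function names are illustrative:

```rust
fn ws(s: &str) -> &str {
    // WHITESPACE = _{ " " | "\t" }
    s.trim_start_matches([' ', '\t'])
}

fn int(s: &str) -> Option<&str> {
    // Int = @{ Operator? ~ ASCII_DIGIT+ } -- atomic: no inner whitespace
    let body = s.strip_prefix(['+', '-']).unwrap_or(s);
    let rest = body.trim_start_matches(|c: char| c.is_ascii_digit());
    if rest.len() < body.len() { Some(rest) } else { None }
}

fn term(s: &str) -> Option<&str> {
    // Term = _{ Int | "(" ~ Expr ~ ")" }
    int(s).or_else(|| {
        let rest = expr(ws(s.strip_prefix('(')?))?;
        ws(rest).strip_prefix(')')
    })
}

fn unary_expr(s: &str) -> Option<&str> {
    // UnaryExpr = { Operator ~ Term }
    term(ws(s.strip_prefix(['+', '-'])?))
}

fn binary_expr(s: &str) -> Option<&str> {
    // BinaryExpr = { Term ~ (Operator ~ Term)+ }
    let mut rest = term(s)?;
    let mut count = 0;
    while let Some(r) = ws(rest).strip_prefix(['+', '-']).and_then(|r| term(ws(r))) {
        rest = r;
        count += 1;
    }
    if count > 0 { Some(rest) } else { None }
}

fn expr(s: &str) -> Option<&str> {
    // Expr = { UnaryExpr | BinaryExpr | Term } -- ordered choice: first match wins
    unary_expr(s).or_else(|| binary_expr(s)).or_else(|| term(s))
}

fn program(s: &str) -> bool {
    // Program = _{ SOI ~ Expr ~ EOF } with EOF = _{ EOI | ";" }
    matches!(expr(ws(s)).map(ws), Some("") | Some(";"))
}

fn main() {
    assert!(program("1 + 2 - 3"));
    assert!(program("(1 + 2)"));
    assert!(!program("1 +"));
}
```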
This grammar basically defines the syntax and semantics where

- each `Program` consists of expressions (`Expr`)
- expressions are either unary (`-1`) or binary (`1 + 2`)
- unary or binary expressions are made of `Term` and `Operator` (`"+"` and `"-"`)
- the only atom is the integer `Int`
Given our grammar, we will use pest, which is a powerful parser generator for PEG grammars. (For more details on pest, check out the pest book.)
pest derives the parser CalcParser::parse from our grammar:
```rust
#[derive(pest_derive::Parser)]
#[grammar = "grammar.pest"]
struct CalcParser;
```
Filename: calculator/src/parser.rs
and performs all the steps of the frontend pipeline that we mentioned, so that we can start parsing any Calc source code (source: &str) via the Rules of our grammar:

```rust
CalcParser::parse(Rule::Program, source)
```
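As a taste of what this call gives us, here is a hedged sketch of inspecting the result; it is a fragment, not a standalone program, since it needs the pest and pest_derive crates, the grammar file above, and the derived CalcParser:

```rust
use pest::Parser;

// Parse a Calc expression and walk over the matched pairs; each Pair
// carries the Rule that produced it and its text span in the source.
let pairs = CalcParser::parse(Rule::Program, "1 + 2 - 3").expect("parse failed");
for pair in pairs {
    println!("{:?}: {:?}", pair.as_rule(), pair.as_str());
}
```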
Before doing that, we need to define our Abstract Syntax Tree (AST) in the next section.