Optimizing LLVM IR
So far, our compiler generates correct but naive LLVM IR. Every variable gets a stack allocation (alloca), every parameter is stored and reloaded, and the code includes redundant operations. In this chapter, we integrate LLVM’s pass manager to optimize our IR before execution.
Prerequisites: This chapter assumes you understand LLVM IR syntax. If instructions like `alloca`, `load`, `store`, and `getelementptr` are unfamiliar, read From AST to IR first.
Requirement: This chapter requires LLVM 20+ because the `run_passes()` API we use is part of LLVM's New Pass Manager, which is fully supported in LLVM 20. Check your version with `llvm-config --version`. If you have a different LLVM version, update the inkwell feature in thirdlang/Cargo.toml accordingly; for LLVM 20.x it is:
# For LLVM 20.x
inkwell = { version = "0.7", features = ["llvm20-1"] }
What Are Optimization Passes?
An optimization pass is a transformation that improves the IR without changing its behavior. LLVM provides dozens of passes that work together in a pipeline:
| Pass | What It Does |
|---|---|
| `mem2reg` | Promotes stack allocations to SSA registers |
| `dce` | Removes dead (unused) code |
| `instcombine` | Combines redundant instructions |
| `simplifycfg` | Simplifies the control flow graph |
The Optimization Pipeline in Action
Let us trace how each pass transforms the IR for a simple increment method. This is the most important section to understand.
Step 0: Original Source
Create examples/counter.tl with a Counter class (or use the existing one):
class Counter {
value: int
def increment(self) -> int {
self.value = self.value + 1
return self.value
}
}
Step 1: Unoptimized IR (what our codegen produces)
Generate with:
cd thirdlang
rustup run nightly cargo run -- --ir examples/counter.tl
Output (showing just the increment method):
define i64 @Counter__increment(ptr %self) {
entry:
%self1 = alloca ptr, align 8 ; allocate stack slot for self
store ptr %self, ptr %self1, align 8 ; store parameter to stack
%self2 = load ptr, ptr %self1, align 8 ; load self from stack
%field_ptr = getelementptr %Counter, ptr %self2, i32 0, i32 0
%field = load i64, ptr %field_ptr ; load self.value
%add = add i64 %field, 1 ; add 1
%self3 = load ptr, ptr %self1, align 8 ; load self again!
%field_ptr4 = getelementptr %Counter, ptr %self3, i32 0, i32 0
store i64 %add, ptr %field_ptr4 ; store to self.value
%self5 = load ptr, ptr %self1, align 8 ; load self a third time!
%field_ptr6 = getelementptr %Counter, ptr %self5, i32 0, i32 0
%field7 = load i64, ptr %field_ptr6 ; load self.value for return
ret i64 %field7
}
Problems: 13 instructions, redundant loads of %self, and a stack allocation for a parameter.
Step 2: After mem2reg
Generate with:
rustup run nightly cargo run -- --ir --passes "mem2reg" examples/counter.tl
The mem2reg pass promotes the alloca to an SSA register:
define i64 @Counter__increment(ptr %self) {
entry:
; No more alloca! %self is used directly
%field_ptr = getelementptr %Counter, ptr %self, i32 0, i32 0
%field = load i64, ptr %field_ptr
%add = add i64 %field, 1
%field_ptr4 = getelementptr %Counter, ptr %self, i32 0, i32 0
store i64 %add, ptr %field_ptr4
%field_ptr6 = getelementptr %Counter, ptr %self, i32 0, i32 0
%field7 = load i64, ptr %field_ptr6
ret i64 %field7
}
Result: Eliminated alloca, store, and 3 redundant loads of %self.
Step 3: After mem2reg, instcombine
Generate with:
rustup run nightly cargo run -- --ir --passes "mem2reg,instcombine" examples/counter.tl
The instcombine pass merges redundant GEP instructions and simplifies patterns:
define i64 @Counter__increment(ptr %self) {
entry:
%field = load i64, ptr %self ; GEPs merged, %self IS the field ptr
%add = add i64 %field, 1
store i64 %add, ptr %self
%field7 = load i64, ptr %self ; still loading for return
ret i64 %field7
}
Result: Since Counter has only one field at offset 0, GEP simplifies away.
Step 4: After full pipeline (with dce)
Generate with:
rustup run nightly cargo run -- --ir -O examples/counter.tl
Or explicitly:
rustup run nightly cargo run -- --ir --passes "mem2reg,instcombine,dce,simplifycfg" examples/counter.tl
With dce in the pipeline, the unnecessary final load disappears - we already have the value in %add, so the return uses it directly:
define i64 @Counter__increment(ptr %self) {
entry:
%field = load i64, ptr %self, align 4
%add = add i64 %field, 1
store i64 %add, ptr %self, align 4
ret i64 %add ; return %add directly!
}
Final Result: 4 instructions instead of 13!
Summary of the Pipeline
| Stage | Instructions in increment |
|---|---|
| Unoptimized | 13 |
| After mem2reg | 8 |
| After mem2reg, instcombine | 5 |
| Full pipeline (adds dce, simplifycfg) | 4 |
Implementation
The run_passes Method
Inkwell provides access to the pass manager through Module::run_passes():
// Requires: use inkwell::OptimizationLevel;
//           use inkwell::passes::PassBuilderOptions;
//           use inkwell::targets::{CodeModel, InitializationConfig, RelocMode, Target, TargetMachine};
/// Run LLVM optimization passes using the New Pass Manager
///
/// # Arguments
/// * `passes` - A comma-separated list of passes, e.g., "dce,mem2reg,instcombine"
///   or a preset like "default<O2>"
///
/// # Common passes for teaching:
/// - `dce` - Dead Code Elimination
/// - `mem2reg` - Promote allocas to SSA registers
/// - `instcombine` - Combine redundant instructions
/// - `simplifycfg` - Simplify control flow graph
/// - `gvn` - Global Value Numbering
/// - `default<O0>` through `default<O3>` - Standard optimization levels
pub fn run_passes(&self, passes: &str) -> Result<(), String> {
// Initialize native target for the current machine
Target::initialize_native(&InitializationConfig::default())
.map_err(|e| format!("Failed to initialize native target: {}", e))?;
// Get the default target triple for this machine
let triple = TargetMachine::get_default_triple();
// Get the target from the triple
let target = Target::from_triple(&triple)
.map_err(|e| format!("Failed to get target from triple: {}", e))?;
// Create target machine with default settings
let target_machine = target
.create_target_machine(
&triple,
"generic", // CPU
"", // Features
OptimizationLevel::Default,
RelocMode::Default,
CodeModel::Default,
)
.ok_or_else(|| "Failed to create target machine".to_string())?;
// Create pass builder options
let pass_options = PassBuilderOptions::create();
pass_options.set_verify_each(true); // Verify IR after each pass
// Run the passes
self.module
.run_passes(passes, &target_machine, pass_options)
.map_err(|e| format!("Failed to run passes: {}", e))
}
Key points:
- Initialize Native Target: Required before creating a TargetMachine
- Get Target Triple: The host machine description (e.g., x86_64-apple-darwin)
- Create TargetMachine: Needed for target-specific optimizations
- run_passes: Takes a comma-separated list of passes
Pass Pipeline String
The passes argument is a string like:
"dce,mem2reg,instcombine"- Custom pipeline"default<O2>"- LLVM’s standard O2 optimization
Integration with JIT
We add an optional optimization step to our JIT runner:
// Requires: use inkwell::context::Context; use inkwell::OptimizationLevel;
// plus the crate's CodeGen, Program, and ClassRegistry types.
/// JIT compile and run a program with optional optimization passes
///
/// # Arguments
/// * `program` - The parsed and type-checked program
/// * `classes` - Class registry from type checking
/// * `passes` - Optional optimization passes (e.g., "dce,mem2reg,instcombine")
pub fn jit_run_with_opts(
program: &Program,
classes: ClassRegistry,
passes: Option<&str>,
) -> Result<i64, String> {
let context = Context::create();
let mut codegen = CodeGen::new(&context, "thirdlang", classes);
codegen.compile(program)?;
// Run optimization passes if specified
if let Some(pass_pipeline) = passes {
codegen.run_passes(pass_pipeline)?;
}
// Create execution engine
let engine = codegen
.module
.create_jit_execution_engine(OptimizationLevel::Default)
.map_err(|e| format!("Failed to create JIT: {}", e.to_string()))?;
// Call the __main wrapper function
unsafe {
let func: inkwell::execution_engine::JitFunction<unsafe extern "C" fn() -> i64> =
engine.get_function("__main").map_err(|e| e.to_string())?;
Ok(func.call())
}
}
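A usage sketch, assuming `program` and `classes` come from the parser and type checker built in earlier chapters (the wrapper function name is illustrative):

```rust
// Sketch only: run a type-checked program through the teaching pipeline.
fn run_optimized(program: &Program, classes: ClassRegistry) -> Result<i64, String> {
    // Pass None instead of Some(...) to skip optimization entirely.
    jit_run_with_opts(program, classes, Some("mem2reg,instcombine,dce,simplifycfg"))
}
```

Note that the passes run on the module before the execution engine is created, so the JIT compiles IR that is already optimized.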
Understanding Each Pass
mem2reg: The Essential Pass
mem2reg converts stack-allocated variables to SSA registers. This is critical because our codegen creates allocas for every variable, but registers are much faster than memory.
Before:
%x = alloca i64
store i64 42, ptr %x
%val = load i64, ptr %x
After:
; no instructions remain - every use of %val is replaced by the constant 42
dce: Dead Code Elimination
Removes instructions whose results are never used:
Before:
%unused = add i64 %a, %b ; result never used
%result = mul i64 %c, %d
ret i64 %result
After:
%result = mul i64 %c, %d
ret i64 %result
instcombine: Instruction Combining
Simplifies patterns:
- `sub i64 %x, 1` becomes `add i64 %x, -1`
- `mul i64 %x, 2` becomes `shl i64 %x, 1` (shift left)
- Constant folding: `add i64 3, 4` becomes `7`
simplifycfg: Control Flow Simplification
Cleans up the control flow graph:
- Removes empty basic blocks
- Merges blocks with single predecessors
- Simplifies trivial branches
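To watch these passes in isolation from Rust rather than via the CLI, here is a small sketch built on `compile_to_ir_with_opts`, the same helper used by the tests later in this chapter (the function name `show_each_pass` is illustrative):

```rust
// Sketch only: print the IR after applying each pass on its own.
fn show_each_pass(source: &str) {
    for pass in ["mem2reg", "dce", "instcombine", "simplifycfg"] {
        let ir = thirdlang::compile_to_ir_with_opts(source, Some(pass))
            .expect("compilation should succeed");
        println!("=== after {pass} ===\n{ir}");
    }
}
```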
Using the CLI
# Run without optimization
thirdlang examples/point.tl
# Run with optimization
thirdlang -O examples/point.tl
# Run with custom passes
thirdlang --passes "mem2reg,dce" examples/point.tl
# Print unoptimized IR
thirdlang --ir examples/point.tl
# Print optimized IR
thirdlang --ir -O examples/point.tl
# Use LLVM's O2 pipeline
thirdlang --passes "default<O2>" examples/point.tl
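The CLI wiring itself is not shown in this chapter. Here is a minimal sketch of how these flags could map onto a pass pipeline, assuming a clap-based argument parser (the struct and field names are hypothetical):

```rust
use clap::Parser;

/// Hypothetical CLI surface matching the commands above; the real main.rs may differ.
#[derive(Parser)]
struct Cli {
    /// Source file to compile and run
    file: String,
    /// Print IR instead of executing
    #[arg(long)]
    ir: bool,
    /// Run the standard teaching pipeline
    #[arg(short = 'O')]
    optimize: bool,
    /// Custom comma-separated pass pipeline (overrides -O)
    #[arg(long)]
    passes: Option<String>,
}

fn pass_pipeline(cli: &Cli) -> Option<String> {
    // --passes wins; -O falls back to the teaching pipeline from this chapter.
    cli.passes.clone().or_else(|| {
        cli.optimize
            .then(|| "mem2reg,instcombine,dce,simplifycfg".to_string())
    })
}
```

The resulting `Option<String>` maps directly onto the `passes: Option<&str>` parameter of `jit_run_with_opts` (via `.as_deref()`).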
Testing Optimization
We verify that optimization produces correct results:
#[test]
fn test_optimization_pipeline() {
use thirdlang::compile_to_ir_with_opts;
let source = r#"
class Simple {
value: int
def __init__(self, v: int) {
self.value = v
}
def get(self) -> int {
return self.value
}
}
s = new Simple(100)
result = s.get()
delete s
result
"#;
// Get unoptimized IR
let unopt_ir = compile_to_ir_with_opts(source, None).unwrap();
// Get optimized IR
let opt_ir = compile_to_ir_with_opts(source, Some("mem2reg,dce,instcombine")).unwrap();
// Optimized IR should be shorter (fewer alloca instructions)
assert!(
opt_ir.len() < unopt_ir.len(),
"Optimized IR should be smaller"
);
// Unoptimized should have alloca for parameters
assert!(
unopt_ir.contains("alloca"),
"Unoptimized IR should have allocas"
);
}
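As a follow-up check, we can also assert that the pipeline reduces the number of allocas rather than just the overall text size. A sketch under the same assumptions, reusing `compile_to_ir_with_opts` from above:

```rust
#[test]
fn test_mem2reg_reduces_allocas() {
    use thirdlang::compile_to_ir_with_opts;

    let source = r#"
        class Simple {
            value: int
            def __init__(self, v: int) {
                self.value = v
            }
            def get(self) -> int {
                return self.value
            }
        }
        s = new Simple(100)
        result = s.get()
        delete s
        result
    "#;

    let unopt_ir = compile_to_ir_with_opts(source, None).unwrap();
    let opt_ir = compile_to_ir_with_opts(source, Some("mem2reg")).unwrap();

    // mem2reg promotes parameter and local slots, leaving fewer allocas behind.
    assert!(
        opt_ir.matches("alloca").count() < unopt_ir.matches("alloca").count(),
        "mem2reg should eliminate stack allocations"
    );
}
```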
Optimization Levels
LLVM provides preset pipelines:
| Level | Description |
|---|---|
| `default<O0>` | No optimization (verification only) |
| `default<O1>` | Light optimization |
| `default<O2>` | Standard optimization (recommended) |
| `default<O3>` | Aggressive optimization |
For teaching, we use `mem2reg,instcombine,dce,simplifycfg` so we can see what each pass does individually.
Summary
We added LLVM optimization:
- Created a `run_passes()` method that accepts a pipeline string
- Integrated optimization into the JIT runner
- Added `-O` and `--passes` CLI flags
The key insight: our naive codegen produces correct but inefficient IR. LLVM passes transform it into efficient code:
- `mem2reg` - eliminates stack allocations
- `instcombine` - merges redundant instructions
- `dce` - removes unused code
Try thirdlang --ir examples/counter.tl vs thirdlang --ir -O examples/counter.tl to see the difference!
Cross-References
- LLVM Code Generation - How we generate the initial IR
- From AST to IR - IR concepts from Secondlang