Optimizing LLVM IR
So far, our compiler generates correct but naive LLVM IR. Every variable gets a stack allocation (alloca), every parameter is stored and reloaded, and the code includes redundant operations. In this chapter, we integrate LLVM’s pass manager to optimize our IR before execution.
Prerequisites: This chapter assumes you understand LLVM IR syntax. If instructions like `alloca`, `load`, `store`, and `getelementptr` are unfamiliar, read From AST to IR first.
Requirement: This chapter requires LLVM 20+ because the `run_passes()` API we use is part of LLVM's New Pass Manager, which is fully supported in LLVM 20. Check your version with `llvm-config --version`. If you have a different LLVM version, update the inkwell feature in thirdlang/Cargo.toml accordingly; for LLVM 20.x it is:
# For LLVM 20.x
inkwell = { version = "0.7", features = ["llvm20-1"] }
What Are Optimization Passes?
An optimization pass is a transformation that improves the IR without changing its behavior. LLVM provides dozens of passes that work together in a pipeline:
| Pass | What It Does |
|---|---|
| `mem2reg` | Promotes stack allocations to SSA registers |
| `dce` | Removes dead (unused) code |
| `instcombine` | Combines redundant instructions |
| `simplifycfg` | Simplifies the control flow graph |
The Optimization Pipeline in Action
Let us trace how each pass transforms the IR for a simple increment method. This is the most important section to understand.
Step 0: Original Source
Create examples/counter.tl with a Counter class (or use the existing one):
class Counter {
value: int
def increment(self) -> int {
self.value = self.value + 1
return self.value
}
}
Step 1: Unoptimized IR (what our codegen produces)
Generate with:
cd thirdlang
rustup run nightly cargo run -- --ir examples/counter.tl
Output (showing just the increment method):
define i64 @Counter__increment(ptr %self) {
entry:
%self1 = alloca ptr, align 8 ; allocate stack slot for self
store ptr %self, ptr %self1, align 8 ; store parameter to stack
%self2 = load ptr, ptr %self1, align 8 ; load self from stack
%field_ptr = getelementptr %Counter, ptr %self2, i32 0, i32 0
%field = load i64, ptr %field_ptr ; load self.value
%add = add i64 %field, 1 ; add 1
%self3 = load ptr, ptr %self1, align 8 ; load self again!
%field_ptr4 = getelementptr %Counter, ptr %self3, i32 0, i32 0
store i64 %add, ptr %field_ptr4 ; store to self.value
%self5 = load ptr, ptr %self1, align 8 ; load self a third time!
%field_ptr6 = getelementptr %Counter, ptr %self5, i32 0, i32 0
%field7 = load i64, ptr %field_ptr6 ; load self.value for return
ret i64 %field7
}
Problems: 13 instructions, redundant loads of %self, and a stack allocation for a parameter.
Step 2: After mem2reg
Generate with:
rustup run nightly cargo run -- --ir --passes "mem2reg" examples/counter.tl
The mem2reg pass promotes the alloca to an SSA register:
define i64 @Counter__increment(ptr %self) {
entry:
; No more alloca! %self is used directly
%field_ptr = getelementptr %Counter, ptr %self, i32 0, i32 0
%field = load i64, ptr %field_ptr
%add = add i64 %field, 1
%field_ptr4 = getelementptr %Counter, ptr %self, i32 0, i32 0
store i64 %add, ptr %field_ptr4
%field_ptr6 = getelementptr %Counter, ptr %self, i32 0, i32 0
%field7 = load i64, ptr %field_ptr6
ret i64 %field7
}
Result: Eliminated alloca, store, and 3 redundant loads of %self.
Step 3: After mem2reg, instcombine
Generate with:
rustup run nightly cargo run -- --ir --passes "mem2reg,instcombine" examples/counter.tl
The instcombine pass merges redundant GEP instructions and simplifies patterns:
define i64 @Counter__increment(ptr %self) {
entry:
%field = load i64, ptr %self ; GEPs merged, %self IS the field ptr
%add = add i64 %field, 1
store i64 %add, ptr %self
%field7 = load i64, ptr %self ; still loading for return
ret i64 %field7
}
Result: Since Counter has only one field at offset 0, GEP simplifies away.
Step 4: After full pipeline (with dce)
Generate with:
rustup run nightly cargo run -- --ir -O examples/counter.tl
Or explicitly:
rustup run nightly cargo run -- --ir --passes "mem2reg,instcombine,dce,simplifycfg" examples/counter.tl
With dce in the pipeline, the unnecessary final load disappears - we already have the value in %add, so the return uses it directly:
define i64 @Counter__increment(ptr %self) {
entry:
%field = load i64, ptr %self, align 4
%add = add i64 %field, 1
store i64 %add, ptr %self, align 4
ret i64 %add ; return %add directly!
}
Final Result: 4 instructions instead of 13!
Summary of the Pipeline
| Stage | Instructions in increment |
|---|---|
| Unoptimized | 13 |
| After mem2reg | 8 |
| After mem2reg, instcombine | 5 |
| Full pipeline (adds dce, simplifycfg) | 4 |
Implementation
The run_passes Method
Inkwell provides access to the pass manager through Module::run_passes():
// Requires: use inkwell::OptimizationLevel;
//           use inkwell::passes::PassBuilderOptions;
//           use inkwell::targets::{CodeModel, InitializationConfig, RelocMode, Target, TargetMachine};
/// Run LLVM optimization passes using the New Pass Manager
///
/// # Arguments
/// * `passes` - A comma-separated list of passes, e.g., "dce,mem2reg,instcombine"
///   or a preset like "default<O2>"
///
/// # Common passes for teaching:
/// - `dce` - Dead Code Elimination
/// - `mem2reg` - Promote allocas to SSA registers
/// - `instcombine` - Combine redundant instructions
/// - `simplifycfg` - Simplify control flow graph
/// - `gvn` - Global Value Numbering
/// - `default<O0>` through `default<O3>` - Standard optimization levels
pub fn run_passes(&self, passes: &str) -> Result<(), String> {
// Initialize native target for the current machine
Target::initialize_native(&InitializationConfig::default())
.map_err(|e| format!("Failed to initialize native target: {}", e))?;
// Get the default target triple for this machine
let triple = TargetMachine::get_default_triple();
// Get the target from the triple
let target = Target::from_triple(&triple)
.map_err(|e| format!("Failed to get target from triple: {}", e))?;
// Create target machine with default settings
let target_machine = target
.create_target_machine(
&triple,
"generic", // CPU
"", // Features
OptimizationLevel::Default,
RelocMode::Default,
CodeModel::Default,
)
.ok_or_else(|| "Failed to create target machine".to_string())?;
// Create pass builder options
let pass_options = PassBuilderOptions::create();
pass_options.set_verify_each(true); // Verify IR after each pass
// Run the passes
self.module
.run_passes(passes, &target_machine, pass_options)
.map_err(|e| format!("Failed to run passes: {}", e))
}
Key points:
- Initialize Native Target: Required before creating a TargetMachine
- Get Target Triple: The host machine description (e.g., x86_64-apple-darwin)
- Create TargetMachine: Needed for target-specific optimizations
- run_passes: Takes a comma-separated list of passes
Pass Pipeline String
The passes argument is a string like:
"dce,mem2reg,instcombine"- Custom pipeline"default<O2>"- LLVM’s standard O2 optimization
Integration with JIT
We add an optional optimization step to our JIT runner:
// Requires: use inkwell::context::Context; use inkwell::OptimizationLevel;
// plus the crate's CodeGen, Program, and ClassRegistry types.
/// JIT compile and run a program with optional optimization passes
///
/// # Arguments
/// * `program` - The parsed and type-checked program
/// * `classes` - Class registry from type checking
/// * `passes` - Optional optimization passes (e.g., "dce,mem2reg,instcombine")
pub fn jit_run_with_opts(
program: &Program,
classes: ClassRegistry,
passes: Option<&str>,
) -> Result<i64, String> {
let context = Context::create();
let mut codegen = CodeGen::new(&context, "thirdlang", classes);
codegen.compile(program)?;
// Run optimization passes if specified
if let Some(pass_pipeline) = passes {
codegen.run_passes(pass_pipeline)?;
}
// Create execution engine
let engine = codegen
.module
.create_jit_execution_engine(OptimizationLevel::Default)
.map_err(|e| format!("Failed to create JIT: {}", e.to_string()))?;
// Call the __main wrapper function
unsafe {
let func: inkwell::execution_engine::JitFunction<unsafe extern "C" fn() -> i64> =
engine.get_function("__main").map_err(|e| e.to_string())?;
Ok(func.call())
}
}
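A usage sketch, assuming `program` and `classes` come from the parser and type checker built in earlier chapters (the wrapper function name is illustrative):

```rust
// Sketch only: run a type-checked program through the teaching pipeline.
fn run_optimized(program: &Program, classes: ClassRegistry) -> Result<i64, String> {
    // Pass None instead of Some(...) to skip optimization entirely.
    jit_run_with_opts(program, classes, Some("mem2reg,instcombine,dce,simplifycfg"))
}
```

Note that the passes run on the module before the execution engine is created, so the JIT compiles IR that is already optimized.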
Understanding Each Pass
mem2reg: The Essential Pass
mem2reg converts stack-allocated variables to SSA registers. This is critical because our codegen creates allocas for every variable, but registers are much faster than memory.
Before:
%x = alloca i64
store i64 42, ptr %x
%val = load i64, ptr %x
After:
; no instructions remain - every use of %val is replaced by the constant 42
dce: Dead Code Elimination
Removes instructions whose results are never used:
Before:
%unused = add i64 %a, %b ; result never used
%result = mul i64 %c, %d
ret i64 %result
After:
%result = mul i64 %c, %d
ret i64 %result
instcombine: Instruction Combining
Simplifies patterns:
- `sub i64 %x, 1` becomes `add i64 %x, -1`
- `mul i64 %x, 2` becomes `shl i64 %x, 1` (shift left)
- Constant folding: `add i64 3, 4` becomes `7`
simplifycfg: Control Flow Simplification
Cleans up the control flow graph:
- Removes empty basic blocks
- Merges blocks with single predecessors
- Simplifies trivial branches
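To watch these passes in isolation from Rust rather than via the CLI, here is a small sketch built on `compile_to_ir_with_opts`, the same helper used by the tests later in this chapter (the function name `show_each_pass` is illustrative):

```rust
// Sketch only: print the IR after applying each pass on its own.
fn show_each_pass(source: &str) {
    for pass in ["mem2reg", "dce", "instcombine", "simplifycfg"] {
        let ir = thirdlang::compile_to_ir_with_opts(source, Some(pass))
            .expect("compilation should succeed");
        println!("=== after {pass} ===\n{ir}");
    }
}
```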
Using the CLI
# Run without optimization
thirdlang examples/point.tl
# Run with optimization
thirdlang -O examples/point.tl
# Run with custom passes
thirdlang --passes "mem2reg,dce" examples/point.tl
# Print unoptimized IR
thirdlang --ir examples/point.tl
# Print optimized IR
thirdlang --ir -O examples/point.tl
# Use LLVM's O2 pipeline
thirdlang --passes "default<O2>" examples/point.tl
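The CLI wiring itself is not shown in this chapter. Here is a minimal sketch of how these flags could map onto a pass pipeline, assuming a clap-based argument parser (the struct and field names are hypothetical):

```rust
use clap::Parser;

/// Hypothetical CLI surface matching the commands above; the real main.rs may differ.
#[derive(Parser)]
struct Cli {
    /// Source file to compile and run
    file: String,
    /// Print IR instead of executing
    #[arg(long)]
    ir: bool,
    /// Run the standard teaching pipeline
    #[arg(short = 'O')]
    optimize: bool,
    /// Custom comma-separated pass pipeline (overrides -O)
    #[arg(long)]
    passes: Option<String>,
}

fn pass_pipeline(cli: &Cli) -> Option<String> {
    // --passes wins; -O falls back to the teaching pipeline from this chapter.
    cli.passes.clone().or_else(|| {
        cli.optimize
            .then(|| "mem2reg,instcombine,dce,simplifycfg".to_string())
    })
}
```

The resulting `Option<String>` maps directly onto the `passes: Option<&str>` parameter of `jit_run_with_opts` (via `.as_deref()`).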
Testing Optimization
We verify that optimization produces correct results:
#[test]
fn test_optimization_pipeline() {
use thirdlang::compile_to_ir_with_opts;
let source = r#"
class Simple {
value: int
def __init__(self, v: int) {
self.value = v
}
def get(self) -> int {
return self.value
}
}
s = new Simple(100)
result = s.get()
delete s
result
"#;
// Get unoptimized IR
let unopt_ir = compile_to_ir_with_opts(source, None).unwrap();
// Get optimized IR
let opt_ir = compile_to_ir_with_opts(source, Some("mem2reg,dce,instcombine")).unwrap();
// Optimized IR should be shorter (fewer alloca instructions)
assert!(
opt_ir.len() < unopt_ir.len(),
"Optimized IR should be smaller"
);
// Unoptimized should have alloca for parameters
assert!(
unopt_ir.contains("alloca"),
"Unoptimized IR should have allocas"
);
}
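As a follow-up check, we can also assert that the pipeline reduces the number of allocas rather than just the overall text size. A sketch under the same assumptions, reusing `compile_to_ir_with_opts` from above:

```rust
#[test]
fn test_mem2reg_reduces_allocas() {
    use thirdlang::compile_to_ir_with_opts;

    let source = r#"
        class Simple {
            value: int
            def __init__(self, v: int) {
                self.value = v
            }
            def get(self) -> int {
                return self.value
            }
        }
        s = new Simple(100)
        result = s.get()
        delete s
        result
    "#;

    let unopt_ir = compile_to_ir_with_opts(source, None).unwrap();
    let opt_ir = compile_to_ir_with_opts(source, Some("mem2reg")).unwrap();

    // mem2reg promotes parameter and local slots, leaving fewer allocas behind.
    assert!(
        opt_ir.matches("alloca").count() < unopt_ir.matches("alloca").count(),
        "mem2reg should eliminate stack allocations"
    );
}
```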
Optimization Levels
LLVM provides preset pipelines:
| Level | Description |
|---|---|
| `default<O0>` | No optimization (verification only) |
| `default<O1>` | Light optimization |
| `default<O2>` | Standard optimization (recommended) |
| `default<O3>` | Aggressive optimization |
For teaching, we use `mem2reg,instcombine,dce,simplifycfg` so we can see what each pass does individually.
Summary
We added LLVM optimization:
- Created a `run_passes()` method that accepts a pipeline string
- Integrated optimization into the JIT runner
- Added `-O` and `--passes` CLI flags
The key insight: our naive codegen produces correct but inefficient IR. LLVM passes transform it into efficient code:
- `mem2reg` - eliminates stack allocations
- `instcombine` - merges redundant instructions
- `dce` - removes unused code
Try thirdlang --ir examples/counter.tl vs thirdlang --ir -O examples/counter.tl to see the difference!
Cross-References
- LLVM Code Generation - How we generate the initial IR
- From AST to IR - IR concepts from Secondlang