Open position
Engineering Architect, AI & HPC Compilation
Why this role exists
Every AI silicon company ships a compiler. Almost none of them ship a good one. The gap between what MLIR can represent and what custom hardware can execute is where performance goes to die — and most internal teams lack the compiler depth to close it. They build a functional pipeline that leaves 40% of the hardware’s capability unused, declare victory, and move on.
The same gap exists in HPC. Scientific codes — whether written in Fortran, C++, or increasingly Julia — push hardware to its theoretical limits. The compilation path from a high-level computation through MLIR or LLVM to machine code that saturates the memory subsystem and fills every vector lane is where the real performance engineering happens.
We don’t stop at functional. VRULL builds the MLIR and LLVM pipelines, the graph-lowering strategies, and the kernel optimisations that extract what the silicon was actually designed to deliver. We work at the boundary that most teams avoid: where ML framework semantics, HPC runtime requirements, compiler IR, and hardware constraints all collide.
This is compilation work that doesn’t exist in textbooks yet. Custom matrix extensions, non-standard data types, inference pipelines and simulation kernels that need to hit latency and throughput targets on hardware that’s still in simulation. AI-assisted workflows let you iterate at the speed the problem demands — exploring lowering strategies, generating kernel variants, prototyping passes — while your architectural understanding ensures the output is correct.
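To give a flavour of the kind of transformation this involves: a toy, purely illustrative sketch of tiling a matrix multiply, the sort of loop restructuring an MLIR lowering pass performs on the way from framework IR to hardware-shaped code. The tile size and both functions here are hypothetical examples, not VRULL's pipeline.

```python
def matmul_naive(A, B, M, N, K):
    """High-level semantics: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, M, N, K, T=2):
    """Same computation after a tiling 'lowering': each loop is split so a
    T x T tile can stay resident in a hypothetical register/cache budget."""
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, T):
        for j0 in range(0, N, T):
            for k0 in range(0, K, T):
                # Intra-tile loops; min() handles edge tiles when the
                # problem size is not a multiple of T.
                for i in range(i0, min(i0 + T, M)):
                    for j in range(j0, min(j0 + T, N)):
                        for k in range(k0, min(k0 + T, K)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

The two functions compute identical results; the tiled form is the one a compiler must prove equivalent and then map onto matrix extensions, vector lanes, and the memory hierarchy of the target.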
What you’ll do
- Design and build MLIR-based compilation pipelines for custom AI and HPC silicon — from framework IR to hardware-specific code generation
- Develop graph-lowering strategies and kernel optimisations that close the gap between what models and scientific codes need and what hardware provides
- Work with ISA design teams to prove that proposed extensions are compilable and that the compiler can actually exploit them — for both AI inference and HPC workloads
- Build the compilation infrastructure for matrix-computing extensions, custom instructions, and non-standard data types
- Bridge MLIR and LLVM: ensure that high-level optimisation decisions carry through to the backend code generation that matters
- Explore compilation paths for modern languages — Julia’s type-specialised compilation model is a natural fit for hardware-aware code generation
- Contribute to MLIR and LLVM upstream and maintain presence in both compiler communities
What we’re looking for
- Deep experience with MLIR and/or LLVM — dialects, passes, lowering pipelines, not just usage
- Understanding of AI framework internals (PyTorch, TensorFlow) and/or HPC runtime patterns — how computation is represented at the top of the stack and how that maps to hardware
- The ability to trace a performance problem from a model or simulation kernel through the compilation pipeline to the generated machine code
- Experience with at least one hardware target’s ISA at the level needed to write code generation
- Interest in modern language compilation — Julia, domain-specific languages, and the compilation models that break the Fortran/C++ duopoly in HPC
- Active engagement with the MLIR and LLVM communities — contributions, conference talks, working-group participation
What sets you apart
- Experience building compilation pipelines for custom or pre-silicon hardware
- Knowledge of quantisation, sparsity, mixed-precision compilation, or HPC-specific optimisations (stencil codes, FFT, sparse linear algebra)
- A track record of closing the gap between “functionally correct” and “actually fast” on real AI or HPC workloads
- Familiarity with Julia’s compiler internals or experience with Flang/gfortran on performance-critical HPC codes
- The ability to work at the intersection of domains that most engineers know only one of: ML/HPC frameworks, MLIR/LLVM infrastructure, and hardware architecture
Most AI compiler roles are about maintaining an existing pipeline. This one is about building pipelines for hardware that doesn’t exist yet — for workloads that span inference, training, and scientific simulation — and making them good enough that the hardware is worth building.
Interested in this role?
Send your CV and a note about why this role interests you to careers@vrull.eu.