Building Your First AI Model Inference Engine in Rust

Rust is rapidly gaining traction in the world of artificial intelligence (AI) and machine learning (ML) due to its performance, safety, and concurrency features. As of March 31, 2025, building an AI model inference engine in Rust is an exciting way to leverage these strengths for fast, reliable model deployment. Whether you’re a beginner dipping your toes into ML or an experienced developer seeking efficient inference, this ultra-detailed guide will walk you through the process step-by-step using three powerful Rust libraries: tract, onnxruntime, and burn.

Inference—the process of using a trained model to make predictions—is a critical step in deploying AI solutions. In this article, we’ll cover everything from setting up your Rust environment to running a pre-trained model with each library, complete with code examples, pros, cons, and tips. By the end, you’ll have a fully functional inference engine tailored to your needs. Let’s get started!

Why Use Rust for AI Model Inference?

Rust offers unique advantages for AI inference:

  • Performance: Near-C++ speed with zero-cost abstractions.
  • Safety: Memory safety guarantees eliminate common bugs.
  • Portability: Deploy on desktops, servers, or even WebAssembly.
  • Growing Ecosystem: Libraries like tract, onnxruntime, and burn are maturing rapidly.

Compared to Python, Rust shines in production environments where latency and resource efficiency matter. This guide focuses on three libraries:

  • Tract: A pure-Rust, lightweight ONNX inference engine.
  • ONNX Runtime (ort): A Rust wrapper for Microsoft’s high-performance ONNX Runtime.
  • Burn: A flexible, Rust-native deep learning framework for training and inference.

Prerequisites

Before diving in, ensure you have:

  • Rust Installed: Run curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh to install Rust (version 1.75+ recommended as of 2025).
  • Cargo: Rust’s package manager, included with Rust.
  • A Pre-trained Model: We’ll use an ONNX model (e.g., a simple MNIST digit classifier) downloadable from the ONNX Model Zoo.
  • Basic Rust Knowledge: Familiarity with structs, enums, and Cargo.toml.
  • Optional: NVIDIA GPU with CUDA for GPU acceleration (for onnxruntime).

Download the MNIST ONNX model from the ONNX Model Zoo (e.g., mnist-8.onnx) and place it in your project directory for this tutorial.

Step 1: Setting Up Your Rust Project

Create a new Rust project:

cargo new rust-ai-inference
cd rust-ai-inference

Edit Cargo.toml to include dependencies for all three libraries (we’ll use them selectively):

[package]
name = "rust-ai-inference"
version = "0.1.0"
edition = "2021"

[dependencies]
tract-onnx = "0.21"  # For tract
ort = "2.0"          # For onnxruntime (2.0 may still be a release candidate; pin the exact version you build against)
burn = { version = "0.13", features = ["ndarray"] }  # For burn (ONNX import is handled by the separate burn-import crate at build time)
ndarray = "0.15"     # For tensor handling
image = "0.24"       # For preprocessing MNIST images
anyhow = "1.0"       # For error handling in the examples

Run cargo build to fetch dependencies. This sets the stage for our inference engine.

Step 2: Preparing Input Data (MNIST Example)

For consistency, we’ll preprocess a sample MNIST image (28×28 grayscale digit) to use with all three libraries. Create a function in src/main.rs:

use ndarray::{Array, Array4};

fn preprocess_image(path: &str) -> Array4<f32> {
    // Load the image, convert it to 8-bit grayscale, and resize to the 28x28 input MNIST expects.
    let img = image::open(path).unwrap().to_luma8();
    let resized = image::imageops::resize(&img, 28, 28, image::imageops::FilterType::Nearest);
    // Build a (batch, channel, height, width) tensor and normalize pixel values to [0, 1].
    Array::from_shape_vec((1, 1, 28, 28), resized.into_raw())
        .unwrap()
        .mapv(|x| x as f32 / 255.0)
}

Save a sample MNIST digit image (e.g., digit.png) in your project directory. This function loads, resizes, and normalizes the image into a 4D tensor (batch size, channels, height, width).
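Before wiring in a model, it is worth sanity-checking the preprocessing. A throwaway main (assuming the digit.png you just saved) can assert the shape and value range:

fn main() {
    let input = preprocess_image("digit.png");
    // The model expects NCHW input: batch size 1, 1 channel, 28x28 pixels.
    assert_eq!(input.shape(), &[1, 1, 28, 28]);
    // All pixel values should have been normalized into [0, 1].
    assert!(input.iter().all(|&v| (0.0..=1.0).contains(&v)));
    println!("Preprocessed tensor looks good: {:?}", input.shape());
}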

Option 1: Running Inference with Tract

What is Tract?

Tract is a pure-Rust library for running ONNX and TensorFlow models. It’s lightweight, doesn’t rely on external C++ dependencies, and supports WebAssembly, making it ideal for portable inference.

Step 3a: Load and Configure the Model

In src/main.rs, add this Tract-based inference code:

use tract_onnx::prelude::*;

fn run_tract_inference(model_path: &str, input: Array4<f32>) -> anyhow::Result<Vec<f32>> {
    // Load the ONNX model, pin the input type/shape, then optimize and make it runnable.
    let model = tract_onnx::onnx()
        .model_for_path(model_path)?
        .with_input_fact(0, InferenceFact::dt_shape(f32::datum_type(), tvec!(1, 1, 28, 28)))?
        .into_optimized()?
        .into_runnable()?;

    // Convert the ndarray input into a tract Tensor. This conversion relies on the external
    // ndarray version matching the one tract re-exports as tract_ndarray.
    let input_tensor: Tensor = input.into();
    let result = model.run(tvec!(input_tensor.into()))?;
    let output = result[0].to_array_view::<f32>()?.iter().copied().collect();
    Ok(output)
}

Step 4a: Execute Inference

Update main to use Tract:

fn main() -> anyhow::Result<()> {
    let input = preprocess_image("digit.png");
    let output = run_tract_inference("mnist-8.onnx", input)?;
    println!("Tract Output: {:?}", output);
    Ok(())
}

Run cargo run. The output is a 10-element vector of scores, one per digit 0-9 (for mnist-8 these are raw logits rather than probabilities). The index of the largest value is the predicted digit, as the helper below shows.
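A small helper makes the prediction explicit; it is plain Rust over the returned Vec<f32> and works unchanged with the ort and Burn outputs later in the guide:

fn argmax(scores: &[f32]) -> usize {
    // The index of the largest score is the predicted digit for MNIST.
    scores
        .iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

// Usage inside main:
// let output = run_tract_inference("mnist-8.onnx", input)?;
// println!("Predicted digit: {}", argmax(&output));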

Pros and Cons of Tract

  • Pros: Pure Rust, no external dependencies, WebAssembly support.
  • Cons: Limited hardware acceleration (CPU-only), smaller community.

Option 2: Running Inference with ONNX Runtime (ort)

What is ONNX Runtime?

ONNX Runtime (via the ort crate) is a Rust wrapper for Microsoft’s high-performance inference engine. It supports CPU and GPU acceleration (e.g., CUDA), making it ideal for performance-critical applications.

Step 3b: Load and Configure the Model

Add this ort-based inference code:

use ort::session::{builder::GraphOptimizationLevel, Session};

fn run_ort_inference(model_path: &str, input: Array4<f32>) -> anyhow::Result<Vec<f32>> {
    // ort 2.0 replaced the 1.x Environment type with a global ort::init();
    // module paths and method names can differ slightly between 2.0 releases.
    ort::init().with_name("inference").commit()?;

    let session = Session::builder()?
        .with_optimization_level(GraphOptimizationLevel::Level3)?
        .with_intra_threads(1)?
        .commit_from_file(model_path)?;

    // Wrap the ndarray input as an ONNX Runtime tensor (requires ort's "ndarray" feature).
    let input_tensor = ort::value::Tensor::from_array(input)?;
    let outputs = session.run(ort::inputs!["data" => input_tensor]?)?;
    let output = outputs["Plus214_Output_0"]
        .try_extract_tensor::<f32>()?
        .iter()
        .copied()
        .collect();
    Ok(output)
}

Note: Replace "data" and "Plus214_Output_0" with your model’s input/output names (check using Netron).
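If you would rather not open Netron, you can also list the names straight from the loaded session. This is a sketch against the ort 2.0 session metadata (the inputs/outputs fields and their name member are assumptions to confirm for your ort version):

fn print_io_names(session: &ort::session::Session) {
    // Print the tensor names the model expects and produces.
    for input in &session.inputs {
        println!("input:  {}", input.name);
    }
    for output in &session.outputs {
        println!("output: {}", output.name);
    }
}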

Step 4b: Execute Inference

Update main for ort:

fn main() -> anyhow::Result<()> {
    let input = preprocess_image("digit.png");
    let output = run_ort_inference("mnist-8.onnx", input)?;
    println!("ort Output: {:?}", output);
    Ok(())
}

For GPU support, ensure CUDA is installed, enable the ort crate's cuda feature in Cargo.toml, and register the CUDA execution provider on the session builder, as sketched below.
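A sketch of what that looks like with ort 2.0; the execution_providers module path and the .build() call are assumptions to verify against your ort version:

use ort::execution_providers::CUDAExecutionProvider;

// Reuses the Session/GraphOptimizationLevel imports from run_ort_inference above.
let session = Session::builder()?
    // Register CUDA first; ort falls back to the CPU provider if CUDA cannot be initialized.
    .with_execution_providers([CUDAExecutionProvider::default().build()])?
    .with_optimization_level(GraphOptimizationLevel::Level3)?
    .commit_from_file("mnist-8.onnx")?;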

Pros and Cons of ONNX Runtime

  • Pros: High performance, GPU support, broad model compatibility.
  • Cons: External C++ dependency, steeper setup for GPU.

Option 3: Running Inference with Burn

What is Burn?

Burn is a Rust-native deep learning framework offering flexibility for both training and inference. It supports ONNX models and multiple backends (e.g., CPU, GPU via WGPU), making it versatile for research and deployment.

Step 3c: Load and Configure the Model

Burn takes a different route from tract and ort: rather than loading the ONNX file at runtime, the companion burn-import crate translates it into Rust source while your project compiles, and your code then calls the generated module like any hand-written Burn model.
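A minimal build.rs sketch, assuming burn-import = { version = "0.13", features = ["onnx"] } is added under [build-dependencies] (crate and feature names per the burn-import docs; verify against your burn version):

// build.rs
use burn_import::onnx::ModelGen;

fn main() {
    // Generate Rust source for the model into OUT_DIR at compile time.
    ModelGen::new()
        .input("mnist-8.onnx") // path to the ONNX file in the project root
        .out_dir("model/")     // generated code is written under OUT_DIR/model/
        .run_from_script();
}

With the generated module in place, the inference function is ordinary Burn code: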

use burn::backend::NdArray;
use burn::tensor::{Data, Shape, Tensor};

// Pull in the code that burn-import generated from the ONNX file.
// The file name follows the ONNX file name; check OUT_DIR if yours differs.
mod model {
    include!(concat!(env!("OUT_DIR"), "/model/mnist-8.rs"));
}

fn run_burn_inference(input: Array4<f32>) -> anyhow::Result<Vec<f32>> {
    type Backend = NdArray;
    let device = Default::default();
    // The generated Model is an ordinary Burn module; default() loads the recorded weights.
    let model = model::Model::<Backend>::default();
    let data = Data::new(input.into_raw_vec(), Shape::new([1, 1, 28, 28]));
    let input_tensor = Tensor::<Backend, 4>::from_data(data.convert(), &device);
    let output = model.forward(input_tensor);
    // Flatten the output tensor back into a plain Vec<f32>.
    Ok(output.into_data().convert::<f32>().value)
}

For GPU, swap the backend type alias to burn::backend::Wgpu and enable the feature in Cargo.toml: burn = { version = "0.13", features = ["wgpu"] }.

Step 4c: Execute Inference

Update main for Burn:

fn main() -> anyhow::Result<()> {
    let input = preprocess_image("digit.png");
    let output = run_burn_inference(input)?; // the model path is fixed at build time by burn-import
    println!("Burn Output: {:?}", output);
    Ok(())
}

Pros and Cons of Burn

  • Pros: Rust-native, flexible backends, training support.
  • Cons: Younger ecosystem, WGPU setup complexity.

Step 5: Comparing Results

Run all three implementations and compare outputs. For MNIST, each should predict the same digit (e.g., a “7” produces its largest score at index 7). The raw values may differ slightly due to floating-point precision or graph optimization, but the predicted digits should agree.
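A quick way to check is to run all three back to back on the same preprocessed tensor and print each predicted digit with the argmax helper from earlier (a sketch that assumes all three functions are compiled into the same binary):

fn main() -> anyhow::Result<()> {
    let input = preprocess_image("digit.png");

    let tract_out = run_tract_inference("mnist-8.onnx", input.clone())?;
    let ort_out = run_ort_inference("mnist-8.onnx", input.clone())?;
    let burn_out = run_burn_inference(input)?;

    // The raw scores may differ slightly, but the argmax should agree.
    println!("tract predicts {}", argmax(&tract_out));
    println!("ort   predicts {}", argmax(&ort_out));
    println!("burn  predicts {}", argmax(&burn_out));
    Ok(())
}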

Step 6: Optimizing Your Inference Engine

  • Tract: Use .into_optimized() for graph optimizations.
  • ort: Enable GraphOptimizationLevel::Level3 and GPU execution providers.
  • Burn: Switch to the WGPU backend or profile with cargo flamegraph.

Profile performance with cargo run --release and tools like perf on Linux.
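For a rough first measurement before reaching for perf or flamegraphs, wrapping the run call with std::time::Instant in a --release build already tells you a lot. Warm-up runs are included because the first inference often pays one-time initialization costs:

use std::time::Instant;

fn benchmark<F: FnMut() -> anyhow::Result<Vec<f32>>>(name: &str, mut run: F) -> anyhow::Result<()> {
    // Warm up: the first calls often include graph optimization or allocator setup.
    for _ in 0..5 {
        run()?;
    }
    let start = Instant::now();
    let iterations = 100;
    for _ in 0..iterations {
        run()?;
    }
    println!("{name}: {:?} per inference", start.elapsed() / iterations);
    Ok(())
}

// Usage (a real benchmark would load the model once outside the closure):
// benchmark("tract", || run_tract_inference("mnist-8.onnx", preprocess_image("digit.png")))?;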

Step 7: Deploying Your Engine

  • Desktop/Server: Package with cargo build --release.
  • Web: Use Tract with WebAssembly (wasm-pack).
  • Embedded: Cross-compile with Burn or Tract for ARM.

Conclusion: Choosing Your Inference Path

  • Tract: Best for lightweight, portable inference without external dependencies.
  • ONNX Runtime: Ideal for high-performance needs with GPU support.
  • Burn: Perfect for Rust enthusiasts wanting flexibility and future-proofing.

This ultra-long guide has equipped you to build your first AI model inference engine in Rust using tract, onnxruntime, or burn. Experiment with each to find your fit, and happy coding in 2025!
