
On-Device Machine Learning

Running Small Language Models Locally with ExecuTorch

Learn to deploy lightweight generative AI models on-device using the ExecuTorch framework for private and offline AI interactions.

AI & ML · Intermediate · 12 min read

The Shift to Edge-Based Generative AI

Modern software development is witnessing a significant transition from centralized cloud-based AI to local, on-device execution. While cloud providers offer massive compute resources, they introduce unavoidable latency and significant recurring costs for every inference request. By moving generative models directly onto mobile hardware, developers can provide instantaneous responses and reduce their reliance on expensive backend infrastructure.

Privacy is another primary driver for this architectural shift. When AI processing happens locally, sensitive user data never leaves the physical device, which makes it far easier to satisfy strict data sovereignty regulations. This approach eliminates the risks associated with transmitting personal information over the network and storing it on third-party servers.

ExecuTorch emerges as the key framework for bridging the gap between flexible research models and strict mobile environments. It provides a specialized runtime and deployment path for PyTorch models, specifically optimized for the unique constraints of mobile CPUs, GPUs, and NPUs. This ensures that even complex generative architectures can run efficiently on a battery-powered device.

The transition to on-device AI is not just about performance; it is a fundamental shift in how we handle user trust and operational scalability in the era of large language models.

Solving the Latency and Connectivity Gap

Users expect seamless interactions with AI features, such as real-time text completion or image generation. Round-trip network requests can take hundreds of milliseconds or even seconds, leading to a fragmented user experience. Local inference removes the network bottleneck entirely, allowing for fluid and interactive application interfaces.

Furthermore, local execution enables offline functionality, which is critical for mobile applications used in areas with poor connectivity. Whether a user is on a plane or in a remote location, the AI features remain fully functional without requiring an active internet connection. This reliability ensures that your application provides a consistent value proposition regardless of external conditions.

Reducing Operational Overhead

Scaling a cloud-based generative AI service requires massive investments in GPU clusters and load balancing. As your user base grows, the cost of inference scales linearly, which can become prohibitively expensive for many startups and enterprises. On-device machine learning shifts the computational burden to the hardware already owned by the user.

By leveraging the client's local processing power, developers can offer AI-powered features without worrying about server-side scaling issues. This model changes the economics of AI deployment, making it possible to provide sophisticated generative features for a one-time development cost rather than a continuous operational expense.

Understanding the ExecuTorch Architecture

ExecuTorch is designed from the ground up to address the limitations of mobile environments, such as restricted memory and power envelopes. Unlike the standard PyTorch runtime used in servers, ExecuTorch uses a highly modular and lightweight approach to execute models. It separates the model preparation phase from the execution phase to minimize the footprint on the target device.

The framework relies on a new export mechanism, torch.export, which captures the computational graph of a model in a stable format. This graph is then transformed and optimized for the target hardware using backend delegates. These delegates allow the model to run on specialized hardware such as Apple Silicon or Qualcomm Hexagon DSPs for maximum efficiency.

A critical component of this architecture is the Ahead-of-Time compilation process. Instead of interpreting the model at runtime, ExecuTorch performs as much work as possible during the build phase. This results in faster startup times and more predictable memory usage, both essential for a responsive experience on a mobile operating system.

  • Minimal Runtime Footprint: The core library is stripped of unnecessary components to save binary size.
  • High Portability: Supports various mobile platforms through a consistent C++ API.
  • Memory Efficiency: Uses specialized allocators to manage tensor memory without fragmentation.
  • Hardware Acceleration: Direct integration with mobile NPUs and GPUs through specialized delegates.

The Compilation Pipeline

The journey of a model from a research environment to a mobile device involves several distinct stages. First, the PyTorch model is exported into a standardized graph representation that captures all operations and dependencies. This stage ensures that the dynamic nature of Python is resolved into a static format suitable for embedded execution.

Next, the graph undergoes a series of optimizations, such as operator fusion and constant folding, to reduce computational complexity. Once the graph is optimized, it is converted into a Flatbuffer-based file format with the .pte extension. This file contains the model weights and the execution plan, ready to be loaded by the ExecuTorch runtime on the device.

Memory Management Strategies

Memory is the most constrained resource on mobile devices, especially when dealing with large generative models. ExecuTorch employs a memory-mapped approach for loading model weights, which avoids loading the entire model into RAM at once. This allows the operating system to manage memory pages efficiently and reduces the likelihood of the app being terminated for excessive memory usage.

The runtime also uses a pre-allocated memory plan for intermediate tensors, known as the workspace. By calculating the exact memory requirements at compile time, the framework eliminates the need for dynamic allocations during inference. This deterministic behavior prevents memory fragmentation and ensures that the application remains stable over long periods of use.
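The idea behind a pre-allocated memory plan can be sketched in a few lines of plain Python (a simplification of what the real planner does): given each intermediate tensor's size and lifetime, a greedy first-fit pass assigns fixed offsets so tensors with disjoint lifetimes share the same workspace bytes.

```python
def plan_workspace(tensors):
    """Assign fixed offsets to intermediate tensors ahead of time.

    tensors: list of (name, size_bytes, first_step, last_step).
    Returns ({name: offset}, total_workspace_size).
    """
    offsets, total = {}, 0
    live = []  # (offset, size, last_step) of currently occupied slots
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        live = [s for s in live if s[2] >= first]  # reclaim dead tensors
        used = sorted((off, sz) for off, sz, _ in live)
        # first-fit: slide past occupied regions until a gap fits
        offset = 0
        for off, sz in used:
            if offset + size <= off:
                break
            offset = max(offset, off + sz)
        offsets[name] = offset
        live.append((offset, size, last))
        total = max(total, offset + size)
    return offsets, total

plan, size = plan_workspace([
    ("a", 1024, 0, 1),  # live during steps 0-1
    ("b", 1024, 2, 3),  # live during steps 2-3: can reuse a's slot
    ("c", 512, 1, 2),   # overlaps both in time
])
print(plan, size)  # a and b share offset 0; total workspace is 1536 bytes
```

Because the plan is computed once at export time, the runtime performs no dynamic allocation during inference, which is exactly what makes its memory behavior deterministic.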

Model Optimization and Quantization

Generative AI models, particularly large language models, are often too big to fit into the memory of a standard smartphone. To deploy these models effectively, developers must use optimization techniques like quantization to reduce the model size. Quantization converts the high-precision weights of a model into lower-precision formats like 8-bit or 4-bit integers.

This process significantly reduces the storage footprint and memory bandwidth required during execution. For instance, a model that takes 4GB in 32-bit floats can be compressed to roughly 500MB using 4-bit quantization. While there is a slight trade-off in model accuracy, modern quantization algorithms are sophisticated enough to minimize this impact for most practical applications.
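The arithmetic behind quantization is straightforward to demonstrate. Below is a minimal pure-Python sketch of symmetric per-tensor int8 quantization (real frameworks use per-channel scales and calibrated ranges, but the core idea is the same): each float weight is mapped to an integer in [-127, 127] plus one shared scale.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w ~= q * scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.52, -1.27, 0.031, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each weight now needs 1 byte instead of 4, and the round-trip
# error per element is bounded by scale / 2.
```

The 4x storage saving (8-bit vs 32-bit) comes directly from this representation; 4-bit schemes extend the same idea with a smaller integer range and grouped scales.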

ExecuTorch provides a robust quantization API that integrates directly with the export pipeline. Developers can choose between Post-Training Quantization for simplicity or Quantization-Aware Training for better accuracy. This flexibility allows for fine-tuning the balance between performance, size, and model quality based on the specific requirements of the application.

Quantizing a Model for Mobile Deployment

```python
import torch
from torch.export import export
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

# Define a simple generative module wrapper
class LightweightGenModel(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, tokens):
        # Delegate to the wrapped model for local inference
        return self.model(tokens)

# Load pre-trained weights
raw_model = load_pretrained_llama_small()
wrapper = LightweightGenModel(raw_model).eval()

# Capture the computational graph before quantizing
example_input = (torch.zeros((1, 128), dtype=torch.long),)
captured = export(wrapper, example_input).module()

# Apply 8-bit quantization using the PT2E workflow; this shrinks
# the weight footprint by roughly 4x relative to 32-bit floats
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared_model = prepare_pt2e(captured, quantizer)
prepared_model(*example_input)  # calibrate on representative data
quantized_model = convert_pt2e(prepared_model)

# Re-export the quantized graph for lowering to ExecuTorch
exported_program = export(quantized_model, example_input)
# Save the optimized graph for mobile use
```

Weight Pruning and Sparsity

In addition to quantization, pruning is an effective technique for reducing the complexity of generative models. Pruning involves identifying and removing redundant parameters that contribute little to the model output. This results in a sparser model that requires fewer computations and less memory to execute.

When combined with quantization, pruning can lead to dramatic improvements in inference speed on mobile hardware. ExecuTorch supports sparse kernels that can take advantage of these patterns to skip unnecessary mathematical operations. This optimization is particularly beneficial for transformer-based architectures commonly used in generative AI.
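A minimal sketch of magnitude pruning makes the mechanism concrete (real pipelines prune structured groups and fine-tune afterwards, but the principle is the same): the smallest-magnitude weights are zeroed, and sparse kernels then skip the multiply-accumulates for those zeros.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of the weights."""
    k = int(len(weights) * sparsity)  # number of weights to drop
    keep = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[k:])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]

def sparse_dot(weights, activations):
    """Skip multiply-accumulates for pruned (zero) weights."""
    return sum(w * a for w, a in zip(weights, activations) if w != 0.0)

w = prune_by_magnitude([0.8, -0.05, 0.4, 0.02, -0.6, 0.1], sparsity=0.5)
print(w)  # [0.8, 0.0, 0.4, 0.0, -0.6, 0.0]
```

With 50% sparsity, half the dot-product work disappears, which is where the inference speedups on sparse-aware kernels come from.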

Backend Delegation for Speed

Modern mobile chips contain specialized hardware designed for matrix multiplication, which is the core operation in machine learning. ExecuTorch uses delegates to offload these computations to the most efficient hardware available on the device. For example, on iOS devices, the CoreML delegate can be used to leverage the Apple Neural Engine.

By delegating tasks to specific hardware, the CPU is freed up for other application logic, and power consumption is significantly reduced. This is vital for maintaining battery life during prolonged AI interactions. Developers can specify multiple delegates to ensure the best possible performance across a wide range of devices and operating systems.
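The fallback logic behind multi-delegate deployment can be sketched as a simple routing table. The backend names and operator sets below are hypothetical illustrations, not real ExecuTorch identifiers: each operator goes to the most preferred delegate that supports it, with the CPU as the universal fallback.

```python
# Hypothetical capability table: which operators each delegate supports
DELEGATES = {
    "npu": {"matmul", "softmax", "layernorm"},
    "gpu": {"matmul", "softmax"},
    "cpu": {"matmul", "softmax", "layernorm", "embedding"},  # full coverage
}

def assign_backends(ops, preference=("npu", "gpu", "cpu")):
    """Route each graph operator to the first delegate that supports it."""
    plan = {}
    for op in ops:
        plan[op] = next(d for d in preference if op in DELEGATES[d])
    return plan

plan = assign_backends(["matmul", "layernorm", "embedding"])
print(plan)  # {'matmul': 'npu', 'layernorm': 'npu', 'embedding': 'cpu'}
```

In the real framework this partitioning happens at export time, so the .pte file already records which subgraphs run on which backend.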

Implementing the Inference Pipeline

Deploying the model on the device requires a robust C++ environment to interact with the ExecuTorch runtime. The process starts by loading the exported .pte file into a memory-mapped buffer. This allows the runtime to access the model structure and weights directly from storage without an expensive copy operation.

Once the model is loaded, developers must initialize the execution environment and provide the necessary memory allocators. The runtime uses a Method object to represent a specific entry point in the model graph. By invoking this method with input tensors, the application can trigger the inference process and receive the generated results.

Handling the input and output tensors involves converting native data types into the EValue format used by ExecuTorch. For a generative text model, this typically means converting user input into token IDs using a tokenizer before passing them to the model. The output tokens are then decoded back into human-readable text for display in the user interface.
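The encode-infer-decode flow can be illustrated with a toy word-level tokenizer in Python (a real application would use the model's own tokenizer, such as a SentencePiece vocabulary; the vocabulary below is invented for illustration):

```python
# Toy vocabulary for illustration only; real tokenizers use subword units
VOCAB = {"<unk>": 0, "hello": 1, "local": 2, "ai": 3, "!": 4}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def encode(text):
    """Map user text to the token IDs the model consumes."""
    return [VOCAB.get(tok, 0) for tok in text.lower().split()]

def decode(token_ids):
    """Map generated token IDs back to human-readable text."""
    return " ".join(ID_TO_TOKEN[i] for i in token_ids)

ids = encode("hello local ai !")
print(ids)          # [1, 2, 3, 4]
print(decode(ids))  # hello local ai !
```

In the C++ pipeline below, the encoded IDs would populate the input tensor, and the model's output tokens would be passed back through the decoding step.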

Executing Inference in C++

```cpp
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>
#include <executorch/runtime/platform/log.h>

using namespace executorch::extension;

void run_local_inference(const char* model_path) {
    // Module memory-maps the .pte file and manages the data loader
    // and memory allocators for the underlying Method internally
    Module module(model_path);

    // Prepare an input tensor with tokenized user data
    int64_t input_tokens[] = {1, 512, 1024, 7};
    auto input_tensor =
        from_blob(input_tokens, {1, 4}, executorch::aten::ScalarType::Long);

    // Execute the model's 'forward' method synchronously
    auto result = module.forward(input_tensor);
    if (!result.ok()) {
        ET_LOG(Error, "Inference failed while executing 'forward'");
        return;
    }

    // Process results
    auto output = result->at(0).toTensor();
    ET_LOG(Info, "Inference successful, generated %zd output elements",
           static_cast<ssize_t>(output.numel()));
}
```

Integrating with Native Apps

Integrating the C++ inference engine into a mobile application requires a bridge to the native platform language, such as Swift for iOS or Kotlin for Android. This is typically achieved using the Java Native Interface (JNI) on Android or Objective-C++ wrappers on iOS. These wrappers expose a simplified API to the application layer, hiding the complexities of the underlying C++ runtime.

It is important to run the inference process on a background thread to avoid blocking the main UI thread. Generative tasks, even when optimized, can take several milliseconds to complete. By using asynchronous execution patterns, the application remains responsive, and developers can provide progress updates or streaming outputs to the user.
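The asynchronous pattern is platform-specific (coroutines on Kotlin, GCD or Swift concurrency on iOS), but the shape of it can be sketched in Python with a worker thread and a completion callback. The `generate` function here is a stand-in for the real inference call:

```python
import concurrent.futures
import time

def generate(prompt):
    """Stand-in for a slow on-device inference call."""
    time.sleep(0.05)
    return f"reply to: {prompt}"

# A single worker keeps inference off the UI thread and serializes requests
executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def request_completion(prompt, on_done):
    """Submit inference and deliver the result via a callback."""
    future = executor.submit(generate, prompt)
    future.add_done_callback(lambda f: on_done(f.result()))

results = []
request_completion("hi", results.append)
executor.shutdown(wait=True)  # in a real app, the UI loop keeps running
print(results)  # ['reply to: hi']
```

Streaming token-by-token output follows the same shape: the worker invokes the callback once per decoded token instead of once at the end.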

Handling Model Versioning

As models are updated and improved, managing model versions on-device becomes a critical task. Applications should include a mechanism for downloading and verifying new model files without requiring a full app update. This allows for rapid iteration on the AI features and ensures that users always have access to the latest optimizations.

Checksum verification should be used to ensure the integrity of the downloaded model files before they are loaded by ExecuTorch. Additionally, the application should be able to fall back to a default embedded model if a download fails. This ensures that the core AI functionality remains available even in the face of network or storage issues.
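A minimal sketch of this verify-then-fallback logic, using a SHA-256 digest published alongside the model (the function names here are illustrative, not part of any ExecuTorch API):

```python
import hashlib

def verify_model(data, expected_sha256):
    """Check a downloaded .pte blob against its published checksum."""
    return hashlib.sha256(data).hexdigest() == expected_sha256

def select_model(downloaded, expected_sha256, embedded_default):
    """Prefer the downloaded model, but never load an unverified one."""
    if downloaded is not None and verify_model(downloaded, expected_sha256):
        return downloaded
    return embedded_default  # safe fallback shipped inside the app bundle

good = b"model-v2"
digest = hashlib.sha256(good).hexdigest()
print(select_model(good, digest, b"model-v1"))         # uses the new model
print(select_model(b"corrupted", digest, b"model-v1")) # falls back safely
```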

Performance Profiling and Debugging

Optimizing on-device AI is an iterative process that requires careful profiling of the inference pipeline. Developers need to monitor metrics such as time-to-first-token, total inference latency, and peak memory usage. These metrics provide insights into where the bottlenecks are and which parts of the model require further optimization.
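A simple profiler for these metrics can be sketched directly, assuming a token-by-token decode loop (the class below is an illustration, not part of the ExecuTorch tooling):

```python
import time

class GenerationProfiler:
    """Record time-to-first-token and total latency for one generation."""

    def __init__(self):
        self.start = time.perf_counter()
        self.first_token_at = None
        self.tokens = 0

    def on_token(self):
        if self.first_token_at is None:
            self.first_token_at = time.perf_counter()
        self.tokens += 1

    def report(self):
        total = time.perf_counter() - self.start
        return {
            "time_to_first_token_s": self.first_token_at - self.start,
            "total_latency_s": total,
            "tokens_per_second": self.tokens / total if total else 0.0,
        }

profiler = GenerationProfiler()
for _ in range(5):      # stand-in for the model's decode loop
    time.sleep(0.01)
    profiler.on_token()
stats = profiler.report()
```

Tracking these numbers across devices and model versions is what turns optimization from guesswork into a measurable process.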

ExecuTorch includes built-in profiling tools that can track the execution time of individual operators within the graph. This level of granularity allows developers to identify slow operations that might not be well-supported by the target hardware delegate. In some cases, replacing a specific operation with a more mobile-friendly alternative can lead to significant performance gains.

Debugging on-device can be challenging due to the limited visibility into the hardware state. Using the ExecuTorch Inspector tool, developers can compare the outputs of the on-device model with a reference implementation running on a desktop. This helps in identifying accuracy regressions that might have been introduced during the quantization or optimization stages.

  • Operator Coverage: Ensure all model operations have a corresponding implementation in the chosen backend.
  • Memory Fragmentation: Monitor long-running sessions to ensure memory is properly reclaimed.
  • Thermal Throttling: Check if heavy AI usage causes the device to heat up and slow down performance.
  • Numerical Stability: Verify that low-precision quantization does not lead to diverging or nonsensical outputs.

Thermal and Power Management

Running intensive machine learning models can consume significant power, leading to battery drain and thermal throttling. Throttling occurs when the device reduces the clock speed of its processors to prevent overheating, which can drastically impact AI performance. Developers must balance the complexity of the model with the thermal constraints of the device.

To mitigate these issues, applications can use adaptive inference strategies, such as switching to a smaller model when the battery is low or the device is warm. Implementing a 'cool-down' period between heavy inference tasks can also help maintain a consistent performance profile. Monitoring device state through platform APIs allows for more intelligent resource management.
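The selection policy itself is simple; the platform-specific part is reading the thermal and battery signals. A sketch, with invented model names and assuming the device state is already available from platform APIs:

```python
def pick_model(thermal_state, battery_pct):
    """Choose a model variant based on device conditions.

    thermal_state uses iOS-style levels ("nominal", "fair",
    "serious", "critical"); the model names are hypothetical.
    """
    if thermal_state in ("serious", "critical"):
        return "tiny-4bit"   # minimal compute while the device cools down
    if battery_pct < 20:
        return "small-4bit"  # trade quality for power on low battery
    return "small-8bit"      # best quality when there is headroom

print(pick_model("nominal", 80))  # small-8bit
print(pick_model("serious", 80))  # tiny-4bit
print(pick_model("nominal", 10))  # small-4bit
```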

Edge Case Handling

Real-world mobile environments present many edge cases, such as insufficient disk space for model storage or unexpected hardware configurations. The application should gracefully handle these scenarios by providing informative error messages and alternative functionality. For example, if a device lacks a compatible NPU, the runtime should automatically fall back to the CPU.

Robust error handling is especially important during the model loading phase. If a .pte file is corrupted or incompatible with the current runtime version, the application must prevent a crash and attempt to recover. Building these safety nets into the architecture ensures a high-quality user experience even on the diverse and unpredictable hardware of the mobile ecosystem.
