Motivation

Large language models (LLMs) with mixture-of-experts (MoE) architectures demonstrate impressive performance across diverse tasks. Thanks to their computational sparsity, they are more expressive for a fixed number of activated experts and also more efficient during training and inference.

Although many efforts have been made to accelerate MoE inference in cloud environments, where user requests are processed in batches, deploying MoE models on edge devices such as mobile phones or single-board computers remains challenging. The main reasons are the highly limited computation and memory resources on the edge and the widened ARI gap between prefilling and decoding in batch-size-1 scenarios, which together lead to long prefilling latency and low decoding throughput.

As hardware advances, edge devices increasingly ship with powerful accelerators such as neural processing units (NPUs), which offer tens of TOPS of integer computation at low power. Recent studies on accelerating dense LLM inference with mobile NPUs show exciting results, with prefilling speeds exceeding 1000 tokens/s. Despite these achievements, there is currently no inference system that supports MoE inference on commercial NPUs in edge devices. Motivated by this gap, we started this project to build a CPU-NPU heterogeneous inference system for MoE models.

Challenges

  1. Dynamic token routing: Before each MoE layer, a gating network routes tokens dynamically. This makes the execution path of the network vary and the assignment of tokens to experts unpredictable, which prevents static model tracing and graph compilation on NPUs.

  2. Vanishing computation sparsity during prefilling: During prefilling, multiple consecutive tokens are processed together while each token is routed differently. This leads to simultaneous activation of multiple experts, which places greater pressure on the already constrained memory resources of edge devices.

  3. High parameter movement overhead during decoding: For single-request autoregressive inference, the computation per decoding step is too small to amortize the cost of parameter movement, leading to substantial memory-traffic overhead. Traditional techniques such as parameter prefetching also fail because expert activation is uncertain.

Our design

  1. Expert-wise token buffering and CPU-NPU asynchronization: After chunk-wise attention computation, tokens are not fed into the MoE layers immediately. Instead, each token is placed in a buffer according to its routing destination. Every expert maintains its own buffer, and when a buffer is full, all the tokens in it are processed together. With this approach, the subgraphs for each expert, with their different shapes, can be compiled ahead of time, enabling efficient NPU execution.

  2. Subgraph caching and prefetching: We build a subgraph cache with LRU eviction to maximize subgraph reuse across prompt chunks. A simple threshold-based prefetching method is also used to overlap subgraph I/O with computation and hide its overhead.

  3. Routing pattern prediction and incremental preloading: In the decoding phase, we use a pre-trained lightweight routing-pattern predictor to speculate on the routing result of the current token. The result is an activation pattern that contains more experts than the ones actually activated. During attention computation, we incrementally load the predicted experts that are not yet in the cache, to reduce expert loading overhead as much as possible.
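
The buffering idea in item 1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names Token, ExpertBuffers, push, and drain are hypothetical, the flush callback stands in for launching a precompiled expert subgraph, and a single buffer capacity is used even though the real system may compile several shapes per expert.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical token record; not taken from the project's code.
struct Token {
  int id;        // position of the token in the prompt
  float weight;  // gate weight the router assigned for this expert
};

class ExpertBuffers {
 public:
  // Called with (expertId, tokens) whenever a batch is ready to execute.
  using FlushFn = std::function<void(int, const std::vector<Token>&)>;

  ExpertBuffers(int numExperts, std::size_t capacity, FlushFn flush)
      : capacity_(capacity),
        buffers_(static_cast<std::size_t>(numExperts)),
        flush_(std::move(flush)) {
    assert(numExperts > 0 && capacity > 0);
    for (auto& b : buffers_) b.reserve(capacity);
  }

  // Route one token to its expert; run the expert once its buffer is full.
  void push(int expertId, Token t) {
    auto& buf = buffers_.at(static_cast<std::size_t>(expertId));
    buf.push_back(t);
    if (buf.size() == capacity_) {
      flush_(expertId, buf);  // batch holds exactly `capacity_` tokens
      buf.clear();
    }
  }

  // At the end of the prompt, hand over partially filled buffers.
  // Assumption: the caller pads them or picks a smaller compiled shape.
  void drain() {
    for (std::size_t e = 0; e < buffers_.size(); ++e) {
      if (!buffers_[e].empty()) {
        flush_(static_cast<int>(e), buffers_[e]);
        buffers_[e].clear();
      }
    }
  }

 private:
  std::size_t capacity_;
  std::vector<std::vector<Token>> buffers_;
  FlushFn flush_;
};
```

In this sketch an expert only ever runs on a full buffer until drain() is called, which is what lets the batch shape be known before execution.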

Progress Record

2026.3.15: Inference Profiling of Qwen3-0.6B on QNN

We implement four distinct profiling mechanisms, and they are not interchangeable because some of them measure host-side CPU work in microseconds while others expose backend-native QNN and HTP events whose units can be microseconds, cycles, bytes, or counts. The key switch is the profiling handle created in QNNRuntime::initRuntime(), where the backend enables either basic or detailed QNN profiling:

if (ProfilingLevel::BASIC == profilingLevel) {
  qnnInterface.profileCreate(backendHandle, QNN_PROFILE_LEVEL_BASIC, &profileHandle);
} else if (ProfilingLevel::DETAILED == profilingLevel) {
  qnnInterface.profileCreate(backendHandle, QNN_PROFILE_LEVEL_DETAILED, &profileHandle);
}

That handle is then consumed later by contextCreateFromBinary, graph finalization, and graph execution, and the resulting events are harvested in extractBackendProfilingInfo().

What measures CPU time in microseconds

The host-side mechanism is the local wall-clock timer based on getTimestampInUs(), which wraps std::chrono::steady_clock and therefore always produces host microseconds rather than device cycles. This timing is used to bracket CPU work such as context binary reading, system context creation, metadata extraction, input tensor marshaling, output tensor unmarshaling, and even the overhead of extracting QNN profiling data itself. A representative example appears in QNNRuntime::retrieveContext():

auto stageStartUs = getTimestampInUs();
...
recordHostStage("QNNRuntime::retrieveContext/readBinary", stageStartUs);
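
For reference, a host-side timer of this kind could look roughly like the sketch below. Only the names getTimestampInUs and recordHostStage and the use of std::chrono::steady_clock come from the text above; the HostStage record and the HostStageLog class are hypothetical, and the real implementation may store different fields.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Monotonic host timestamp in microseconds. steady_clock never jumps
// backwards, so differences between two readings are valid durations.
inline std::uint64_t getTimestampInUs() {
  using namespace std::chrono;
  return static_cast<std::uint64_t>(
      duration_cast<microseconds>(steady_clock::now().time_since_epoch())
          .count());
}

// Hypothetical record type for one bracketed stage of host work.
struct HostStage {
  std::string name;
  std::uint64_t startUs;
  std::uint64_t durationUs;
};

// Hypothetical container for the recorded stages.
class HostStageLog {
 public:
  // Close a stage that began at `startUs` and store its duration.
  void recordHostStage(std::string name, std::uint64_t startUs) {
    const std::uint64_t nowUs = getTimestampInUs();
    assert(nowUs >= startUs);
    stages_.push_back({std::move(name), startUs, nowUs - startUs});
  }

  const std::vector<HostStage>& stages() const { return stages_; }

 private:
  std::vector<HostStage> stages_;
};
```

Because both endpoints come from the same monotonic clock, these durations are always host microseconds and should not be mixed with backend values reported in other units.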

What measures QNN backend time, and where cycles appear

The native QNN profiling events come from the backend profile handle and are fetched through profileGetEvents, profileGetEventData, profileGetExtendedEventData, and profileGetSubEvents, as implemented in fillCollectedProfileEvent() and collectProfileEventsRecursive(). These are genuine QNN-side events, not host timers, and their unit is decoded by profileEventUnitToString(), which explicitly supports “us”, “cycles”, “bytes”, “count”, and others.
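
The recursive collection step can be pictured with the following sketch. It does not call the QNN API: RawEvent is a hypothetical in-memory stand-in for an opaque backend event handle and its sub-events, and CollectedEvent and collectRecursive are illustrative names. The real code obtains the same information through profileGetEventData and profileGetSubEvents.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for a backend event; the real code queries opaque handles.
struct RawEvent {
  std::string name;
  std::string unit;  // e.g. "us", "cycles", "bytes", "count"
  std::uint64_t value;
  std::vector<RawEvent> subEvents;
};

// Flattened record that remembers how deep the event sat in the tree.
struct CollectedEvent {
  std::string name;
  std::string unit;
  std::uint64_t value;
  int depth;
};

// Depth-first walk: emit the event itself, then recurse into sub-events.
void collectRecursive(const RawEvent& ev, int depth,
                      std::vector<CollectedEvent>& out) {
  out.push_back({ev.name, ev.unit, ev.value, depth});
  for (const RawEvent& sub : ev.subEvents) {
    collectRecursive(sub, depth + 1, out);
  }
}
```

Keeping the unit next to every value is the important part: a parent event in microseconds can have children reported in cycles, so the two must never be summed blindly.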

The QNN events that report microseconds include generic execution phases such as EXECUTE_QUEUE_WAIT, EXECUTE_PREPROCESS, EXECUTE_DEVICE, and EXECUTE_POSTPROCESS, as well as HTP-specific events like HTP_GRAPH_EXECUTE_HOST_RPC_US, HTP_GRAPH_EXECUTE_HTP_RPC_US, HTP_GRAPH_EXECUTE_ACCEL_US, HTP_GRAPH_EXECUTE_MISC_ACCEL_US, and several yield- and resource-related timings, all mapped in profileEventTypeToString().

The QNN event that explicitly reports cycles is QNN_HTP_PROFILE_EVENTTYPE_GRAPH_EXECUTE_ACCEL_TIME_CYCLE, which is rendered as “HTP_GRAPH_EXECUTE_ACCEL_CYCLES” in profileEventTypeToString(). More generally, any event whose unit is QNN_PROFILE_EVENTUNIT_CYCLES is backend-native cycle data, and that is the only true cycle source in this file. In other words, cycles come from QNN/HTP profiling events, not from the host timers.

The detailed QNN profiling path can also enable optrace and event caps through profileSetConfig(), using environment variables such as MLLM_QNN_PROFILE_OPTRACE and MLLM_QNN_PROFILE_MAX_EVENTS. Functionally, this expands the granularity of backend events and controls how much profiling data is retained.
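
A hedged sketch of how such environment variables might be read is shown below. Only the two variable names come from the text above; the ProfileEnvConfig struct, the accepted boolean spellings, and the meaning of 0 as "no explicit cap" are assumptions, and the step that passes the values to profileSetConfig() is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical parsed configuration; field names are illustrative.
struct ProfileEnvConfig {
  bool optrace = false;
  std::uint32_t maxEvents = 0;  // 0 means "no explicit cap" in this sketch
};

// Treat "1", "true", and "on" as enabled (assumed convention).
inline bool parseBoolFlag(const char* raw) {
  if (raw == nullptr) return false;
  const std::string v(raw);
  return v == "1" || v == "true" || v == "on";
}

// Parse a plain decimal number; return `fallback` on any malformed input.
inline std::uint32_t parseUint(const char* raw, std::uint32_t fallback) {
  if (raw == nullptr || raw[0] < '0' || raw[0] > '9') return fallback;
  char* end = nullptr;
  const unsigned long v = std::strtoul(raw, &end, 10);
  if (*end != '\0' || v > UINT32_MAX) return fallback;
  return static_cast<std::uint32_t>(v);
}

// Read both variables from the process environment.
inline ProfileEnvConfig readProfileEnvConfig() {
  ProfileEnvConfig cfg;
  cfg.optrace = parseBoolFlag(std::getenv("MLLM_QNN_PROFILE_OPTRACE"));
  cfg.maxEvents = parseUint(std::getenv("MLLM_QNN_PROFILE_MAX_EVENTS"), 0);
  return cfg;
}
```

Falling back to defaults on malformed input keeps a mistyped variable from changing profiling behavior silently in an unexpected direction.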

What each profiling output is for

The textual summary is built in extractBackendProfilingInfo(), where the code prints summary events, aggregates per-node totals, and reports top node breakdowns. This is mainly for human-readable inspection of where execution time or cycles are accumulating, especially at the node level.
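
The per-node aggregation behind such a summary can be sketched as follows. NodeEvent, NodeTotal, and topNodes are hypothetical names, and the sketch assumes events have already been flattened into (node, unit, value) records. It filters by unit before summing so that microsecond and cycle values never end up in the same total.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical flattened profiling record for one node-level event.
struct NodeEvent {
  std::string node;
  std::string unit;  // e.g. "us" or "cycles"
  std::uint64_t value;
};

struct NodeTotal {
  std::string node;
  std::uint64_t total;
};

// Sum values per node for one unit only, then return the `topK` largest
// totals in descending order.
std::vector<NodeTotal> topNodes(const std::vector<NodeEvent>& events,
                                const std::string& unit, std::size_t topK) {
  std::map<std::string, std::uint64_t> totals;
  for (const NodeEvent& e : events) {
    if (e.unit == unit) totals[e.node] += e.value;
  }
  std::vector<NodeTotal> sorted;
  for (const auto& kv : totals) sorted.push_back({kv.first, kv.second});
  std::stable_sort(sorted.begin(), sorted.end(),
                   [](const NodeTotal& a, const NodeTotal& b) {
                     return a.total > b.total;
                   });
  if (sorted.size() > topK) sorted.resize(topK);
  return sorted;
}
```

Calling topNodes(events, "cycles", 10) and topNodes(events, "us", 10) separately would give two independent rankings, one per unit.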

The serialized QNN profiling log is created through createProfileSerializationTarget() and written through systemProfileSerializeEventData(...) in extractBackendProfilingInfo(). This output is QNN-native profiling data meant for downstream tooling or offline inspection, and it preserves the backend/system profiling structure rather than flattening everything into plain text.

The Chrome trace export is initialized in initializeChrometraceExport() and finalized in writeChrometraceFile(). Its role is to merge host spans and QNN events into one time-aligned timeline, assigning separate synthetic threads such as “Host”, “QNN Execute”, “QNN Device”, “QNN Node Depth N”, and “QNN Trace Depth N” through chrometraceThreadForMethod(), chrometraceThreadForEvent(), and chrometraceThreadName(). Functionally, this is the best mechanism in the file for understanding overlap and phase boundaries across host preparation, QNN submission, device execution, and backend internal stages.
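
To make the export format concrete, here is a minimal sketch of a writer for the Chrome Trace Event JSON format, which expresses timestamps and durations in microseconds. TraceSpan, TraceThread, jsonEscape, and buildChromeTrace are hypothetical names and do not mirror the project's functions. The sketch only handles spans that are already in microseconds; how cycle-valued events are placed on the timeline is not covered here.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// One complete span ("ph":"X") on a synthetic thread.
struct TraceSpan {
  std::string name;
  int tid;
  std::uint64_t startUs;
  std::uint64_t durationUs;
};

// Display name for a synthetic thread, e.g. "Host" or "QNN Device".
struct TraceThread {
  int tid;
  std::string name;
};

// Minimal escaping; handles double quotes and backslashes only.
inline std::string jsonEscape(const std::string& s) {
  std::string out;
  for (char c : s) {
    if (c == '"' || c == '\\') out.push_back('\\');
    out.push_back(c);
  }
  return out;
}

// Emit thread_name metadata events first, then the spans themselves.
std::string buildChromeTrace(const std::vector<TraceThread>& threads,
                             const std::vector<TraceSpan>& spans) {
  std::ostringstream os;
  os << "{\"traceEvents\":[";
  bool first = true;
  for (const TraceThread& t : threads) {
    os << (first ? "" : ",")
       << "{\"name\":\"thread_name\",\"ph\":\"M\",\"pid\":1,\"tid\":" << t.tid
       << ",\"args\":{\"name\":\"" << jsonEscape(t.name) << "\"}}";
    first = false;
  }
  for (const TraceSpan& s : spans) {
    os << (first ? "" : ",") << "{\"name\":\"" << jsonEscape(s.name)
       << "\",\"ph\":\"X\",\"pid\":1,\"tid\":" << s.tid
       << ",\"ts\":" << s.startUs << ",\"dur\":" << s.durationUs << "}";
    first = false;
  }
  os << "]}";
  return os.str();
}
```

Putting host spans and backend spans on different tid values under one pid is what makes them appear as separate, time-aligned rows in the trace viewer.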