ISPASS 2026  |  IEEE Int'l Symposium on Performance Analysis of Systems and Software

Characterizing State Space Model and
Hybrid Language Model Performance
with Long Context

University of California, Irvine  ·  Department of Electrical Engineering & Computer Science

Abstract

Emerging applications such as augmented reality (AR) are driving demand for machine intelligence that can process continuous, long-context inputs on local devices. However, the currently dominant models based on the Transformer architecture suffer from quadratic computational and memory overhead, which hinders applications that must process long contexts. This has spurred a paradigm shift toward new architectures such as State Space Models (SSMs) and SSM-Transformer hybrid models, which provide near-linear scaling.


Yet the system-level performance of these emerging architectures on resource-constrained hardware remains underexplored. To address this gap, we present a comprehensive, comparative benchmarking of carefully selected Transformer, SSM, and hybrid models, focused on long-context inference on consumer and embedded GPUs. Our analysis shows that SSMs are well-suited for long-context on-device AI on such platforms. While Transformers are up to 1.9× faster at short sequences (<8K tokens), SSMs demonstrate a dramatic performance inversion, becoming up to 4× faster at very long contexts (~57K tokens), thanks to their linear computational complexity and ~64% smaller memory footprint. Our operator-level analysis reveals that custom SSM kernels such as selective scan, despite being hardware-aware to minimize memory I/O, dominate inference runtime on edge platforms, accounting for over 55% of latency due to their sequential, element-wise nature. To foster further research, we open-source our characterization framework, SSM-Scope.

SSM Speed Advantage at Long Contexts

At ~57K tokens, SSMs outperform Transformers by up to 4× in TTFT, thanks to linear O(n) complexity.

💾

Extended Context Window

SSMs handle up to 220K tokens on a 24 GB consumer GPU — 4× beyond what optimized Transformers can reach without offloading.

🔋
75%

Energy Reduction

At 57K tokens, Mamba-2 (370 J) consumes ~75% less energy than Qwen2.5 (1492 J) thanks to linear compute scaling.

🔍
>55%

SSM Kernel Dominance on Edge

Selective-scan and related SSM-specific operators account for over 55% of total inference latency on Jetson edge platforms.

Why SSMs for Long-Context Inference?

At short sequence lengths, heavily optimized Transformer models such as Qwen2.5 outperform SSMs. As context length grows, however, the quadratic memory and compute cost of attention becomes the dominant bottleneck. The figure below captures this performance crossover on an RTX 4090, comparing Qwen2.5-0.5B (Transformer) and Mamba2-780m (SSM) across representative prefill and decode workloads.

TTFT and TPOT scaling comparison of Qwen2.5 and Mamba2
Figure 1: TTFT (a, b) and TPOT (c, d) scaling comparison of Qwen2.5-0.5B (Transformer) and Mamba2-780m (SSM). While Qwen is faster (1.9×) at shorter sequence lengths (<8K), Mamba2's superior scaling provides a significant performance advantage (2.65×–3×) at longer contexts for both prefill and decode stages (generation length 256, batch size 1).
  Key insight: Beyond ~8K tokens on a consumer RTX 4090, Mamba2 overtakes Qwen2.5 in both TTFT and TPOT. At 32K tokens, Mamba2 is 2.65× faster in TTFT and faster in TPOT as well, and it continues operating at 220K tokens, where the Transformer runs out of memory.
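The crossover follows directly from asymptotics: attention prefill cost grows roughly as 2·n²·d multiply-accumulates, while a selective-scan pass grows as n·d·N for state size N. A back-of-the-envelope sketch (the dimensions below are illustrative assumptions, not measured values):

```python
def attn_prefill_flops(n, d):
    # Scaled dot-product attention: QK^T and the attention-weighted V
    # product each cost ~n^2 * d multiply-accumulates.
    return 2 * n * n * d

def ssm_prefill_flops(n, d, state=64):
    # Selective scan: each of d channels updates a `state`-dim state per token.
    return n * d * state

# Illustrative hidden size; the ratio depends only weakly on it.
for n in (1_000, 8_000, 57_000):
    ratio = attn_prefill_flops(n, d=896) / ssm_prefill_flops(n, d=896)
    print(f"{n:>6} tokens: attention/SSM FLOP ratio ~ {ratio:.0f}x")
```

Because the ratio grows linearly in n, any constant-factor kernel advantage the Transformer holds at short lengths is eventually overtaken.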

Architectures & Accuracy–Latency Trade-offs

We evaluate three primary model families: (1) Transformers, which use scaled dot-product attention with KV-cache and O(n²) complexity; (2) SSMs (Mamba, Mamba-2), which use a selective state mechanism with O(n) complexity and constant O(1) memory during generation; and (3) Hybrid models (Zamba2, Hymba, Falcon-H1), which interleave attention and SSM layers to combine accuracy and efficiency.
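To make the selective state mechanism concrete, here is a minimal single-channel sketch of an S6-style scan. The input-dependent Δ, B, and C projections below are toy stand-ins for Mamba's learned linear projections, not the actual parameterization:

```python
import math

def selective_scan(x, a=-1.0):
    """Single-channel selective-scan sketch.

    State update: h_t = exp(dt_t * a) * h_{t-1} + dt_t * b_t * x_t
    Output:       y_t = c_t * h_t
    dt, b, c are input-dependent ("selective"); here they are toy
    functions of x_t rather than learned projections.
    """
    h, ys = 0.0, []
    for xt in x:                           # inherently sequential over tokens
        dt = math.log1p(math.exp(xt))      # softplus: input-dependent step size
        b = c = xt                         # toy input-dependent B and C
        h = math.exp(dt * a) * h + dt * b * xt
        ys.append(c * h)
    return ys
```

The per-token dependence of h_t on h_{t-1} is what makes this kernel sequential and element-wise rather than one large GEMM, which is why it behaves so differently from attention at the operator level.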

Transformer and SSM architecture background
Figure 2: (a) Basic building block of Transformers: a scaled dot-product attention module; (b) an S6 block showing the fundamental computation of SSMs; (c) overview of auto-regressive generation (prefill & decode) in LLMs.

While pure SSMs achieve the fastest inference, they sacrifice accuracy on knowledge-intensive tasks. Hybrid models like Falcon-H1 bridge this gap — achieving accuracy on par with or exceeding Transformer baselines while maintaining a 2.8× TTFT speedup.

Accuracy-latency efficiency frontier
Figure 3: Accuracy–latency efficiency frontier of Transformer (Qwen2.5-1.5B), SSM (Mamba2-1.3B), and Hybrid (Falcon-H1-1.5B) models at a 57K-token prefill stage on RTX 4090. Hybrid models achieve Pareto-optimal balance: Falcon-H1 matches/surpasses Transformer accuracy with 2.8× lower TTFT.

SSM-Scope: Characterization Framework

SSM-Scope is an open-source, backend-agnostic performance characterization framework for language models. It takes a workload configuration (model registry entry, sequence lengths, batch sizes) and profiling configuration as input, and produces end-to-end and operator-level breakdowns for latency, memory, and energy metrics.
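As an illustration of the kind of inputs the framework consumes, a workload and profiling configuration might look as follows. All field names here are illustrative assumptions for exposition, not SSM-Scope's actual schema:

```python
# Hypothetical configuration sketch; field names are illustrative,
# not SSM-Scope's actual schema.
workload = {
    "model": "state-spaces/mamba2-780m",  # model registry entry
    "seq_lens": [1024, 8192, 57344],      # prefill lengths to sweep
    "batch_sizes": [1],
    "gen_len": 256,
}
profiling = {
    "backend": "pytorch_profiler",        # or "trtexec" for TensorRT engines
    "metrics": ["latency", "memory", "energy"],
    "op_breakdown": True,                 # operator-level latency breakdown
}

# The sweep is the cross product of sequence lengths and batch sizes.
sweep = [(s, b) for s in workload["seq_lens"] for b in workload["batch_sizes"]]
print(f"{len(sweep)} configurations to profile")
```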

SSM-Scope characterization framework flow
Figure 4: Overall flow of the SSM-Scope characterization framework. The framework includes a flexible model registry, data preprocessing module, and a performance profiling engine that captures operator-level breakdown via PyTorch Profiler or TensorRT's trtexec.

Evaluation Platforms

Specification                GPU 1 — Consumer           GPU 2 — Edge
GPU Model                    NVIDIA RTX 4090            NVIDIA Jetson Orin Nano
Architecture                 Ada Lovelace               Ampere
Streaming Multiprocessors    128                        8
Compute Throughput           ~330 TFLOPS                ~20 TFLOPS
GPU Memory                   24 GB GDDR6X               8 GB LPDDR5 (shared)
Memory Bandwidth             1,008 GB/s                 68 GB/s
Host Interconnect            PCIe 4.0 ×16               Integrated on-chip

Jetson Orin Nano evaluated in MAXN power mode with 16 GB NVMe swap enabled.

Evaluated Models

Architecture    Family                 Parameter Sizes
Transformer     Qwen2.5                0.5B, 1.5B
                Phi-3-mini             3.82B
                Llama-3.2              1B
                TinyLlama / GPT-Neo    1.1B / 125M
SSM             Mamba                  130M, 370M, 790M, 1.4B, 2.8B
                Mamba-2                130M, 370M, 780M, 1.4B, 2.8B
Hybrid          Zamba2                 1.2B, 2.7B
                Hymba                  1.5B
                Falcon-H1              0.5B, 1.5B

Memory Footprint & Scaling Limits

We characterize peak GPU memory across sequence lengths from 1K to over 220K tokens. Transformers (Qwen2.5, Llama3.2) hit their out-of-memory (OOM) limits at 57K–65K tokens on the RTX 4090 due to the growing KV-cache. By contrast, SSMs carry no KV-cache and their activation memory scales linearly at a much lower rate, allowing Mamba and Mamba-2 (~790M) to run at up to 220K tokens within 24 GB. Falcon-H1 extends to ~164K tokens, far beyond Transformer models of comparable accuracy. On the edge Jetson platform, SSMs and hybrids again show a clear advantage, supporting contexts beyond 16K tokens where Transformers cannot operate.
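The asymmetry is easy to quantify: a Transformer's KV-cache grows linearly with context, while an SSM's recurrent state is a fixed-size tensor. A sketch with illustrative model dimensions (assumed for illustration, not the exact evaluated configurations):

```python
def kv_cache_bytes(n_tokens, layers, kv_heads, head_dim, dtype_bytes=2):
    # K and V each store one head_dim vector per token, layer, and KV head.
    return 2 * layers * n_tokens * kv_heads * head_dim * dtype_bytes

def ssm_state_bytes(layers, channels, state_dim, dtype_bytes=2):
    # The SSM recurrent state is fixed-size: independent of sequence length.
    return layers * channels * state_dim * dtype_bytes

# Illustrative small-Transformer dimensions (fp16, grouped-query KV heads):
for n in (8_192, 57_344, 220_000):
    kv = kv_cache_bytes(n, layers=24, kv_heads=2, head_dim=64)
    print(f"{n:>7} tokens: KV cache {kv / 2**20:.0f} MiB")

# Illustrative Mamba-scale dimensions: constant regardless of context.
state = ssm_state_bytes(layers=48, channels=1536, state_dim=64)
print(f"SSM state: {state / 2**20:.1f} MiB at any context length")
```

This captures only the cache/state term; prefill activations (also shown in Figure 5) add a linear component for both families, but at a far lower rate for SSMs.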

Memory footprint scaling across sequence lengths
Figure 5: Memory footprint of the prefill stage for Transformer, SSM, and Hybrid models on (a) consumer GPU (RTX 4090) and (b) edge GPU (Jetson Orin Nano). Stacked bars show weight, KV-cache, and activation memory. Bars stop when OOM is hit. SSMs operate at up to 4× longer context than the best-performing Transformer.

Energy Consumption & End-to-End Throughput

We measure total inference energy (J) and end-to-end token throughput (tokens/s) across contexts from 1K to 57K tokens on the RTX 4090, for three representative models: Qwen2.5-0.5B (Transformer), Mamba2-780m (SSM), and Falcon-H1-0.5B (Hybrid).

At short sequences (<16K), Transformers benefit from highly optimized GEMM kernels and consume comparable or less energy. Beyond this threshold, however, the quadratic compute burden reverses the trend dramatically. At 57K tokens, Qwen2.5 consumes 1,492 J, approximately 4× more than Mamba-2 (370 J). The hybrid Falcon-H1 (613 J) offers a balanced profile: 2.4× more energy-efficient than the Transformer while maintaining higher accuracy. For throughput, Mamba2 achieves 2.64× and Falcon-H1 1.54× the throughput of Qwen2.5 at a 32K sequence length.
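The reported ratios follow directly from the measured totals at 57K tokens; a quick arithmetic check:

```python
# Measured inference energy at 57K tokens on the RTX 4090 (from the text above).
energy_j = {"Qwen2.5-0.5B": 1492, "Mamba2-780m": 370, "Falcon-H1-0.5B": 613}

baseline = energy_j["Qwen2.5-0.5B"]
for model, joules in energy_j.items():
    ratio = baseline / joules
    print(f"{model}: {joules} J ({ratio:.1f}x vs. the Transformer baseline)")
```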

Inference energy consumption and throughput
Figure 6: (a) Inference energy consumption (log scale) and (b) end-to-end throughput of Qwen2.5-0.5B, Mamba2-780m, and Falcon-H1-0.5B over sequence length on RTX 4090. SSMs and hybrids achieve dramatically better energy efficiency and throughput at long contexts.

Operator-Level Performance Analysis

To understand the runtime composition of SSMs and hybrid models, we classify all operators into three categories: SSM-specific kernels (e.g., mamba_inner_fn, mamba_split_conv1d_scan_combined_fn), GEMM-based operators (e.g., matmul, linear), and non-GEMM operators (normalization, memory layout, activations). We profile latency breakdown across sequence lengths from 256 to 131K tokens on both the consumer and edge GPUs.
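A minimal sketch of this three-way classification over (operator name, latency) pairs such as those emitted by a profiler trace. The substring patterns are illustrative, not the framework's exhaustive taxonomy:

```python
# Illustrative substring patterns for the three operator classes.
SSM_PATTERNS = ("mamba_inner_fn", "selective_scan",
                "mamba_split_conv1d_scan_combined_fn")
GEMM_PATTERNS = ("matmul", "linear", "gemm", "addmm", "mm")

def classify_op(name: str) -> str:
    lowered = name.lower()
    if any(p in lowered for p in SSM_PATTERNS):
        return "ssm"
    if any(p in lowered for p in GEMM_PATTERNS):
        return "gemm"
    return "non_gemm"  # normalization, memory layout, activations, etc.

def latency_share(trace):
    """Aggregate (op_name, latency_us) pairs into per-class latency shares."""
    totals = {"ssm": 0.0, "gemm": 0.0, "non_gemm": 0.0}
    for name, us in trace:
        totals[classify_op(name)] += us
    total = sum(totals.values()) or 1.0
    return {k: v / total for k, v in totals.items()}
```

The SSM check runs first so that fused SSM kernels whose names happen to contain GEMM-like substrings are not misclassified.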

Operator-level latency breakdown for Mamba and Mamba-2
Figure 7: Latency breakdown of SSMs by operator class on RTX 4090 for (a) Mamba-130M (first generation) and (b) Mamba-2 (second generation). As sequence length increases, GEMM operators grow while SSM-specific kernels remain a near-constant dominant share. Mamba-2's larger state dimensions (64 vs. 16) make its selective scan operator more expensive.

Key Takeaways from Operator Analysis

🖥️

SSMs on Consumer GPU

SSM-specific kernels remain a near-constant dominant fraction of runtime regardless of context length. GEMM share grows, but SSM kernels define the optimization target.

📱

SSMs on Edge GPU (Jetson)

SSM operators account for >55% of latency on Jetson across all tested sequence lengths — even more dominant than on the consumer GPU for Mamba-1.

🔀

Hybrid Models

Unlike SSMs, hybrid models (Zamba2, Hymba) do not show a single dominant operator class. Bottlenecks are model-specific — underscoring the need for per-model profiling tools like SSM-Scope.

🏗️

Cross-Platform Portability

For hybrid models, optimization gains on one platform (consumer GPU) transfer broadly to the other (edge GPU), since operator breakdown trends remain consistent across devices.

Citation

If you find SSM-Scope or this work useful in your research, please consider citing:

@article{mitra2025characterizing,
  title   = {Characterizing State Space Model (SSM) and SSM-Transformer Hybrid
             Language Model Performance with Long Context Length},
  author  = {Mitra, Saptarshi and Karami, Rachid and Xu, Haocheng and
             Huang, Sitao and Kwon, Hyoukjun},
  journal = {arXiv preprint arXiv:2507.12442},
  year    = {2025}
}