Emerging applications such as AR are driving demand for machine intelligence capable of processing continuous and/or long-context inputs on local devices. However, the currently dominant models, based on the Transformer architecture, suffer from quadratic computational and memory overhead, which hinders applications that must process long contexts. This has spurred a paradigm shift toward new architectures such as State Space Models (SSMs) and SSM-Transformer hybrid models, which offer near-linear scaling.
To address this gap, we present a comprehensive, comparative benchmarking of carefully selected Transformer, SSM, and hybrid models specifically for long-context inference on consumer and embedded GPUs. Our analysis shows that SSMs are well suited for long-context on-device AI on consumer and embedded GPUs. While Transformers are up to 1.9× faster at short sequences (<8K tokens), SSMs show a dramatic performance inversion, becoming up to 4× faster at very long contexts (~57K tokens) thanks to their linear computational complexity and ~64% smaller memory footprint. Our operator-level analysis reveals that custom SSM kernels such as selective scan, despite being hardware-aware to minimize memory I/O, dominate inference runtime on edge platforms, accounting for over 55% of latency due to their sequential, element-wise nature. To foster further research, we open-source our characterization framework, SSM-Scope.
- At ~57K tokens, SSMs outperform Transformers by up to 4× in time-to-first-token (TTFT), thanks to their linear O(n) complexity.
- SSMs handle up to 220K tokens on a 24 GB consumer GPU, 4× beyond what optimized Transformers can reach without offloading.
- At 57K tokens, Mamba-2 (370 J) consumes ~75% less energy than Qwen2.5 (1,492 J) thanks to linear compute scaling.
- Selective-scan and related SSM-specific operators account for over 55% of total inference latency on Jetson edge platforms.
At short sequence lengths, heavily optimized Transformer models such as Qwen2.5 outperform SSMs. However, as context length grows, the quadratic memory and compute cost of attention creates an insurmountable bottleneck. The figure below encapsulates this performance crossover on an RTX 4090, comparing Qwen2.5-0.5B (Transformer) and Mamba2-780m (SSM) across representative prefill and decode workloads.
We evaluate three primary model families: (1) Transformers, which use scaled dot-product attention with KV-cache and O(n²) complexity; (2) SSMs (Mamba, Mamba-2), which use a selective state mechanism with O(n) complexity and constant O(1) memory during generation; and (3) Hybrid models (Zamba2, Hymba, Falcon-H1), which interleave attention and SSM layers to combine accuracy and efficiency.
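The memory contrast behind these complexity classes can be made concrete with a back-of-the-envelope sketch. The layer counts and dimensions below are illustrative placeholders, not the configurations of any model benchmarked here.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Transformer KV-cache size: grows linearly with context length
    (and attention compute grows quadratically during prefill)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes  # 2 = K and V

def ssm_state_bytes(n_layers, d_inner, d_state, dtype_bytes=2):
    """SSM recurrent state: fixed size, independent of context length."""
    return n_layers * d_inner * d_state * dtype_bytes

# Hypothetical small-model shapes (placeholders, not Qwen2.5/Mamba-2 configs)
for n in (1_024, 8_192, 57_344):
    print(n, kv_cache_bytes(n, n_layers=24, n_kv_heads=2, head_dim=64))

print(ssm_state_bytes(n_layers=24, d_inner=1536, d_state=128))  # constant in n
```

Doubling the context doubles the KV-cache but leaves the SSM state untouched, which is the root of the O(1) generation-memory claim above.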
While pure SSMs achieve the fastest inference, they sacrifice accuracy on knowledge-intensive tasks. Hybrid models like Falcon-H1 bridge this gap — achieving accuracy on par with or exceeding Transformer baselines while maintaining a 2.8× TTFT speedup.
SSM-Scope is an open-source, backend-agnostic performance characterization framework for language models. It takes a workload configuration (model registry entry, sequence lengths, batch sizes) and profiling configuration as input, and produces end-to-end and operator-level breakdowns for latency, memory, and energy metrics.
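In practice, the inputs might look like the minimal sketch below. The configuration keys, the `run_characterization` helper, and the record fields are illustrative assumptions, not SSM-Scope's actual API.

```python
import time

# Hypothetical workload/profiling configs (keys are assumptions, not the real API)
workload_config = {
    "model": "mamba2-780m",                  # model registry entry
    "sequence_lengths": [1024, 8192, 57344],
    "batch_sizes": [1],
}
profiling_config = {
    "metrics": ["latency", "memory", "energy"],
    "granularity": "operator",               # or "end_to_end"
}

def run_characterization(workload, profiling):
    """Sweep the workload grid and collect one record per point."""
    records = []
    for seq_len in workload["sequence_lengths"]:
        for batch in workload["batch_sizes"]:
            start = time.perf_counter()
            # ... backend inference call would go here ...
            elapsed = time.perf_counter() - start
            records.append({
                "seq_len": seq_len,
                "batch": batch,
                "latency_s": elapsed,
                "metrics": profiling["metrics"],
            })
    return records

report = run_characterization(workload_config, profiling_config)
print(len(report))  # one record per (sequence length, batch size) pair
```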
| Specification | GPU 1 — Consumer | GPU 2 — Edge |
|---|---|---|
| GPU Model | NVIDIA RTX 4090 | NVIDIA Jetson Orin Nano |
| Architecture | Ada Lovelace | Ampere |
| Streaming Multiprocessors | 128 | 8 |
| Compute Throughput | ~330 TFLOPS | ~20 TFLOPS |
| GPU Memory | 24 GB GDDR6X | 8 GB LPDDR5 (Shared) |
| Memory Bandwidth | 1,008 GB/s | 68 GB/s |
| Host Interconnect | PCIe 4.0 ×16 | Integrated on-chip |
Jetson Orin Nano evaluated in MAXN power mode with 16 GB NVMe swap enabled.
| Architecture | Family | Parameter Sizes | Quantized Checkpoint |
|---|---|---|---|
| Transformer | Qwen2.5 | 0.5B, 1.5B | ✓ |
| | Phi-3-mini | 3.82B | ✓ |
| | Llama-3.2 | 1B | ✓ |
| | TinyLlama / GPT-Neo | 1.1B / 125M | — |
| SSM | Mamba | 130M, 370M, 790M, 1.4B, 2.8B | ✓ |
| | Mamba-2 | 130M, 370M, 780M, 1.4B, 2.8B | ✓ |
| Hybrid | Zamba2 | 1.2B, 2.7B | — |
| | Hymba | 1.5B | — |
| | Falcon-H1 | 0.5B, 1.5B | — |
We characterize peak GPU memory across sequence lengths from 1K to over 220K tokens. Transformers (Qwen2.5, Llama3.2) hit their out-of-memory (OOM) limits at 57K–65K tokens on the RTX 4090 due to the growing KV-cache. By contrast, SSMs carry no KV-cache and their activation memory scales linearly at a much lower rate, allowing Mamba and Mamba-2 (~790M) to run at up to 220K tokens within 24 GB. Falcon-H1 extends to ~164K tokens, far beyond Transformer models of comparable accuracy. On the edge Jetson platform, SSMs and hybrids again show a clear advantage, supporting contexts >16K where Transformers cannot operate.
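A simple budget model illustrates why linear-but-lower-rate scaling translates into much longer maximum contexts. The per-token rates and weight sizes below are hypothetical placeholders chosen only to show the shape of the effect; they are not fitted to the measured 57K or 220K limits.

```python
def max_context(budget_gib, weights_gib, bytes_per_token):
    """Largest context whose linearly growing cache/activations fit the budget."""
    free_bytes = (budget_gib - weights_gib) * 1024**3
    return int(free_bytes // bytes_per_token)

BUDGET_GIB = 24          # hypothetical device memory budget
WEIGHTS_GIB = 1.6        # hypothetical model weights
KV_RATE = 48 * 1024      # bytes/token, hypothetical Transformer KV-cache rate
SSM_RATE = 4 * 1024      # bytes/token, hypothetical (much lower) SSM activation rate

print(max_context(BUDGET_GIB, WEIGHTS_GIB, KV_RATE))   # Transformer-style limit
print(max_context(BUDGET_GIB, WEIGHTS_GIB, SSM_RATE))  # SSM-style limit, ~12x longer
```

Both curves are linear; the order-of-magnitude gap in the per-token rate is what moves the OOM point from tens of thousands to hundreds of thousands of tokens.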
We measure total inference energy (J) and end-to-end token throughput (tokens/s) across contexts from 1K to 57K tokens on the RTX 4090, for three representative models: Qwen2.5-0.5B (Transformer), Mamba2-780m (SSM), and Falcon-H1-0.5B (Hybrid).
At short sequences (<16K), Transformers benefit from highly optimized GEMM kernels and consume comparable or less energy. Beyond this threshold, however, the quadratic compute burden reverses the trend dramatically: at 57K tokens, Qwen2.5 consumes 1,492 J, approximately 4× more than Mamba-2 (370 J). The hybrid Falcon-H1 (613 J) offers a balanced profile, 2.4× more energy-efficient than the Transformer while maintaining higher accuracy. For throughput, Mamba-2 achieves 2.64× and Falcon-H1 1.54× the throughput of Qwen2.5 at a 32K sequence length.
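The reported ratios follow directly from the measured energies:

```python
# Energy measured at 57K tokens on the RTX 4090 (values from the text above)
energy_j = {"Qwen2.5-0.5B": 1492, "Mamba2-780m": 370, "Falcon-H1-0.5B": 613}

transformer_vs_ssm = energy_j["Qwen2.5-0.5B"] / energy_j["Mamba2-780m"]
transformer_vs_hybrid = energy_j["Qwen2.5-0.5B"] / energy_j["Falcon-H1-0.5B"]
ssm_savings = 1 - energy_j["Mamba2-780m"] / energy_j["Qwen2.5-0.5B"]

print(f"{transformer_vs_ssm:.1f}x")     # ~4x more energy than Mamba-2
print(f"{transformer_vs_hybrid:.1f}x")  # ~2.4x more energy than Falcon-H1
print(f"{ssm_savings:.0%}")             # Mamba-2 uses ~75% less energy
```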
To understand the runtime composition of SSMs and hybrid models, we classify all operators into three categories: SSM-specific kernels (e.g., `mamba_inner_fn`, `mamba_split_conv1d_scan_combined_fn`), GEMM-based operators (e.g., `matmul`, `linear`), and non-GEMM operators (normalization, memory layout, activations). We profile the latency breakdown across sequence lengths from 256 to 131K tokens on both the consumer and edge GPUs.
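This three-way classification can be expressed as a small name-matching pass over a profiler trace. The keyword lists below are a simplified assumption of how such a classifier might look, not SSM-Scope's actual implementation.

```python
# Simplified operator classifier (keyword lists are illustrative assumptions)
SSM_KEYWORDS = ("mamba_inner_fn", "mamba_split_conv1d_scan_combined_fn",
                "selective_scan", "ssm")
GEMM_KEYWORDS = ("matmul", "linear", "gemm", "mm", "addmm")

def classify_op(op_name):
    name = op_name.lower()
    if any(k in name for k in SSM_KEYWORDS):
        return "ssm"
    if any(k in name for k in GEMM_KEYWORDS):
        return "gemm"
    return "non_gemm"  # normalization, memory layout, activations, ...

def latency_breakdown(trace):
    """trace: list of (op_name, latency_ms) pairs -> fraction per category."""
    totals = {"ssm": 0.0, "gemm": 0.0, "non_gemm": 0.0}
    for op_name, latency_ms in trace:
        totals[classify_op(op_name)] += latency_ms
    grand_total = sum(totals.values())
    return {k: v / grand_total for k, v in totals.items()}

trace = [("selective_scan_fwd", 6.0), ("aten::addmm", 3.0), ("aten::layer_norm", 1.0)]
print(latency_breakdown(trace))  # SSM kernels dominate this toy trace
```

Order matters in the matching: SSM kernel names can contain GEMM-like substrings, so the SSM check runs first.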
- SSM-specific kernels remain a dominant, near-constant fraction of runtime regardless of context length. The GEMM share grows with sequence length, but SSM kernels remain the primary optimization target.
- SSM operators account for >55% of latency on the Jetson across all tested sequence lengths, an even larger share than on the consumer GPU for Mamba-1.
- Unlike SSMs, hybrid models (Zamba2, Hymba) do not show a single dominant operator class. Bottlenecks are model-specific, underscoring the need for per-model profiling tools like SSM-Scope.
- For hybrid models, optimization gains on one platform (consumer GPU) transfer broadly to the other (edge GPU), since operator breakdown trends remain consistent across devices.
If you find SSM-Scope or this work useful in your research, please consider citing:
@article{mitra2025characterizing,
title = {Characterizing State Space Model (SSM) and SSM-Transformer Hybrid
Language Model Performance with Long Context Length},
author = {Mitra, Saptarshi and Karami, Rachid and Xu, Haocheng and
Huang, Sitao and Kwon, Hyoukjun},
journal = {arXiv preprint arXiv:2507.12442},
year = {2025}
}