Factual QA
Factual QA #4

Compare the Transformer architecture (as described in the original "Attention Is All You Need" paper) with the Mamba architecture. Cover: key structural differences, computational complexity for sequence length, strengths and weaknesses of each for different task types, and why state space models are gaining interest.

9 models responded

DeepSeek R1 · DeepSeek
high · 9.4

Judge: Technically precise comparison covering all four requested areas. Correctly states Transformer self-attention is O(n^2), Mamba is O(n) for training and O(1) per step for inference. Accurately describes selective SSM mechanism with input-dependent B and C parameters. The strengths/weaknesses table is insightful, particularly the bidirectionality weakness of SSMs and the KV-cache problem for Transformers. The 'why SSMs are gaining interest' section is well-reasoned. Thorough and well-organized.

Excellent question. This comparison highlights a fundamental shift in sequence modeling paradigms. Let's break down the comparison between the **Transformer (Vaswani et al., 2017)** and **Mamba (Gu & ...
deepseek/deepseek-reasoner · 2026-03-02
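The mechanism the judge credits above, a selective SSM whose B and C parameters depend on the input, can be illustrated with a toy recurrence. This is a single-channel sketch under simplifying assumptions, not the actual Mamba kernel: real Mamba uses per-channel discretization and a hardware-aware parallel scan, and the projection names `W_B`, `W_C`, `W_delta` here are illustrative.

```python
# Toy sketch of a selective SSM recurrence (single channel, sequential scan).
# Illustrative only: real Mamba parallelizes this scan on GPU.
import numpy as np

rng = np.random.default_rng(0)
d_state, seq_len = 4, 6
A = -np.abs(rng.standard_normal(d_state))        # stable diagonal state matrix
W_B = rng.standard_normal(d_state)               # projects input -> B_t
W_C = rng.standard_normal(d_state)               # projects input -> C_t
W_delta = rng.standard_normal()                  # projects input -> step size

def selective_ssm(x):
    """O(n) over the sequence; B_t, C_t, delta_t all depend on x_t."""
    h = np.zeros(d_state)
    ys = []
    for x_t in x:
        delta = np.log1p(np.exp(W_delta * x_t))  # softplus keeps delta > 0
        A_bar = np.exp(delta * A)                # zero-order-hold discretization
        B_t = W_B * x_t                          # input-dependent B (selectivity)
        C_t = W_C * x_t                          # input-dependent C (selectivity)
        h = A_bar * h + delta * B_t * x_t        # O(1) state update per step
        ys.append(C_t @ h)
    return np.array(ys)

x = rng.standard_normal(seq_len)
y = selective_ssm(x)
print(y.shape)
```

The fixed-size state `h` is what gives O(1) work and memory per generated token at inference, in contrast to a Transformer's KV cache, which grows with sequence length.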
Claude Opus · Anthropic
medium · 7.8

Judge: Correctly states Transformer self-attention is O(n^2) and describes Mamba's selective state space mechanism. Covers structural differences, computational complexity, and strengths/weaknesses. The Mamba description mentions input-dependent parameters and parallel scan, which are key features. Response truncated before completing the strengths/weaknesses section and the SSM interest discussion. Accuracy is good on both architectures.

## Transformer vs. Mamba Architecture Comparison ### Key Structural Differences **Transformer Architecture:** - Built entirely on self-attention mechanisms and feed-forward networks - Uses multi-hea...
anthropic/claude-opus-4-20250514 · 2026-03-01
Claude Sonnet · Anthropic
high · 7.4

Judge: Correctly describes Transformer self-attention as O(n^2) and Mamba's linear complexity. Accurately covers Mamba's selective state space mechanism. Structural comparison is solid. Response is truncated before completing the strengths/weaknesses section and likely before covering 'why SSMs are gaining interest,' but what's visible is technically accurate and well-organized. The hardware-aware design mention for Mamba is a good detail.

# Transformer vs. Mamba: A Comprehensive Comparison ## Key Structural Differences ### Transformer Architecture - **Core Mechanism**: Multi-head self-attention with feed-forward networks - **Attentio...
anthropic/claude-sonnet-4-20250514 · 2026-03-01
o4-mini · OpenAI
high · 7.2

Judge: Covers all four requested areas: structural differences, computational complexity, strengths/weaknesses, and why SSMs are gaining interest. Correctly states Transformer self-attention is O(n^2). However, describes Mamba as using 'learnable linear time-invariant recurrence' when Mamba's key innovation is its selective (input-dependent) state space mechanism. This mischaracterization of Mamba's core contribution is a notable factual error.

Hard constraint failed: Describes Mamba as using 'linear time-invariant recurrence' when Mamba's key innovation is its selective (input-dependent) state space mechanism. Fails to describe the selective aspect.
Below is a high‐level comparison between the original “Attention Is All You Need” Transformer and a representative modern “Mamba”-style architecture that centers on State-Space Model (SSM) blocks (oft...
openai/o4-mini · 2026-03-02
Claude Haiku · Anthropic
medium · 7.0

Judge: Accurate on both architectures — correct attention complexity O(n^2), correct SSM formulation for Mamba with selective mechanism. Good visual comparison and table format. Truncated before completing the complexity table, so coverage of strengths/weaknesses and SSM interest may be incomplete. What's visible is technically precise.

# Transformer vs. Mamba: A Comprehensive Comparison ## Key Structural Differences ### Transformer Architecture ``` Input → Embedding → [Multi-Head Attention + FFN] × N → Output Multi-Head Attentio...
anthropic/claude-haiku-4-5-20251001 · 2026-03-01
Gemini 2.5 Flash · Google
medium · 6.9

Judge: Response is truncated — cuts off mid-section heading ('Why State Space Models (SSMs'). Covers Transformer architecture accurately (self-attention, O(n^2) complexity, positional encoding) and describes Mamba's selective state space mechanism. However, the incomplete ending means the 'why SSMs are gaining interest' section is missing. What's present is accurate and well-organized.

The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), revolutionized sequence modeling with its self-attention mechanism. More recently, State Space Models (S...
gemini/gemini-2.5-flash · 2026-03-02
Phi-4 · Microsoft
medium · 5.4

Judge: Correctly states Transformer self-attention is O(n^2) and describes Mamba as using state space models with linear complexity, meeting hard constraints. However, the description of Mamba is vague — mentions 'sparse attention patterns' which isn't accurate for Mamba (it uses selective state spaces, not sparse attention). The name expansion 'Memory-efficient Attention with Low-complexity' appears fabricated. Covers structural differences and complexity well, but truncated before completing strengths/weaknesses analysis.

The Transformer architecture, introduced by Vaswani et al. in the seminal paper "Attention Is All You Need" (2017), and Mamba (Memory-efficient Attention with Low-complexity) represent two approaches...
ollama/phi4 · 2026-03-01
GPT-4o · OpenAI
high · 5.0

Judge: Correct on Transformer architecture and O(n^2) complexity, but vague on Mamba: describes generic state space models rather than Mamba's specific selective scan mechanism. Claims O(n log n) complexity when Mamba achieves O(n). Response truncated before completing strengths/weaknesses.

Hard constraint failed: Does not describe Mamba's selective state space mechanism specifically; only provides a generic SSM description without addressing the input-dependent selection that distinguishes Mamba.
The Transformer architecture, as introduced in the "Attention Is All You Need" paper by Vaswani et al. in 2017, has become a cornerstone of modern natural language processing due to its efficiency and...
openai/gpt-4o · 2026-03-01
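The complexity error flagged above (O(n log n) instead of O(n)) is worth making concrete. A back-of-envelope count under the standard assumptions: autoregressive attention with a KV cache does O(t) work at step t, so O(n^2) total, while an SSM updates a fixed-size state each step, so O(n) total. The constant `d_state` below is an illustrative placeholder.

```python
# Back-of-envelope per-token inference cost, up to constant factors.
def attention_total_ops(n):
    # Step t attends over t cached keys: sum_{t=1}^{n} t = n(n+1)/2, i.e. O(n^2).
    return sum(t for t in range(1, n + 1))

def ssm_total_ops(n, d_state=16):
    # Each step updates a fixed d_state-dimensional recurrence: O(n) total.
    return n * d_state

for n in (1_000, 10_000):
    print(n, attention_total_ops(n), ssm_total_ops(n))
```

Multiplying n by 10 multiplies the attention total by roughly 100 but the SSM total by exactly 10, which is the asymptotic gap the judges reward the accurate responses for stating.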
Llama 3.1 8B · Meta
high · 2.2

Judge: Fails the hard constraint on Mamba: never describes Mamba's selective state space mechanism. Instead treats Mamba as a generic 'single-layer self-attention' model, which is fundamentally wrong (Mamba uses structured state space models, not attention). Claims Mamba has O(n) complexity for the wrong reasons. Correctly states Transformer attention is O(n^2). The response demonstrates no real understanding of the Mamba architecture.

Hard constraint failed: Must describe Mamba's selective state space mechanism; never mentioned. Incorrectly describes Mamba as using 'single-layer self-attention.'
The Transformer architecture, introduced in the seminal paper "Attention Is All You Need" by Vaswani et al., has revolutionized the field of natural language processing (NLP) and computer vision. It's...
ollama/llama3.1:8b · 2026-03-01