Factual QA
Factual QA #4

Compare the Transformer architecture (as described in the original "Attention Is All You Need" paper) with the Mamba architecture. Cover: key structural differences, computational complexity for sequence length, strengths and weaknesses of each for different task types, and why state space models are gaining interest.

9 models responded

DeepSeek R1 · DeepSeek
high · 9.4

Judge: Technically precise comparison covering all four requested areas. Correctly states Transformer self-attention is O(n^2), Mamba is O(n) for training and O(1) per step for inference. Accurately describes selective SSM mechanism with input-dependent B and C parameters. The strengths/weaknesses table is insightful, particularly the bidirectionality weakness of SSMs and the KV-cache problem for Transformers. The 'why SSMs are gaining interest' section is well-reasoned. Thorough and well-organized.

Excellent question. This comparison highlights a fundamental shift in sequence modeling paradigms. Let's break down the comparison between the **Transformer (Vaswani et al., 2017)** and **Mamba (Gu & ...
deepseek/deepseek-reasoner · 2026-03-02
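The mechanism the judge credits above, a selective SSM whose B and C parameters depend on the input, can be illustrated with a toy recurrence. This is a single-channel sketch under simplifying assumptions, not the actual Mamba kernel: real Mamba uses per-channel discretization and a hardware-aware parallel scan, and the projection names `W_B`, `W_C`, `W_delta` here are illustrative.

```python
# Toy sketch of a selective SSM recurrence (single channel, sequential scan).
# Illustrative only: real Mamba parallelizes this scan on GPU.
import numpy as np

rng = np.random.default_rng(0)
d_state, seq_len = 4, 6
A = -np.abs(rng.standard_normal(d_state))        # stable diagonal state matrix
W_B = rng.standard_normal(d_state)               # projects input -> B_t
W_C = rng.standard_normal(d_state)               # projects input -> C_t
W_delta = rng.standard_normal()                  # projects input -> step size

def selective_ssm(x):
    """O(n) over the sequence; B_t, C_t, delta_t all depend on x_t."""
    h = np.zeros(d_state)
    ys = []
    for x_t in x:
        delta = np.log1p(np.exp(W_delta * x_t))  # softplus keeps delta > 0
        A_bar = np.exp(delta * A)                # zero-order-hold discretization
        B_t = W_B * x_t                          # input-dependent B (selectivity)
        C_t = W_C * x_t                          # input-dependent C (selectivity)
        h = A_bar * h + delta * B_t * x_t        # O(1) state update per step
        ys.append(C_t @ h)
    return np.array(ys)

x = rng.standard_normal(seq_len)
y = selective_ssm(x)
print(y.shape)
```

The fixed-size state `h` is what gives O(1) work and memory per generated token at inference, in contrast to a Transformer's KV cache, which grows with sequence length.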
Claude Opus · Anthropic
medium · 7.8

Judge: Correctly states Transformer self-attention is O(n^2) and describes Mamba's selective state space mechanism. Covers structural differences, computational complexity, and strengths/weaknesses. The Mamba description mentions input-dependent parameters and parallel scan, which are key features. Response truncated before completing the strengths/weaknesses section and the SSM interest discussion. Accuracy is good on both architectures.

## Transformer vs. Mamba Architecture Comparison ### Key Structural Differences **Transformer Architecture:** - Built entirely on self-attention mechanisms and feed-forward networks - Uses multi-hea...
anthropic/claude-opus-4-20250514 · 2026-03-01
Claude Sonnet · Anthropic
high · 7.4

Judge: Correctly describes Transformer self-attention as O(n^2) and Mamba's linear complexity. Accurately covers Mamba's selective state space mechanism. Structural comparison is solid. Response is truncated before completing the strengths/weaknesses section and likely before covering 'why SSMs are gaining interest,' but what's visible is technically accurate and well-organized. The hardware-aware design mention for Mamba is a good detail.

# Transformer vs. Mamba: A Comprehensive Comparison ## Key Structural Differences ### Transformer Architecture - **Core Mechanism**: Multi-head self-attention with feed-forward networks - **Attentio...
anthropic/claude-sonnet-4-20250514 · 2026-03-01
o4-mini · OpenAI
high · 7.2

Judge: Covers all four requested areas: structural differences, computational complexity, strengths/weaknesses, and why SSMs are gaining interest. Correctly states Transformer self-attention is O(n^2). However, describes Mamba as using 'learnable linear time-invariant recurrence' when Mamba's key innovation is its selective (input-dependent) state space mechanism. This mischaracterization of Mamba's core contribution is a notable factual error.

Hard constraint failed: Describes Mamba as using 'linear time-invariant recurrence' when Mamba's key innovation is its selective (input-dependent) state space mechanism. Fails to describe the selective aspect.
Below is a high‐level comparison between the original “Attention Is All You Need” Transformer and a representative modern “Mamba”-style architecture that centers on State-Space Model (SSM) blocks (oft...
openai/o4-mini · 2026-03-02
Claude Haiku · Anthropic
medium · 7.0

Judge: Accurate on both architectures — correct attention complexity O(n^2), correct SSM formulation for Mamba with selective mechanism. Good visual comparison and table format. Truncated before completing the complexity table, so coverage of strengths/weaknesses and SSM interest may be incomplete. What's visible is technically precise.

# Transformer vs. Mamba: A Comprehensive Comparison ## Key Structural Differences ### Transformer Architecture ``` Input → Embedding → [Multi-Head Attention + FFN] × N → Output Multi-Head Attentio...
anthropic/claude-haiku-4-5-20251001 · 2026-03-01
Gemini 2.5 Flash · Google
medium · 6.9

Judge: Response is truncated — cuts off mid-section heading ('Why State Space Models (SSMs'). Covers Transformer architecture accurately (self-attention, O(n^2) complexity, positional encoding) and describes Mamba's selective state space mechanism. However, the incomplete ending means the 'why SSMs are gaining interest' section is missing. What's present is accurate and well-organized.

The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), revolutionized sequence modeling with its self-attention mechanism. More recently, State Space Models (S...
gemini/gemini-2.5-flash · 2026-03-02
Phi-4 · Microsoft
medium · 5.4

Judge: Correctly states Transformer self-attention is O(n^2) and describes Mamba as using state space models with linear complexity, meeting hard constraints. However, the description of Mamba is vague — mentions 'sparse attention patterns' which isn't accurate for Mamba (it uses selective state spaces, not sparse attention). The name expansion 'Memory-efficient Attention with Low-complexity' appears fabricated. Covers structural differences and complexity well, but truncated before completing strengths/weaknesses analysis.

The Transformer architecture, introduced by Vaswani et al. in the seminal paper "Attention Is All You Need" (2017), and Mamba (Memory-efficient Attention with Low-complexity) represent two approaches...
ollama/phi4 · 2026-03-01
GPT-4o · OpenAI
high · 5.0

Judge: Correct on Transformer architecture and O(n^2) complexity, but vague on Mamba: describes generic state space models rather than Mamba's specific selective scan mechanism. Claims O(n log n) complexity when Mamba achieves O(n). Response truncated before completing strengths/weaknesses.

Hard constraint failed: Does not describe Mamba's selective state space mechanism specifically; only provides a generic SSM description without addressing the input-dependent selection that distinguishes Mamba.
The Transformer architecture, as introduced in the "Attention Is All You Need" paper by Vaswani et al. in 2017, has become a cornerstone of modern natural language processing due to its efficiency and...
openai/gpt-4o · 2026-03-01
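The complexity error flagged above (O(n log n) instead of O(n)) is worth making concrete. A back-of-envelope count under the standard assumptions: autoregressive attention with a KV cache does O(t) work at step t, so O(n^2) total, while an SSM updates a fixed-size state each step, so O(n) total. The constant `d_state` below is an illustrative placeholder.

```python
# Back-of-envelope per-token inference cost, up to constant factors.
def attention_total_ops(n):
    # Step t attends over t cached keys: sum_{t=1}^{n} t = n(n+1)/2, i.e. O(n^2).
    return sum(t for t in range(1, n + 1))

def ssm_total_ops(n, d_state=16):
    # Each step updates a fixed d_state-dimensional recurrence: O(n) total.
    return n * d_state

for n in (1_000, 10_000):
    print(n, attention_total_ops(n), ssm_total_ops(n))
```

Multiplying n by 10 multiplies the attention total by roughly 100 but the SSM total by exactly 10, which is the asymptotic gap the judges reward the accurate responses for stating.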
Llama 3.1 8B · Meta
high · 2.2

Judge: Fails the hard constraint on Mamba: never describes Mamba's selective state space mechanism. Instead treats Mamba as a generic 'single-layer self-attention' model, which is fundamentally wrong (Mamba uses structured state space models, not attention). Claims Mamba has O(n) complexity for the wrong reasons. Correctly states Transformer attention is O(n^2). The response demonstrates no real understanding of the Mamba architecture.

Hard constraint failed: Must describe Mamba's selective state space mechanism; never mentioned. Incorrectly describes Mamba as using 'single-layer self-attention.'
The Transformer architecture, introduced in the seminal paper "Attention Is All You Need" by Vaswani et al., has revolutionized the field of natural language processing (NLP) and computer vision. It's...
ollama/llama3.1:8b · 2026-03-01