8 April 2026
Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?
Day 8 of Inkhaven: 30 Days of Posts
TL;DR
The GSM-Symbolic paper (ICLR 2025) purported to show that language models rely on pattern matching rather than genuine reasoning: perturbing questions to break the pattern of the original caused catastrophic drops in performance. Rerunning the evaluation in March 2026 with GPT-4o, Claude Opus 4.6, and Claude Haiku 4.5, we replicate the original reasoning failures only when we do not audit out examples that are genuinely ambiguous. When we carefully remove distractors that could legitimately affect the calculation, the effect is drastically reduced. This calls into question the content of the original dataset and the claims of that work.
Introduction
This result likely comes as little surprise, but I have seen this paper shared triumphantly as recently as last week, as though it still applied to current models. That frustrated me enough that I wanted to see exactly how far models have come, in the hope of setting the record a little straighter. I also haven't seen any follow-up work explicitly demonstrating that the result no longer holds.
Background
In October 2024, Mirzadeh et al. from Apple published "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models." The paper made three claims:
- LLMs show noticeable variance when the same question is asked with different names and values.
- Performance declines when numerical values are altered.
- Adding irrelevant information ("No-Op" clauses) causes catastrophic drops of up to 65%, suggesting LLMs pattern-match rather than reason.
The paper was published at ICLR 2025 and continues to be commonly referenced and discussed. However, the models it evaluated (GPT-4o, Llama 3 8B, Phi-3, Gemma 2, etc.) are now around 18 months old. We wanted to test whether the results still hold.
Original Paper Results (2024 Models)
For context, here are selected results from the original paper:
| Model (2024) | GSM-Symbolic (%) | GSM-NoOp (%) |
|---|---|---|
| GPT-4o | 94.9 | 63.0 |
| Phi-3-mini | 82.1 | 22.5 |
| Gemma2-9b | 79.1 | 16.1 |
| Llama3-8b | 74.6 | 17.2 |
The original results were striking. Even GPT-4o, the strongest model tested, dropped from 94.9% to 63.0% when irrelevant confounders were added.
Methodology
We followed the original paper's evaluation protocol as closely as possible:
- GSM-Symbolic: We used the official dataset from Apple's repository. Due to compute constraints we evaluated one fifth of the 5,000-question dataset: 1,000 questions (100 templates x 10 instances) per model.
- GSM-NoOp: Apple did not release the GSM-NoOp dataset, so we generated our own using a three-stage pipeline:
  - Generation: Claude Opus 4.6 generated 917 distractor clauses, each adding one irrelevant-but-plausible numerical statement to a GSM-Symbolic question while leaving the correct answer unchanged. We provided all three NoOp examples from the original paper as guidance for distractor style.
  - Audit: A separate Claude Opus 4.6 instance (fresh context) classified each distractor as IRRELEVANT, AMBIGUOUS, or RELEVANT. The audit prompt was informed by failure modes we identified, such as changes in size, volume, or capacity constraints that would actually affect the calculation.
  - Filtering: Only samples classified as IRRELEVANT were retained: 289 of 917 (31.5%). The remaining 68.5% were excluded from the primary evaluation because a reasonable solver might legitimately include those factors in the calculation.

  We found this filtering step genuinely challenging: most of the samples Claude Opus 4.6 generated had to be filtered out, even though it was specifically prompted to avoid ambiguous distractors.
Examples of distractors that passed our audit:
- "though 3 of the remoras were a bit smaller than average for their species" (size does not change stated length)
- "She is using a brand new pan she bought yesterday for $12" (pan cost is irrelevant to counting popcorn)
- "He has already played 6 rounds today without rolling a single 12" (previous rolls do not affect probability)
Examples of distractors that were filtered out as ambiguous:
- "5 of the kernels landed outside the skillet" (plausibly changes the count of viable kernels, since she can't eat what fell on the stove)
- "15 of the pineapples are still unripe" (plausibly reduces harvestable count)
- "3 remoras overlapping by 6 inches" (Given the question asks about combined length of remoras this plausibly changes combined visible length)
- Prompting: 8-shot chain-of-thought with the standard GSM8K shots from the lm-evaluation-harness, matching the paper's format exactly. We used temperature 0 to match the original paper where possible; some reasoning models only support temperature 1, and for those we used 1.
- Answer extraction: We take only the last number in the response, following the original paper.
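A sketch of the extraction rule, assuming answers may contain comma-grouped integers or decimals (the exact regex is ours, not from the paper):

```python
import re

def extract_answer(response: str):
    """Return the last number in a model response, per the paper's rule.

    Handles comma-grouped integers (1,000) and decimals; returns None
    if the response contains no number at all.
    """
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", response)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))
```

For example, `extract_answer("She has 1,000 - 15 = 985 kernels. The answer is 985.")` yields `985.0`.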
Results
We evaluated GPT-4o, Claude Opus 4.6, and Claude Haiku 4.5: the best performer from the original paper, the current frontier model, and a small but reasonably strong reasoning model, respectively.
| Model | GSM-Symbolic | NoOp (unfiltered) | Unfiltered drop | NoOp (filtered) | Filtered drop |
|---|---|---|---|---|---|
| Opus 4.6 | 100.0% | 76.4% | -23.6% | 96.9% | -3.1% |
| Haiku 4.5 | 100.0% | 69.8% | -30.2% | 90.7% | -9.3% |
| GPT-4o | 95.2% | 65.1% | -30.1% | 89.6% | -5.6% |
The pattern is clear: on the unfiltered set, scores drop by 24-30 points, matching the original paper's findings, while on the filtered set they drop by only 3-9 points. The original effect replicates only when the dataset includes samples that are legitimately ambiguous to the model.
Discussion and Caveats
Our NoOp data was LLM-generated. Apple did not release the original GSM-NoOp dataset, so we generated our distractor clauses with Claude Opus 4.6. These may be systematically easier than, or otherwise different from, Apple's hand-crafted distractors, so we cannot make a direct apples-to-apples comparison on NoOp. However, the fact that our unfiltered dataset shows almost exactly the same drop as the original paper is reasonable evidence that the two distributions are similar.
The original paper does not say whether the data was generated by hand or by LLM, nor whether any auditing or quality control was applied. Because the data was not released, we cannot confirm either way.
On our filtered dataset, which aims to contain exclusively irrelevant but potentially misleading clauses, we observe only a minor decrease in model scores (3-9 points).
We used 1,000 questions instead of the full 5,000. The original paper used 100 templates x 50 instances = 5,000 questions; we used 100 templates x 10 instances = 1,000. This is sufficient to demonstrate the accuracy levels observed but provides less statistical power.
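To put rough numbers on the lost power: if each question were independent (an optimistic assumption, since instances within a template are correlated), the standard error of an accuracy estimate at accuracy p over n questions is sqrt(p(1-p)/n).

```python
import math

def accuracy_se(p: float, n: int) -> float:
    """Standard error of an accuracy estimate over n independent questions."""
    return math.sqrt(p * (1 - p) / n)

# At roughly 90% accuracy:
se_ours = accuracy_se(0.90, 1000)      # ~0.95 percentage points
se_original = accuracy_se(0.90, 5000)  # ~0.42 percentage points
```

So our accuracy estimates carry error bars about sqrt(5), roughly 2.2x, wider than a full 5,000-question run would, before accounting for template correlation.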
This doesn't mean LLMs can "truly reason." GSM-Symbolic tests grade-school math. Saturating this benchmark tells us these models handle elementary arithmetic robustly, not that they can do advanced mathematics or formal reasoning. Harder benchmarks (FrontierMath, ARC-AGI-2, Humanity's Last Exam) exist to test true reasoning abilities in models.
Conclusion
The GSM-Symbolic paper made valid and important observations about the models available in mid-2024, but those observations no longer describe the current frontier. The original claims also look somewhat questionable in themselves, given that a quality audit removes much of the harm attributed to the confounders. Given the paper's reach in shaping the discourse, we believe the authors should release the GSM-NoOp dataset so that the original findings, and the claim that models cannot reason, can be independently replicated.