Model · 7–9 min read · 2026-06-12
SWE-Review: Closing the Loop on Issue Resolution with Agentic Code Review
An agentic reviewer that explores the repo, traces the root cause, and hands back structured feedback — turning a one-shot patch into a generate-review-revise loop. SWE-Review-30B-A3B.
A coding agent that resolves an issue submits one patch and stops. If the patch is subtly wrong — fixes the symptom but not the cause, breaks an edge case, touches the wrong file — nothing catches it. Human software engineering doesn’t work that way: someone reviews the change, traces whether it actually addresses the problem, and sends it back with notes. SWE-Review gives agents that second half of the loop.
SWE-Review-30B-A3B is an agentic code-review model. Rather than scoring a diff in isolation, it behaves like a reviewer: it independently explores the repository through tool calls, traces the root cause of the issue on its own, and compares its diagnosis against the submitted PR, then emits a structured review decision with diagnostic feedback that a generator can revise against. It is fine-tuned from Qwen3-30B-A3B, a mixture-of-experts model with 30B total but only 3B active parameters, so its per-token compute cost is close to that of a small dense model, although all 30B parameters still have to fit in memory.
The generate-review-revise loop
The unit of work is not a single attempt but a loop. A generator agent produces a candidate patch; SWE-Review explores the repository from scratch, builds its own understanding of what the issue requires, and checks the candidate against that understanding; its feedback goes back to the generator for a revised attempt. The reviewer’s independence is the point — because it traces the root cause itself rather than trusting the generator’s narrative, it catches fixes that look right but aren’t.
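The control flow of that loop can be sketched in a few lines of Python. This is a minimal illustration and not the project's implementation: the `Review` type, the `generate` and `review` callables, the `max_rounds` cap, and the toy stand-ins are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Review:
    accept: bool   # the reviewer's accept/reject decision
    feedback: str  # diagnostic notes the generator can revise against


# Hypothetical interfaces: a generator maps (issue, prior feedback) to a patch,
# and a reviewer maps (issue, patch) to a Review.
Generator = Callable[[str, Optional[str]], str]
Reviewer = Callable[[str, str], Review]


def generate_review_revise(
    issue: str, generate: Generator, review: Reviewer, max_rounds: int = 3
) -> Tuple[str, bool, int]:
    """Run the loop and return (last patch, accepted?, rounds used)."""
    feedback: Optional[str] = None
    patch = ""
    for round_no in range(1, max_rounds + 1):
        patch = generate(issue, feedback)   # first attempt, or a revision
        verdict = review(issue, patch)      # reviewer judges the patch itself
        if verdict.accept:
            return patch, True, round_no
        feedback = verdict.feedback         # hand the notes back for the next round
    return patch, False, max_rounds


# Toy stand-ins so the sketch runs end to end (not real agents).
def toy_generate(issue: str, feedback: Optional[str]) -> str:
    return "fix symptom" if feedback is None else "fix root cause"


def toy_review(issue: str, patch: str) -> Review:
    if "root cause" in patch:
        return Review(accept=True, feedback="")
    return Review(accept=False, feedback="Symptom is hidden; root cause untouched.")


result = generate_review_revise("crash on empty input", toy_generate, toy_review)
```

Note that the reviewer receives only the issue and the patch, not the generator's explanation, which mirrors the independence described above.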
How it was trained
The model is fine-tuned on SWE-Review-Traj, a set of 8,914 agentic review trajectories — full tool-calling sessions of a reviewer exploring a repo and reaching a verdict. Training uses LLaMA-Factory with DeepSpeed ZeRO-3, four epochs at a 1e-4 cosine-scheduled learning rate, with context extended to 131,072 tokens via YaRN rope scaling so the reviewer can hold a long exploration in context. The recipe is deliberately the same family of machinery as the rest of LegoFactory — trajectories in, LLaMA-Factory SFT out.
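To make the context extension concrete, the sketch below derives the YaRN scaling factor implied by a 131,072-token window. It rests on an assumption not stated in the post: that the base model's native window is 32,768 tokens. The dictionary keys follow the `rope_scaling` convention used in Hugging Face model configs; the actual training config is not shown in the post, so treat this as an illustration of the arithmetic only.

```python
# Values stated in the post.
TARGET_CONTEXT = 131_072
RECIPE = {
    "num_train_epochs": 4,
    "learning_rate": 1e-4,
    "lr_scheduler_type": "cosine",
}

# Assumption: the base model's native context window is 32,768 tokens.
NATIVE_CONTEXT = 32_768


def yarn_rope_scaling(target: int, native: int) -> dict:
    """Build a YaRN rope-scaling entry that stretches `native` to `target`."""
    if target < native:
        raise ValueError("target context must be at least the native context")
    return {
        "rope_type": "yarn",
        "factor": target / native,
        "original_max_position_embeddings": native,
    }


rope_scaling = yarn_rope_scaling(TARGET_CONTEXT, NATIVE_CONTEXT)
print(rope_scaling["factor"])  # 4.0 under the assumed native window
```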
Results on SWE-Review-Bench
SWE-Review-Bench has 1,384 instances, split by the strength of the model that produced the candidate patch — from a strong generator (GLM-5, “high”), whose mistakes are subtle and hard to catch, to a weaker one (Qwen3-30B, “low”), whose mistakes are easier to catch and more valuable to fix. Three measures: Code Review accuracy (right accept/reject decision), Diagnostic Accuracy (correctly identifying the root cause), and the downstream resolve-rate gain when the feedback is fed back for revision.
| Generator split | Code Review Acc. | Diagnostic Acc. | Resolve Rate Gain |
|---|---|---|---|
| GLM-5 (high) | 82.0% | 69.0% | +0.4pp |
| Coder-30B (medium) | 83.1% | 70.5% | +2.8pp |
| Qwen3-30B (low) | 87.2% | 76.5% | +8.3pp |
Two things stand out. First, the reviewer is most accurate, and most useful, on the weak-generator split: it reaches 87.2% review accuracy, 76.5% diagnostic accuracy, and a +8.3pp resolve-rate gain, so the loop turns a mediocre first attempt into a resolved issue far more often. Second, SFT is what unlocks it: diagnostic accuracy improves by +5.8pp to +30.6pp over the base model across splits, with the largest jump (+30.6pp) on the hardest tier, where off-the-shelf models are weakest.
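For readers who want the three measures spelled out, here is a small sketch of how they could be computed from per-instance records. The `Instance` fields and the four example records are invented for illustration; in particular, the post does not say how the benchmark judges a diagnosis, so `diagnosis_correct` is simply assumed to be a per-instance boolean.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    patch_is_correct: bool   # ground truth for the candidate patch
    reviewer_accepted: bool  # the reviewer's accept/reject decision
    diagnosis_correct: bool  # assumed: did the review identify the root cause?
    resolved_before: bool    # issue resolved by the first attempt
    resolved_after: bool     # issue resolved after revising with the feedback


def mean(flags: List[bool]) -> float:
    return sum(flags) / len(flags)


def review_accuracy(items: List[Instance]) -> float:
    # A decision is right when "accept" matches a correct patch
    # and "reject" matches an incorrect one.
    return mean([i.reviewer_accepted == i.patch_is_correct for i in items])


def diagnostic_accuracy(items: List[Instance]) -> float:
    return mean([i.diagnosis_correct for i in items])


def resolve_rate_gain_pp(items: List[Instance]) -> float:
    # Difference in resolve rate, in percentage points.
    before = mean([i.resolved_before for i in items])
    after = mean([i.resolved_after for i in items])
    return 100 * (after - before)


# Four made-up records, not benchmark data.
examples = [
    Instance(True, True, True, True, True),
    Instance(False, False, True, False, True),
    Instance(False, True, False, False, False),
    Instance(False, False, False, False, False),
]
```

On these toy records the reviewer makes three of four decisions correctly, diagnoses two of four, and the revision step resolves one extra issue.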
Cheaper test-time scaling
Because a good reviewer decides which attempt to trust, it also makes test-time scaling cheaper. Instead of independently resampling many full solutions, SWE-Review reaches a 38.4% resolve rate with only 2.44 samples on average, which is 3.7× the test-time-scaling gain at 6.7× the efficiency of naïve resampling. The verifier-style intuition from the SWE-Lego post returns here: selection is cheaper than generation, and a reviewer is a selector with reasons.
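One way a reviewer can cut the number of samples is to gate sampling on its verdict: draw candidates one at a time and stop at the first one it accepts. The post does not describe the exact procedure behind the 2.44-sample average, so the sketch below is an assumption about the general shape, with hypothetical `sample` and `accept` callables and an arbitrary `max_samples` budget.

```python
from typing import Callable, Tuple


def sample_until_accepted(
    sample: Callable[[], str],
    accept: Callable[[str], bool],
    max_samples: int = 8,
) -> Tuple[str, int, bool]:
    """Draw candidates one at a time; stop at the first accepted one.

    Returns (candidate, samples drawn, accepted?). If the budget runs out,
    the last candidate is returned with accepted=False.
    """
    candidate = ""
    for drawn in range(1, max_samples + 1):
        candidate = sample()
        if accept(candidate):
            return candidate, drawn, True
    return candidate, max_samples, False


# Toy usage: a scripted generator whose third attempt is the good one.
attempts = iter(["bad patch", "bad patch", "good patch"])
picked = sample_until_accepted(lambda: next(attempts), lambda c: c.startswith("good"))
```

Naive resampling would pay for the full budget on every issue; the gated version pays only until the reviewer is satisfied, which is where the savings come from when the reviewer's decisions are reliable.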
Using it
The model serves on vLLM with tensor parallelism of 4. A couple of practical notes from the card: use --tool-call-parser hermes (not qwen3_coder), and for OpenHands set OPENHANDS_LLM_NATIVE_TOOL_CALLING=true. There’s also a plugin so it can act as the reviewer inside Claude Code.
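The serving notes above translate into a launch command roughly like the one this sketch assembles. The model id is a placeholder because the post does not give the full Hub path, and `--enable-auto-tool-choice` is an addition assumed here (vLLM generally needs it for parsed tool calls) rather than something the post mentions; check the model card before relying on either.

```python
import shlex
from typing import List

# Placeholder: the post names the model but not its full Hub id.
MODEL_ID = "<org>/SWE-Review-30B-A3B"


def vllm_serve_command(model_id: str, tensor_parallel_size: int = 4) -> List[str]:
    """Assemble the argv for a vLLM server with Hermes-style tool-call parsing."""
    return [
        "vllm", "serve", model_id,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--enable-auto-tool-choice",      # assumption: not stated in the post
        "--tool-call-parser", "hermes",   # from the post: hermes, not qwen3_coder
    ]


# Environment variable the post recommends when driving the model from OpenHands.
OPENHANDS_ENV = {"OPENHANDS_LLM_NATIVE_TOOL_CALLING": "true"}

command = vllm_serve_command(MODEL_ID)
print(shlex.join(command))
```

The function only builds the command; passing the list to `subprocess.run` on a machine with vLLM installed and four GPUs would start the server.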
What it means for LegoFactory
LegoFactory’s pipeline ends at evaluation: collect, roll out, fine-tune, measure. SWE-Review points at what comes after the measurement — a review block that doesn’t just score a model but actively improves its output at inference time, and whose own review trajectories are just another stream of validated, agentic data the pipeline already knows how to produce. The generator and the reviewer are both blocks; closing the loop between them is the natural next composition.
The model and dataset are on the Hub: SWE-Review-30B-A3B and SWE-Review-Traj. To run the pipeline that produces this kind of data, start with the docs.