← Blog

Paper · 8–10 min read · 2026-06-11

SWE-Lego: Pushing the Limits of Supervised Fine-Tuning for Software Issue Resolving

A close read of the paper behind the pipeline — and why a careful SFT-only recipe still sets the pace for open issue-resolving models. arXiv:2601.01426.


Most of the recent progress on autonomous software engineering — agents that read a bug report, navigate a repository, edit code, and run tests until the suite goes green — has come wrapped in elaborate training stacks: a round of supervised fine-tuning, then reward modeling, then one or more rounds of reinforcement learning, each with its own infrastructure and failure modes. SWE-Lego asks a deliberately unfashionable question: how far can you get with supervised fine-tuning alone, if you do every part of it carefully?

The answer, it turns out, is far. On SWE-bench Verified — the human-filtered slice of SWE-bench where every task has a known-good patch and a reliable test — an 8B model trained with the SWE-Lego recipe resolves 42.2% of issues on its own, rising to 49.6% with test-time scaling. The 32B model reaches 52.6%, and 58.8% once you let it think across sixteen candidate solutions. Those are state-of-the-art numbers among open models of comparable size, reached without a single step of RL.

Left: SWE-Lego models plotted against other open issue-resolving systems by model size on SWE-bench Verified. Right: a bar chart breaking the 32B result down by ingredient.
Figure 1 (from the paper). Left: SWE-Lego models against open issue-resolving systems, by size. Right: the contribution of each ingredient — the dataset, refined SFT, and TTS@16 — building the 32B result from a 23.2% baseline to 58.8%.

This is the paper that motivates LegoFactory. The pipeline you see on the rest of this site — instance collection, trajectory rollout, SFT, evaluation — is the productionized, agent-operated form of the recipe described below. So it is worth understanding what the recipe actually is.

The bet: SFT is data-bound, not method-bound

The implicit claim behind a complex RL stack is that supervised learning has been wrung dry — that to go further you need a reward signal and on-policy exploration. SWE-Lego pushes back on this. Its position is that the ceiling people hit with SFT is usually a data ceiling and a loss ceiling, not a fundamental limit of the paradigm. Fix what the model is trained on, and fix which tokens it is asked to imitate, and the supervised baseline moves a long way up before RL ever enters the picture.

Concretely, the recipe rests on three building blocks: a curated dataset, a refined training procedure, and a test-time scaling stage. Each is simple in isolation; the result comes from getting all three right at once.

Building block 1 — Data that is real, varied, and verified

The dataset pairs two kinds of supervision. The first is task instances: roughly 32,000 of them, drawn from over 3,000 real repositories and filtered for the properties that make a task usable for training — a clear problem statement, a runnable environment, and tests that actually discriminate a correct fix from a broken one. The second is trajectories: about 18,000 full agent rollouts — the sequences of thoughts, tool calls, file edits, and test runs that carry a task from its initial broken state to a passing one.

A three-stage pipeline diagram: repository collection and sandbox construction, task instance creation and validation, and trajectory rollout and validation.
Figure 2. The SWE-Lego data pipeline: environment construction from 3,000+ repositories, test-driven task-instance creation (real and synthetic), and validated trajectory rollout. These are the same three stages LegoFactory runs as blocks — swegen and trajgen.

Two design choices matter here. First, the data mixes real and synthetic sources rather than betting on either alone: real issues bring the messy distribution of actual software work, while synthetic generation — bug injection, LLM rewrites, AST re-transformation — fills in coverage where naturally occurring data is thin. The payoff is concrete: adding synthetic instances per repository scales the volume of valid trajectories and steadily lifts resolve rate.

Two line charts. Left: number of valid trajectories versus number of repositories, for real-only data and real plus 1, 3, or 5 synthetic instances per repo. Right: resolve rate on SWE-bench Verified under the same conditions.
Figure 3. Mixing synthetic tasks into each repository scales the number of valid trajectories (left) and lifts resolve rate on SWE-bench Verified (right) — neither source alone is sufficient.

Second — and this is the part LegoFactory takes most seriously — the trajectories are validated, not merely collected. A trajectory only earns its place if it genuinely resolves the task: the edits apply, the previously failing tests pass (FAIL-TO-PASS), and the previously passing tests stay passing (PASS-TO-PASS). Unverified trajectories are exactly the kind of plausible-but-wrong supervision that quietly teaches a model bad habits, and the recipe refuses to pay that cost. This is why LegoFactory’s swegen stage only keeps tasks that pass a NOP/Oracle check, and trajgen only keeps trajectories that actually succeed.

Building block 2 — Training that respects what went wrong

Given good data, the question becomes how to learn from it. A naïve SFT run treats every token in every kept trajectory as something to imitate. SWE-Lego refines this in two ways.

Error masking

Even a trajectory that ultimately succeeds is rarely clean. The agent takes a wrong turn, runs a command that fails, edits the wrong file, backtracks, and only then converges. If you train on the raw sequence, you teach the model the recovery and the mistake that made recovery necessary. Error masking addresses this by withholding the training loss from the actions that represent missteps, while keeping the full trajectory in context — so the gradient flows through the parts worth imitating and not through the dead ends.

An example agent trajectory. A successful 'think' action is shaded as a loss mask; a later str_replace action that produces an ERROR is shaded as a gradient update without masking, illustrating which steps contribute to training loss.
Figure 5. Step-level error masking keeps the full trajectory in context but withholds the training loss on failed actions (the erroring str_replace), so the model imitates the recovery, not the misstep.

Difficulty-based curriculum

Not all of the 18k trajectories are equally hard to learn from. SWE-Lego orders training by difficulty — using trajectory length as a proxy in a multi-stage curriculum — letting the model consolidate the mechanics of issue resolving on easier instances before being asked to fit the long, branching trajectories that hard tasks produce. In the ablations, dropping either error masking or curriculum learning costs resolve rate on both the 8B and 32B models, and dropping both costs the most.

Building block 3 — Spending compute at test time

The third block is where the largest single jump in the headline numbers comes from. After fine-tuning, SWE-Lego trains a verifier — a model whose job is not to solve the task but to judge candidate solutions. At inference, the policy samples several attempts at a given issue; the verifier scores them; the best-scoring candidate is submitted. This is the meaning of TTS@k: test-time scaling with k sampled candidates.

Two charts of resolve rate versus number of rollouts. Left: generative versus regressive verifier for the 8B and 32B models. Right: SWE-Lego's verifier compared with existing verifiers against the Pass@K optimal curve.
Figure 9. Parallel test-time scaling. Left: a generative verifier beats a regressive one as rollouts grow. Right: SWE-Lego’s verifier against existing verifiers and the Pass@K optimal — closing much of the gap between what the policy can produce and what it submits.

The gains are substantial and consistent across model sizes. For the 8B model, test-time scaling lifts resolution from 42.2% to 49.6%. For the 32B model, TTS@16 lifts it from 52.6% to 58.8%. The intuition is that a well-trained policy already contains a correct solution among its samples more often than it manages to produce one on the first try — so the bottleneck shifts from generation to selection, and a good verifier is a cheap way to buy back that gap. The paper also finds a generative verifier (predict “yes/no”, then reason) outperforms a regressive one, and that the best strategy is sequential-then-parallel.

The numbers, in one place

On SWE-bench Verified:

Both are competitive with, or ahead of, open models in their size class — and both are reached with supervised fine-tuning plus a verifier, no reinforcement learning in the loop. That is the paper’s central, slightly provocative point: a large fraction of what people reach for RL to obtain is recoverable from better data, a more careful loss, and a modest amount of test-time search.

What it means for LegoFactory

A recipe is only as repeatable as the pipeline that feeds it. The reason SWE-Lego’s data story translates into LegoFactory is that every ingredient the paper relies on maps onto a stage you can run:

The paper shows that this sequence, done carefully once, reaches the state of the art. LegoFactory’s bet is that the more valuable property is being able to do it continuously — to keep collecting fresh instances as repositories evolve, keep validating trajectories, and keep retraining — without a human re-wiring the pipeline each time. That is what the block-wise, agent-operated design is for: every stage above is a self-contained block with a contract, so the whole recipe composes, and re-runs, like Lego.

If you want the full method and ablations, read the paper: SWE-Lego (arXiv:2601.01426). If you want to run the pipeline that produces this kind of data, start with the docs.