Getting code produced is the easy part of AI-assisted coding, and it no longer feels like the real bottleneck.
What becomes harder, and much more expensive, is when an agent commits too early to a bad plan, carries the wrong assumption through a multi-file change, or produces something that looks tidy enough to pass a quick glance while still being wrong in the ways that matter. That is where a lot of the real pain sits, and it is also where a lot of the shallower excitement around AI coding starts to wear thin quite quickly.
That is why GitHub Copilot Rubber Duck caught my attention. From GitHub’s announcement, it is being framed as a review step inside GitHub Copilot CLI rather than just another push towards more generation.
One model does the main work, then a different model family is brought in to challenge the plan, implementation, or tests at points where a second opinion might actually be useful. That is a much more sensible direction than simply asking the same agent to keep trying until it sounds convincing.
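The shape of that workflow is worth making concrete. The sketch below is purely illustrative and assumes nothing about Copilot's actual API: `generate` and `critique` are hypothetical stand-ins for calls to two different model families, wired together so that objections from the reviewer are fed back into the next generation rather than asking the original model to re-inspect its own unchanged reasoning.

```python
# Illustrative sketch of a generate-then-challenge loop.
# `generate` and `critique` are hypothetical stand-ins for two
# different model families; this is not GitHub's implementation.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Review:
    approved: bool
    objections: list[str] = field(default_factory=list)

def checkpointed_run(
    generate: Callable[[str], str],
    critique: Callable[[str, str], Review],
    task: str,
    max_revisions: int = 2,
) -> str:
    """Produce a plan, let a second model object, revise if needed."""
    plan = generate(task)
    for _ in range(max_revisions):
        review = critique(task, plan)
        if review.approved:
            return plan
        # The reviewer's objections become part of the next prompt,
        # so the revision is shaped by a different perspective.
        task = task + "\nAddress these objections:\n" + "\n".join(review.objections)
        plan = generate(task)
    return plan
```

The important design choice is the checkpoint: the critique runs at a defined point in the workflow, and its output changes what happens next, rather than being an after-the-fact comment on finished work.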
Why a second opinion actually helps
One of the awkward truths with coding agents is that they do not just make syntax mistakes or obvious implementation errors. They often fail more subtly than that, forming a neat, coherent, internally consistent plan around a weak premise and then executing it with enough confidence that the result looks credible right up until somebody properly checks whether it was the right thing to do.
That is a harder problem to deal with than basic code generation, because the issue is not that the model cannot produce output. The issue is that it can produce the wrong output in a very polished way, which is exactly why the idea of a second opinion becomes much more compelling.

Self-review is better than nothing, but it is still self-review. If the original framing was weak, asking the same model family to inspect its own reasoning only gets you so far. Bringing in a different model family to apply pressure is a more credible way to catch blind spots, especially when the task is large enough that the mistakes are no longer local.
What matters most here is not really the feature itself, but the admission behind it. GitHub is, in effect, recognising that more AI does not automatically mean better engineering. Sometimes what is needed is not more generation, but more friction: a pause, a challenge, and a second look from something that did not arrive at the same conclusion in the same way. That feels healthier than the usual pattern of treating every problem as a reason to add more autonomous behaviour.
What starts to matter on harder tasks
GitHub’s own write-up points to better results on harder, longer-running, multi-file tasks, which feels entirely plausible because those are exactly the situations where small judgement errors stop being small.
In a one-file change, a weak assumption might waste a few minutes. In a broader change across implementation, tests, structure, or surrounding conventions, it gets expensive very quickly. The code may still compile, the tests may still mostly run, and the pull request may still look productive, but none of that means the plan was any good.
That is one of the less comfortable things about agent-assisted development. The failure mode is often not chaos. It is plausible-looking output that gets further than it should, which is why a feature like Rubber Duck becomes more interesting in context. Not because it guarantees better outcomes, but because it creates a better chance of catching the sort of mistake that survives first inspection.

It only works if it stays targeted
There is still an obvious caveat. A feature like this only stays useful if it remains selective and sharp. If it becomes noisy, people will ignore it. If it produces generic critique, it becomes theatre. If it slows the workflow too much, it will get bypassed.
There is a fairly narrow path between valuable review and pointless ceremony, and plenty of AI tooling ends up drifting into the ceremony category faster than people like to admit.
So I do not think the value here is “second model equals better result” as some kind of universal rule. That is too tidy. The value is more practical than that. On the right tasks, at the right checkpoints, with a meaningful difference in perspective, a second opinion can catch the kind of issues that are expensive precisely because they look fine at first.
That is useful.
A better direction for AI-assisted engineering
More broadly, I think this points towards a better shape for AI-assisted engineering. Not one model doing everything in a straight line, but workflows built around generation, review, checkpoints, and deliberate challenge. That is much closer to how good engineering already works, because strong teams do not just produce. They review, question, and tighten decisions before those decisions spread.
Rubber Duck will not magically fix weak judgement, but it is a more serious response to the problem than pretending the answer is always more autonomy.
Anyone can get an agent to produce code. The real test is whether the workflow helps stop confident mistakes before they become expensive ones. That is a much better standard to design around, and GitHub seems to be edging a little closer to it here.