Claude’s Code Quality Conundrum Continues

A lot is going on at Anthropic. Access to the almost-fabled Mythos model remains restricted (despite some reports of unauthorized access), and nobody knows quite what is likely to happen or when in terms of its final rollout.

Developers, meanwhile, are left with their own challenges; last week’s “upgrade” to Opus 4.7 has left some software engineers already longing for a return to 4.6 with its less literal instruction interpretation and its perhaps less cautious use of safeguards and controls.

Then there’s the Claude quality conundrum in and of itself.

Root of the Problem?

Anthropic says it recognizes that users have reported getting “worsened responses” over the past month. In response, the organization confirms it has traced these reports to three separate changes that affected Claude Code, the Claude Agent SDK, and Claude Cowork.

The Claude API and the inference layer were not impacted.

All three issues have now been resolved as of April 20 (version 2.1.116), confirms Anthropic.

Promising to move ahead “differently,” the Claude team has gone to pains to explain how they will ensure similar issues are much less likely to happen again.

To return to the many developer reports suggesting that model performance at Anthropic has degraded (try Googling “claude code developers unhappy” and look to Reddit and HackerNoon for prime examples of what people are saying; add “Opus 4.7” to that search if you want the real nitty-gritty), the company has stated in a blog post that it “never intentionally degrades our models.”

“On March 4, we changed Claude Code’s default reasoning effort from high to medium to reduce the very long latency – enough to make the UI appear frozen – some users were seeing in high mode. This was the wrong tradeoff. We reverted this change on April 7 after users told us they’d prefer to default to higher intelligence and opt into lower effort for simple tasks. This impacted Sonnet 4.6 and Opus 4.6,” stated Anthropic.

Syncing Out Older Thinking

After this event, on March 26, the team shipped a change to clear Claude’s “older thinking” from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions.

A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. The team fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6.
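The difference between clearing once and clearing on every turn can be sketched in a few lines of Python. This is a hypothetical illustration of the bug pattern described above, not Anthropic's code: the `Session` class, the `stale` flag, and the `take_turn` function are all invented names, and the only detail taken from the article is the one-hour idle threshold.

```python
from dataclasses import dataclass, field

IDLE_THRESHOLD_S = 3600  # one hour of idleness, per the article


@dataclass
class Session:
    thinking: list = field(default_factory=list)  # reasoning retained from earlier turns
    last_active: float = 0.0                      # timestamp of the previous turn (seconds)
    stale: bool = False                           # hypothetical "needs clearing" flag
    clears: int = 0                               # how many times thinking was cleared


def take_turn(session: Session, now: float, thought: str, buggy: bool = False) -> None:
    """Process one turn, clearing older thinking if the session had gone idle."""
    if now - session.last_active > IDLE_THRESHOLD_S:
        session.stale = True
    if session.stale:
        session.thinking.clear()
        session.clears += 1
        if not buggy:
            # Intended behavior: clear once on resume, then carry on normally.
            session.stale = False
        # Buggy behavior: the flag is never reset, so every later turn clears again.
    session.thinking.append(thought)
    session.last_active = now


def simulate(buggy: bool) -> Session:
    """Two turns, a two-hour pause, then three more turns in quick succession."""
    session = Session()
    for now, thought in [(0, "t0"), (60, "t1"), (7260, "t2"), (7290, "t3"), (7320, "t4")]:
        take_turn(session, now, thought, buggy=buggy)
    return session


if __name__ == "__main__":
    for label, buggy in (("intended", False), ("buggy", True)):
        s = simulate(buggy)
        print(f"{label}: cleared {s.clears} time(s), retained {s.thinking}")
```

In the intended path the session keeps everything it reasoned about after resuming. In the buggy path only the latest turn survives, which is consistent with the "forgetful and repetitive" behavior users reported.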

“On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7,” stated Anthropic.

Aggregate Aggression

A lot is happening here, and it is happening to a lot of moving parts concurrently. Because each change affected a different slice of traffic on a different schedule, the aggregate effect looked like broad, inconsistent degradation.
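The dates Anthropic gives make the overlap easy to tabulate. The sketch below is a rough illustration only: the year is an assumption (the article gives none), each change is treated as active from its start date up to but not including its fix or revert date, and the labels are shorthand rather than Anthropic's terms.

```python
from datetime import date, timedelta

YEAR = 2026  # assumption: the article does not state the year

# (start, end) per change; end is the fix/revert date and is treated as exclusive.
CHANGES = {
    "default effort high -> medium": (date(YEAR, 3, 4), date(YEAR, 4, 7)),
    "idle-session thinking cleared every turn": (date(YEAR, 3, 26), date(YEAR, 4, 10)),
    "verbosity system prompt instruction": (date(YEAR, 4, 16), date(YEAR, 4, 20)),
}


def active_on(day: date) -> list:
    """Return the names of the changes in effect on a given day."""
    return [name for name, (start, end) in CHANGES.items() if start <= day < end]


def daily_counts(first: date, last: date) -> dict:
    """Map each day in [first, last] to the number of changes active that day."""
    span = (last - first).days + 1
    return {first + timedelta(days=d): len(active_on(first + timedelta(days=d)))
            for d in range(span)}


counts = daily_counts(date(YEAR, 3, 1), date(YEAR, 4, 23))
days_affected = sum(1 for c in counts.values() if c >= 1)
days_overlapping = sum(1 for c in counts.values() if c >= 2)

if __name__ == "__main__":
    for name, (start, end) in CHANGES.items():
        print(f"{name}: {(end - start).days} days")
    print(f"days with at least one change active: {days_affected}")
    print(f"days with two changes active at once: {days_overlapping}")
```

Under these assumptions the three windows never all coincide, but the first two overlap for nearly two weeks. Add the fact that each change hit a different set of models, and no single cause would have explained every complaint.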

Anthropic says early reports were hard to distinguish from normal variation in user feedback, and that neither its internal usage nor its subsequent evaluations initially reproduced the issues identified.

Because this clearly isn’t the experience users should expect from Claude Code, as of yesterday at the time of writing (April 23), the company has reset usage limits for all subscribers.

One of the challenges here is that (perhaps obviously) the longer the model thinks, the better the output generally is. Effort levels are how Claude Code lets users set that tradeoff: more thinking versus lower latency and fewer usage limit hits.

The Test-Time-Compute Curve

“As we calibrate effort levels for our models, we take this tradeoff into account in order to pick points along the test-time-compute curve that give people the best range of options. In the product layer, we then choose which point along this curve we set as our default, and that is the value we send to the Messages API as the effort parameter; we then make the other options available via /effort,” confirms the team.
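As a rough sketch of what "choose a default, let users opt into the rest" could look like in a product layer, consider the snippet below. Everything in it is an assumption made for illustration: the level names (the article mentions only high and medium, plus "lower effort"), the `resolve_effort` function, and the idea of validating input this way. It does not call or describe the actual Messages API.

```python
from typing import Optional

# Assumed level names, ordered from least to most thinking.
EFFORT_LEVELS = ("low", "medium", "high")

# The default after the April 7 revert, per the article.
DEFAULT_EFFORT = "high"


def resolve_effort(user_choice: Optional[str] = None,
                   default: str = DEFAULT_EFFORT) -> str:
    """Pick the effort value to send: the user's explicit choice, else the default."""
    if user_choice is None:
        return default
    choice = user_choice.strip().lower()
    if choice not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level {user_choice!r}; expected one of {EFFORT_LEVELS}")
    return choice


if __name__ == "__main__":
    print(resolve_effort())          # no explicit choice, so the product default applies
    print(resolve_effort("medium"))  # user opts into lower effort for a simple task
```

The design point is that the default is a product decision layered on top of the model. Changing that one constant from high to medium, as happened on March 4, alters behavior for everyone who never set a level themselves.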

Looking to the future, we can expect Anthropic to get it in the neck on a fairly regular basis, often because Claude Code is so widely used by the software application development cognoscenti. It won’t be hard to find naysayers and anti-platform protesters saying bad things (Trump’s tech and cyber czar @DavidSacks on X is a fairly vitriolic stream, if you’re sitting comfortably enough), and with an arguably less than effective ex-prime minister of Britain in the shape of Rishi Sunak on its advisory board, Anthropic might do well to ask Claude itself what the current sentiment among its user base is.

from DevOps.com https://ift.tt/v3iu47f
