
When an AI confidently gives you something wrong, what do you do?
Someone on Twitter this week put it simply: "5.4 xhigh is astonishingly good at coding. I find Opus 4.6 better at architecture and bug smashing but lazy."
I kept thinking about that framing because it maps to something I've noticed too.
These three models shipped within weeks of each other and on paper they're all "frontier coding models."
In practice they feel completely different depending on what you're trying to do.
GPT-5.4 just merged Codex's coding capability into the main model.
Opus 4.6 got a 1M context window and agent teams in Claude Code.
Cursor shipped Composer 2, the latest version of their fully custom model trained specifically for long-horizon coding tasks.
All three are competing for the same slot in your daily workflow. So I ran the same task through them. Here's what happened.
Dev Tools of the Week
1. Harbor
The official evaluation framework for Terminal-Bench 2.0. If you want to benchmark your own agent setup against the same harness Cursor and Anthropic used for their model releases, this is where you start.
2. Compaction
Anthropic's new context compaction feature for the Claude API. Automatically summarizes older context when a long-running agent approaches its limit, letting it keep working without hitting the wall.
3. Claude Code Channels
Use Claude Code Channels to push messages, alerts, and webhooks into your Claude Code session from an MCP server. Forward CI results, chat messages, and monitoring events so Claude can react while you’re away.
The Benchmark Picture
All three are publishing strong numbers right now and the leaderboard changes week to week, so here's what's relevant for coding specifically.
On Terminal-Bench 2.0, which tests real agentic terminal tasks rather than synthetic problems, Opus 4.6 leads at the top of the official leaderboard.
Composer 2 scores 61.7 on the same benchmark using the official Harbor harness.
GPT-5.4 sits between them depending on reasoning effort level.
On SWE-bench Multilingual, Composer 2 scores 73.7, which is a meaningful jump from Composer 1.5's 65.9.
Opus 4.6 leads on SWE-bench Verified.
But the numbers matter less to me than that original framing, so I put it to the test.
The Test
The setup: a multi-file Next.js codebase with a Postgres service layer, a Redis queue, and a webhook handler.
Prompt: "Trace the complete data flow for a user completing a challenge. Start from the API request, go through middleware, validation, business logic, XP calculation, database writes, and the webhook that triggers the frontend. Flag every place where a failure would be silent."
Result:
GPT-5.4 at xhigh

Fast and confident.
It traced the request accurately through auth middleware and challenge validation, then hit a wall around file 30 and fabricated a challengeRewardQueue.enqueue() call that doesn't exist in the codebase.
The actual implementation writes directly to Postgres and fires a Supabase edge function.
It built the rest of its trace around the phantom queue and never flagged any uncertainty.
Despite this, the silent-failure audit was very good.
It found three real ones: the XP write happening outside the transaction block, the webhook firing before the DB commit confirms, and a missing catch on the Redis deduplication check that swallows errors.
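For illustration, here's a minimal sketch of those three patterns. To be clear, these are hypothetical stand-ins, not the actual codebase: completeChallenge, the in-memory db/redis stubs, and fireWebhook-style logging are all invented to show the shape of each failure.

```typescript
type Result = { ok: boolean; xp: number };

// In-memory stand-ins (hypothetical) for Postgres, Redis, and the webhook.
const db = {
  committed: false,
  async transaction(fn: () => Promise<void>) {
    await fn();
    this.committed = true; // commit confirms only after fn resolves
  },
  async writeXp(_userId: string, _xp: number) { /* INSERT xp row */ },
};
const redis = {
  async isDuplicate(_key: string): Promise<boolean> {
    throw new Error("connection reset"); // simulate Redis being down
  },
};
// Records whether the DB commit had confirmed when the webhook fired.
const webhookLog: boolean[] = [];

async function completeChallenge(userId: string): Promise<Result> {
  // Failure 3: the dedup check swallows errors, so a Redis outage
  // silently disables deduplication instead of failing loudly.
  let duplicate = false;
  try {
    duplicate = await redis.isDuplicate(`challenge:${userId}`);
  } catch {
    /* swallowed — no log, no rethrow */
  }
  if (duplicate) return { ok: false, xp: 0 };

  await db.transaction(async () => {
    // Failure 2: the webhook fires inside the transaction body,
    // i.e. before the commit confirms. If the commit then fails,
    // the frontend was still told the challenge completed.
    webhookLog.push(db.committed);
  });

  // Failure 1: the XP write happens outside the transaction block,
  // so a crash right here leaves a "completed" challenge with no XP.
  await db.writeXp(userId, 50);
  return { ok: true, xp: 50 };
}
```

Run it and the call "succeeds" even though Redis threw, and the webhook log shows it fired pre-commit. None of the three problems surface as errors, which is exactly what makes them silent.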
But the hallucination was confident and uncorrected: when GPT-5.4 drifts, it doesn't tell you. That's the risk here.
Total loop time including my corrections: ~10 minutes.
Opus 4.6 via Claude Code
Before generating anything it asked one question: "Does the XP calculation happen synchronously in the request or is it deferred? This changes where the failure modes sit."
The trace was architecturally accurate across all the files.
It correctly identified that the XP logic runs in a service function that wraps its own try-catch and returns a default value on failure, meaning a broken XP calculation silently gives the user 0 XP with no error surfaced anywhere.
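That pattern is worth seeing in miniature. This sketch is invented for illustration (the newsletter doesn't show the real service, so xpForChallenge and calculateXp are hypothetical names), but it captures the shape of the bug:

```typescript
// Hypothetical XP calculation that can throw on bad input.
function xpForChallenge(difficulty: number): number {
  if (difficulty < 1) throw new Error(`invalid difficulty: ${difficulty}`);
  return difficulty * 25;
}

// The silent-default pattern: the service catches its own errors and
// returns a fallback, so a broken calculation quietly awards 0 XP
// with no error surfaced anywhere.
function calculateXp(difficulty: number): number {
  try {
    return xpForChallenge(difficulty);
  } catch {
    return 0; // the bug disappears here: no log, no rethrow
  }
}

calculateXp(3); // → 75
calculateXp(0); // → 0, and nothing upstream knows anything went wrong
```

The fix is usually to let the error propagate, or at minimum log it before falling back, so the failure is observable instead of indistinguishable from a legitimate zero.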
Seven silent failure findings in total, ranked by severity, one paragraph of explanation each.
The webhook section had one misread: it described the trigger as event-driven when it's a 2-second polling loop.
A misinterpretation rather than a fabrication, which matters because it's fixable with one correction instead of a full rewrite.
The lazy reputation holds for shallow tasks. On something with ambiguity it justified the time.
Total loop time: 19 minutes.
Cursor Composer 2
Composer 2 didn't produce a report.
It opened four files simultaneously and started annotating them inline. Before I had even asked about silent failures, it dropped a comment directly in the code: // NOTE: XP write is outside transaction — silent failure risk.
Same finding as Opus, surfaced as an inline edit rather than a line in a report.
It referenced a notificationService.triggerChallengeComplete() that doesn't exist, similar to GPT-5.4's phantom queue.
The difference: two responses later, without being prompted, it said "I referenced notificationService above but I don't see that in the provided context. You may want to verify this exists."
It caught its own drift.
The inline annotation workflow is a different experience.
For documenting or sharing findings with a team, it's a pain to collect.
For acting on the findings immediately, it's the fastest path from "I need to understand this" to "I've made the edits."
Total loop time: 12 minutes.
Winner: TIE (and that’s the point).
TLDR

GPT-5.4 for initial exploration of an unfamiliar codebase because the speed is real and the large context window holds more than it used to.
Opus 4.6 when accuracy is p0 and you have the extra minutes and budget to match.
Composer 2 when you're going to act on the findings immediately rather than document them.
That commenter was right.
These models have different personalities and the one you reach for should depend on what you're trying to do.
My Take
The model race has gotten weird in a good way.
A year ago the question was "which model is smartest?"
Now the question is "which model fits this workflow?"
Composer 2 being faster and cheaper than Opus 4.6 on this task while missing the depth is a design choice.
Cursor built it for a specific interaction pattern and it's very good at that pattern.
The self-correction behavior has me hooked.
GPT-5.4 hallucinated with confidence while Composer 2 hallucinated and caught itself.
Opus 4.6 didn't hallucinate, it just misread.
Those are three different failure modes and they have different implications for how much you can trust the output without reviewing it yourself.
Benchmark scores measure a controlled race.
What I ran measures something closer to how these tools actually behave in production.
Until next time,
Vaibhav 🤝🏻
If you read this far, you might find this interesting:
#Partner 1
Build a LinkedIn Growth Routine That Actually Compounds
Same post. Same person. Completely different results.
The difference? A growth routine.
Taplio is the all-in-one LinkedIn tool that helps you build a repeatable system: find proven content ideas in your niche, write posts faster with AI that matches your voice, engage with the right people using Smart Reply, and track what's working so you can do more of it.
Creators like Amanda Goetz used Taplio to grow 30,000+ followers. Teams like lemlist used it to generate over $3M in pipeline from LinkedIn.
Try Taplio free for 7 days and get your first month for $1 with code BEEHIIV1X1.
#Partner 2
Stop typing prompts. Start talking.
You think 4x faster than you type. So why are you typing prompts?
Wispr Flow turns your voice into ready-to-paste text inside any AI tool. Speak naturally - include "um"s, tangents, half-finished thoughts - and Flow cleans everything up. You get polished, detailed prompts without touching a keyboard.
Developers use Flow to give coding agents the context they actually need. Researchers use it to describe experiments in full detail. Everyone uses it to stop bottlenecking their AI workflows.
89% of messages sent with zero edits. Millions of users worldwide. Available on Mac, Windows, iPhone, and now Android (free and unlimited on Android during launch).






