
If a model is 10x cheaper but occasionally hallucinates bugs that don't exist, would you use it?


I've been an Opus user since day one.

For production code I want the best, most trustworthy model available, and Opus 4.6 has been exactly that.

But then two things happened in the last week:

  • MiniMax out of Shanghai dropped M2.5. It's open source, a fraction of the cost, and benchmarks say it's nearly as good.

  • Anthropic released Sonnet 4.6, their own mid-tier model, now closing the gap with Opus fast.

Let me show you what happened. But first, a set of dev tools that caught my attention:

Dev Tools of the Week

A platform that lets users deploy and run OpenClaw-style AI agents on their own infrastructure. It focuses on giving teams a simple way to host, manage, and scale autonomous AI agents without needing a complex setup.

A platform from Vercel that lets teams run background AI agents and long-running tasks alongside their apps. It focuses on making it easier to deploy, manage, and scale agent workflows without building custom infrastructure.

An open-source agentic development environment designed for building and testing AI agents. It provides tools for managing workflows, debugging agent behavior, and running autonomous systems in a structured development setup.

The test

All three models score within 1.2% of each other on SWE-Bench. I wanted to see if that holds up when you actually use them.

Test 1: Build a working app in one shot

Prompt:

Build a single HTML file: a personal expense tracker with the following features. Add/delete transactions with amount, category, and date. Show a running balance. Group expenses by category with a pie chart. Dark mode toggle. Store everything in localStorage so it persists on refresh. Make it look polished, not like a tutorial project.

Result:

Opus produced a clean, minimal "Ledger" app with a two-column layout: transaction form on the left and spending breakdown on the right.

It hand-rolled an SVG donut chart and used zero external dependencies.

All features worked on the first try, but the header scrolls away with the page and there's no mobile breakpoint.

Sonnet took the same hand-rolled chart approach but made sharper product decisions: a sticky navbar, a sidebar layout that feels like a real app, and a responsive breakpoint for mobile.

It also shipped dark mode with a clean toggle, visible as the default view in the screenshot.

It shipped the most complete UI of the three.

MiniMax went a different direction: colorful stat cards, Chart.js CDN instead of building the chart from scratch, and two UX features the others missed: dynamic category dropdowns that swap based on income vs expense, and a slide-out delete animation.

It did default to $ instead of ₹, a minor localization miss, though forgivable given the prompt didn't specify a currency.

Winner: Sonnet

Test 2: Find every bug in this code

I wrote deliberately broken Python code with 7 planted issues: no missing imports, no syntax errors.

The bugs only surface in production under real load or with specific inputs: timing attacks, memory leaks, shallow-copy mutations, floating-point traps, and the like.
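The test file itself isn't reproduced in this post, but to make the bug classes concrete, here's a minimal sketch of two of them (function and variable names are mine, not from the actual test):

```python
import copy
import hmac

SECRET_TOKEN = "s3cret"

# Bug class 1: timing attack. `==` short-circuits on the first
# differing character, so response time leaks how much of the
# token matched. Passes code review, fails a security audit.
def check_token_buggy(token: str) -> bool:
    return token == SECRET_TOKEN

# Fix: constant-time comparison.
def check_token_fixed(token: str) -> bool:
    return hmac.compare_digest(token, SECRET_TOKEN)

DEFAULT_SETTINGS = {"retries": 3, "flags": {"debug": False}}

# Bug class 2: shallow copy. dict.copy() shares the nested "flags"
# dict, so mutating one caller's settings silently corrupts the
# module-level default for everyone else.
def settings_buggy(**overrides):
    s = DEFAULT_SETTINGS.copy()
    s["flags"].update(overrides)   # mutates DEFAULT_SETTINGS too
    return s

# Fix: deep-copy before mutating.
def settings_fixed(**overrides):
    s = copy.deepcopy(DEFAULT_SETTINGS)
    s["flags"].update(overrides)
    return s
```

Both buggy versions run fine in a unit test and only hurt you later, which is exactly the point of the exercise.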


Prompt:

Review this code. Identify every bug, security vulnerability, and subtle logic issue. These aren't syntax errors. They're the kind of bugs that pass code review and break in production. Explain each issue and provide the fixed version.

Result:

Opus structured its output as a numbered table of all 10 issues it found (the 7 planted bugs plus 3 bonus issues), with severity ratings and categories, before diving into each fix.

Zero false positives and completely fixed code.

Sonnet went issue-by-issue in narrative form, catching 5 of the 7 planted bugs.

It missed the float math in apply_bulk_discount and the swallowed traceback, but found 3 bonus issues and had zero false positives.

MiniMax caught 6 of the 7 planted bugs, missing the lru_cache memory leak, the hardest bug in the set.

It also hallucinated a syntax error (0.0.95) that doesn't exist in the code.

Winner: Opus

Test 3: Generate production docs from a function signature

Gave all the models a function signature with zero implementation.

Prompt:

Write complete API documentation for this function. Include: a description of what it does (infer from the signature and parameter names), all parameters with types and defaults explained, return value, possible exceptions, at least 4 usage examples covering common and edge cases, and a "gotchas" section for things a developer might trip on. Write it as if this is going into your team's internal docs.

Result:

Opus structured its docs as a clean parameter table with full type and default information, then went deep on gotchas, including non-obvious ones like ttl=0 not busting existing cache, and conflict_cb being silently ignored in overwrite mode.

It got the callback signature right (sku, local, remote), with 5 runnable examples and 7 gotchas.

Sonnet took a more narrative approach: a proper description section, then per-parameter subsections.

Its callback signature was mostly right (sku, existing, incoming), but it had the callback return None, meaning side effects only, with no actual conflict resolution.

5 runnable examples, 6 strong gotchas, and the docs are clean and shippable.

MiniMax invented a "subtract" strategy that doesn't exist, got the callback signature wrong, hallucinated exceptions (InsufficientInventoryError, RateLimitError), and assumed sku_list accepts dicts when the name suggests strings.
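The original signature isn't reproduced in this post, so the sketch below is a hypothetical reconstruction from the details above (the function body, types, and merge rule are all illustrative). The point is the callback contract: it takes (sku, local, remote) and must return the merged value, which is why Sonnet's None-returning version can't actually resolve conflicts:

```python
from typing import Callable, Optional

# Hypothetical: callback receives the sku plus both conflicting
# records and RETURNS the winner (or a merged record).
ConflictCb = Callable[[str, dict, dict], dict]

def sync_inventory(
    sku_list: list[str],                   # SKUs as strings, not dicts
    local: dict[str, dict],
    remote: dict[str, dict],
    overwrite: bool = False,
    conflict_cb: Optional[ConflictCb] = None,
) -> dict[str, dict]:
    merged = {}
    for sku in sku_list:
        if sku in local and sku in remote and local[sku] != remote[sku]:
            if overwrite or conflict_cb is None:
                # In overwrite mode the callback is silently ignored,
                # one of the gotchas Opus documented.
                merged[sku] = remote[sku]
            else:
                merged[sku] = conflict_cb(sku, local[sku], remote[sku])
        else:
            merged[sku] = remote.get(sku, local.get(sku, {}))
    return merged
```

A caller would pass something like `lambda sku, l, r: l if l["qty"] >= r["qty"] else r`; docs that tell them to return None instead would break that silently.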

Winner: Opus

The scoreboard

The cost part

MiniMax is 10x cheaper than Sonnet on input tokens and about 12.5x cheaper on output. Here's what that looks like through the API:

Opus is about 65% more expensive than Sonnet, and that adds up fast at scale.
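To put numbers on those ratios: the sketch below assumes a hypothetical Sonnet baseline of $3 per million input tokens and $15 per million output (check current pricing pages for real figures); only the multipliers come from the comparison above.

```python
# Hypothetical Sonnet baseline ($ per million tokens); the ratios
# are the ones quoted above: Opus ~65% pricier, MiniMax 10x cheaper
# on input and 12.5x cheaper on output.
SONNET_IN, SONNET_OUT = 3.00, 15.00

PRICES = {
    "sonnet":  (SONNET_IN, SONNET_OUT),
    "opus":    (SONNET_IN * 1.65, SONNET_OUT * 1.65),
    "minimax": (SONNET_IN / 10, SONNET_OUT / 12.5),
}

def monthly_cost(model: str, in_mtok: float, out_mtok: float) -> float:
    """Dollar cost for a workload given in millions of tokens."""
    price_in, price_out = PRICES[model]
    return in_mtok * price_in + out_mtok * price_out

# Example: a team pushing 100M input + 20M output tokens a month.
for model in PRICES:
    print(f"{model:8s} ${monthly_cost(model, 100, 20):,.2f}")
```

At that hypothetical volume the gap between the cheapest and priciest option is more than an order of magnitude, which is why the reliability question matters so much.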

My take

Opus won 2 out of 3 tests.

But most of what I was using Opus for didn't need it.

For building apps, Sonnet actually made better product decisions and it costs 40% less.

MiniMax is tempting because it is 10x cheaper, open source, and the benchmarks say it's neck and neck with Opus.

Even in our tests, it put up a good fight until it invented a syntax error that doesn't exist in the code and documented a strategy the function never had.

The 1.2% SWE-Bench gap reflects how models handle well-defined coding tasks, but it says nothing about what happens when the task is ambiguous and the model needs to be careful.

My setup now: I'm keeping Opus for code review and anything where precision matters, and switching to Sonnet for everyday app building.

The expensive model should be a scalpel, not a default.

Reply below: Which model are you using for coding? And would you switch for 10x savings?

Until next time,
Vaibhav 🤝🏻

If you read till here, you might find this interesting

#Partner 1

Meet America’s Newest $1B Unicorn

A US startup just hit a $1 billion private valuation, joining billion-dollar private companies like SpaceX, OpenAI, and ByteDance. Unlike those other unicorns, you can invest.

Why all the interest? EnergyX’s patented tech can recover up to 3X more lithium than traditional methods. That's a big deal, as demand for lithium is expected to 5X current production levels by 2040. Today, they’re moving toward commercial production, tapping into 100,000+ acres of lithium deposits in Chile, a potential $1.1B annual revenue opportunity at projected market prices.

Right now, you can invest at this pivotal growth stage for $11/share. But only through February 26. Become an early-stage EnergyX shareholder before the deadline.

This is a paid advertisement for EnergyX Regulation A offering. Please read the offering circular at invest.energyx.com. Under Regulation A, a company may change its share price by up to 20% without requalifying the offering with the Securities and Exchange Commission.

#Partner 2

World’s First Safe AI-Native Browser

AI should work for you, not the other way around. Norton Neo is the world's first safe AI-native browser with context-aware AI, built-in privacy, and configurable memory. Zero-prompt productivity that actually works.
