Nº 061 · AI · 5 min read · March 31, 2026

I've Been Using Claude 4 Every Day. Here's What Actually Changed.


Anthropic released Claude 4 — Sonnet and Opus — and I've been running it daily in my actual production work for weeks now.

Not in demos. Not in benchmark comparisons. In the things I actually need to get done.

Here's what changed for me.

It Stopped Arguing With Me

The older Claude versions had this habit of over-qualifying everything. You'd ask for something specific and it would give you three versions with a paragraph explaining why it chose each one. Helpful sometimes. Exhausting most of the time.

Claude 4 makes a decision and explains it once, briefly, only if the decision isn't obvious. That's a significant shift in working rhythm.

When you're building something — a script, a pipeline, a campaign — you don't want a committee. You want a collaborator who has an opinion and moves.

The Reasoning Is Different

I'm not a developer. I never wanted to be. But I now run automated video pipelines, content systems, and client-facing tools that I built with AI assistance.

With previous models, there was a ceiling. Complex logic would break down. You'd get something that looked right but failed in production.

With Claude 4, the ceiling moved. I've given it multi-step problems — "build this system that does X, then Y, then handles this edge case" — and it holds the full context without dropping threads halfway through.

For non-technical people trying to build actual things with AI, that reliability is the whole game.

What I Use It For (Specifically)

In the last month: automated video production pipelines (Remotion + ElevenLabs + Gemini), a client-facing website for a medical AI agency, cold email sequences for a lead gen product, and social content systems that run on their own.
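To make that concrete, here is roughly what one step of the video pipeline looks like: a small script that turns a line of narration into audio that a Remotion composition can pick up. This is a minimal sketch, not my production code; the voice ID, model choice, and file paths are placeholder assumptions, and it needs Node 18+ run as an ES module.

    // Sketch: generate narration audio with the ElevenLabs text-to-speech API,
    // then save it where a Remotion composition can reference it.
    // VOICE_ID, model_id, and the output path are illustrative assumptions.
    import { writeFile } from "node:fs/promises";

    const VOICE_ID = "YOUR_VOICE_ID"; // assumption: a real ElevenLabs voice ID
    const API_KEY = process.env.ELEVENLABS_API_KEY ?? "";

    async function narrate(text: string, outPath: string): Promise<void> {
      const res = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}`, {
        method: "POST",
        headers: { "xi-api-key": API_KEY, "Content-Type": "application/json" },
        body: JSON.stringify({ text, model_id: "eleven_multilingual_v2" }),
      });
      if (!res.ok) throw new Error(`TTS request failed: ${res.status}`);
      // The response body is raw audio bytes; write them to disk.
      await writeFile(outPath, Buffer.from(await res.arrayBuffer()));
    }

    // A Remotion composition can then use <Audio src={staticFile("narration.mp3")} />.
    await narrate("Today's script, read aloud.", "public/narration.mp3");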

None of that was possible for me two years ago. Not because the ideas didn't exist — because the execution required a technical team I couldn't afford.

Claude 4 is not just a better chatbot. It's the difference between having an idea and actually shipping it.

The One Thing It Still Gets Wrong

Context management over very long sessions. If you're working on a large codebase across multiple hours, it still loses detail from early in the conversation. The workaround is disciplined documentation: keeping explicit notes that you feed back in when needed.

It's a real limitation. It's also manageable if you structure your work correctly.

I've learned to treat every session as a handoff. Summarize where you are, what decisions were made, what comes next. It takes two minutes and prevents a lot of frustration.
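Mine usually looks something like this (the project details here are invented for the example):

    Where we are: render pipeline works end to end; audio sync still drifts on long scenes.
    Decisions made: ElevenLabs for narration over Google TTS; one composition per scene.
    Next: fix the drift, then wire the script generator into the render queue.

Paste that at the top of the next session and the model picks up where you left off.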

The Honest Assessment

I've tried GPT-4o. I've tried Gemini. I keep coming back to Claude.

Not for any feature that a spec sheet would highlight. For how it writes. How it reasons through ambiguity. How it handles creative direction instead of just technical instruction.

I direct commercials. I write sketches. I compose music. The model I use most is the one that sounds least like a machine when I ask it to help me think.

Claude 4 is that model right now.

That can change. It probably will. But today, this is what I'm using — and that's worth saying out loud.

The Uses Nobody Talks About

Where I get the most value out of Claude 4 is not the obvious stuff. It is not coding assistance, even though I use it for that too. It is the creative-judgment moments that used to require a peer on the phone.

Reading a treatment I wrote three days ago and asking it whether the second act earns the ending. Submitting a client's brief and asking what is missing from their stated goals. Looking at a transcript of a creative call and asking which decision the team avoided making. Pasting a piece of copy and asking what a senior creative director would push back on.

None of these are tasks. They are second-opinion checks. The ones I used to call a friend for at 11 pm and feel guilty about. Claude 4 is good enough at this kind of read that I do not feel like I am wasting the question, which is the bar for whether a tool gets used or quietly abandoned.

The Boring Metric That Actually Matters

The benchmark I care about is not on any leaderboard. It is the ratio of times I close the chat satisfied to times I close it frustrated. With previous models, that ratio was somewhere around 60-40 in favor of satisfaction. With Claude 4, it has shifted closer to 85-15. That twenty-five-point swing is the difference between a tool I reach for and a tool I avoid.

That is the metric to watch as you evaluate any AI tool: not what it does in a demo, but how often you actually want to use it once nobody is watching.

Sources: Anthropic — Claude | Anthropic News
