Yes, those two models were tested on my own PC (local inference using my own CPU/GPU), so something may be bugged in my setup. gemma4-26b should be far better than gemma4-e4b.
Your benchmark has Opus 4.7 performing significantly worse than Sonnet 4.6. Even if that's true on your benchmark, it's not representative of the models' overall performance.
Just tried it out for a prod issue I was experiencing. Claude never does this sort of thing: I had it write an update statement after doing some troubleshooting, and when I said “okay, let’s write this in a transaction with a rollback”, GPT-5.5 gave me the old “okay”:
BEGIN TRAN;
-- put the query here
commit;
I feel like I haven’t had to prod a model to actually do what I told it to in a while, so that was a shock. I guess it does use fewer tokens that way; it’s just annoying, when I’m paying for the “cutting edge” model, to have it be lazy on me like that.
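For the record, what I was asking for looks more like the sketch below (illustrative only: the table, column, and error message are made up, not from my actual prod query):

BEGIN TRY
    BEGIN TRANSACTION;

    -- the real UPDATE from the troubleshooting session goes here;
    -- this one is a placeholder
    UPDATE dbo.Orders
    SET Status = 'Cancelled'
    WHERE OrderId = 42;

    -- sanity-check the blast radius before committing
    IF @@ROWCOUNT <> 1
    BEGIN
        ROLLBACK TRANSACTION;
        THROW 50000, 'Unexpected row count; rolled back.', 1;
    END

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- any error rolls everything back before re-raising
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    THROW;
END CATCH;

That way the statement either commits after the row-count check passes or rolls back as one piece, which is the whole point of asking for a transaction.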
This was in Cursor; the model popped up in the model selector, so I tried it out.
I feel like the last 2-3 generations of models (after gpt-5.3-codex) didn't really improve much; they just shuffled things around and made different tradeoffs.
I disagree: it improved enormously, especially at staying consistent on long tasks. I have a task that has been running for 32 days (400M+ tokens) via Codex, and that's only been possible since gpt-5.4.
Oh boy, you are far from what it requires; we are probably talking 3B+ tokens. But note that this is just Codex, and obviously Codex is also doing automatic adversarial review with the regular zoo (gemini-3.1-pro-preview, opus-4.6/4.7, gpt-5.3-codex, minimax-2.7, glm-5.1, mimo-2 (now 2.5), and so on, you get the gist) :)
Coding (along with docs and tests, obviously): rewriting a huge chunk of the KVM hypervisor (in Kernel 7, started in the -rc2) plus KSM and other modules. Can't say too much about it yet (might do an announcement in the coming weeks). The coding is automated, but the plan took days of manual arguing (with all models possible) beforehand, while doing other things during the waiting times, as I currently manage 70 repos for an upcoming release of our Beta.
I think users really underestimate the capabilities of "AI" when using the right tooling and combinations of models and procedures. That's speaking with two decades of dev behind me: I'm genuinely not on the same page as people saying it produces slop of any kind. At this stage it's mostly the fault of the prompter (or of the prompter not having enough tokens to do mass adversarial runs). I can genuinely state that the code produced is overall the SAME quality as what I would produce by being extremely meticulous.
I'm like a bot following 30+ threads concurrently. Sometimes it's fun, sometimes it feels like playing the casino, sometimes it's boring, but this is truly an insane era if you have the funding for it. Obviously we stack many, MANY accounts in rotation 24/7; the equivalent API cost for me alone would be $100K+, and we pay only a fraction of that thanks to the plans.
OpenAI is the first company that has reached a level of intelligence so high that the model has finally become smart enough to make YOU do all the work. Emergent behavior in action.
Jokes aside, OpenAI’s oddly specific singular focus on “intelligence per token” (also in the benchmarks), which literally no one else pushes so hard, eerily reminds me of Apple’s MacBook anorexia era pre-M1: one metric to chase at the cost of literally anything else. GPT-5.3+ are some of the smartest models out there and could be a pleasure to work with, if they weren’t lazy bastards to the point of being completely infuriating.
Enterprise user here and still seeing only 5.4.
Yesterday's announcement said it would take a few hours to roll out to everybody. OpenAI needs better GTM to set the right expectations.
>API deployments require different safeguards and we are working closely with partners and customers on the safety and security requirements for serving it at scale.
And now this. I guess one day counts as "very soon." But I wonder what became of those safeguards and security requirements.
I wonder if the fact that GPT-5.5 was already available in their Codex-specific API, which they had explicitly told people they were allowed to use for other purposes - https://simonwillison.net/2026/Apr/23/gpt-5-5/#the-openclaw-... - accelerated this release!
The same person who mercilessly lied about safety is still running the company, so I'm not sure why anyone would expect anything different from them going forward. Previous example:
> In 2023, the company was preparing to release its GPT-4 Turbo model. As Sutskever details in the memos, Altman apparently told Murati that the model didn’t need safety approval, citing the company’s general counsel, Jason Kwon. But when she asked Kwon, over Slack, he replied, “ugh . . . confused where sam got that impression.”
You can't, but it's pretty reproducible across the API, Codex, and other agents, so I just thought it was odd. The full text it gives:
Knowledge cutoff: 2024-06
Current date: 2026-04-24
You are an AI assistant accessed via an API.
# Desired oververbosity for the final answer (not analysis): 5
An oververbosity of 1 means the model should respond using only the minimal content necessary to satisfy the request, using concise phrasing and avoiding extra detail or explanation.
An oververbosity of 10 means the model should provide maximally detailed, thorough responses with context, explanations, and possibly multiple examples.
The desired oververbosity should be treated only as a *default*. Defer to any user or developer requirements regarding response length, if present.
I don't know why this keeps coming up. Asking the model has always been the least reliable way to learn the cutoff date (and indeed, it may well have been trained on sites with comments like these!).
Just ask it about an event that happened shortly before Dec 1, 2025. Sporting event, preferably.
That sort of test isn't super reliable either, in my experience.
You're probably better off asking something like "what are the most notable changes in version X of NumPy?" and repeating until you find the version at which it says "I don't know" or hallucinates.
I wonder if they put an older cutoff date into the prompt intentionally, so that when it's asked about more current events it leans towards tool calls / web searches, as a form of tuning.
I wonder if the cutoff date is the result of so many people posting about the date over time and poisoning the data. "Dead cutoff date theory," perhaps.
Whatever it is, the cutoff date reporting discrepancy isn't new. Back when Musk was making headlines about buying/not buying Twitter, I was able to find recent-ish related news that was published well after the bot's stated cutoff date.
ChatGPT was not yet browsing/searching/using the web at that point. That tool didn't come for another year or so.
Yes. High-value work where cost (mostly) doesn't matter. For example, if I need to look over a legal doc for possible mistakes (part of a workflow I have), it doesn't matter (in my case) whether it costs $0.01 or $10.00, since it's a somewhat infrequent event. So I'll pay $9.99 more, even if the model is only slightly better.
I'm surprised I never hear people talking about using the -Pro variants, even though their rates ($125-175/M?) aren't drastically higher than old Opus ($75/M), which people did seem to use.
> a lot of doctors are using ChatGPT both to search for diagnoses and to communicate with non-English-speaking patients
I think that's the problem. Who's going to claim responsibility when ChatGPT hallucinates or mistranslates a patient's diagnosis and they die? For OpenAI that would, at best, be a PR nightmare, which is why they have safeguards.
Adults bear responsibility for choices about their own lives. In fact, the more educated they are, the better choices they can make.
A doctor who gets refused by ChatGPT doesn't stop needing to communicate with the patient; they fall back to a worse option (Google Translate, a family member interpreting, guessing). Refusal isn't safety; it's liability-shifting dressed up as safety.
If there's no doctor, no interpreter, no pharmacist, just a person with a sick kid and a phone, then "refuse and redirect to a professional" is advice from a world that doesn't exist for them. The refusal doesn't send them to a better option; there is no better option. And that's the situation for a large majority of people on this planet.
The road to hell is paved with good intentions, but open education and unlimited access to knowledge are very good.
It doesn't change human nature: bad people stay bad, good people stay good.
As for PR: they're optimizing for not being the named defendant in a lawsuit or the subject of a bad news cycle. It's self-interest wearing benevolence as a costume.
This is because harms from answering are punishable (bad PR, unhappy advertisers, unhappy investors, unhappy politicians/dictators, unhappy lobbies, unhappy militaries, etc.), while harms from refusing are invisible and unpunished.
I know it's only a single benchmark, but I don't understand how it can be so bad...
Your prompt is extremely slim, yet you score it on a bunch of features.
The eval prompt is quite extensive: https://github.com/guilamu/llms-wordpress-plugin-benchmark/b...
I really like this benchmark. Have you evaluated the judge somehow? I'd love to set up a similar benchmark of my own.
I haven't evaluated the judge. You have everything needed in the repo to do so, though, so be my guest. It took me a bit of time to put all this together, and I won't have much more time to dedicate to it for a couple of weeks.
BTW, if you explore the repo, sorry for all the French files...
The models not available on Copilot were tested through opencode (max reasoning), and DeepSeek v4 was tested through Cline (also with max reasoning).
All the AI players definitely seem to be trying to claw more money out of their users at the moment.
30/180 USD on OpenRouter. Did I miss something?
Lots of cases where Altman has not been entirely forthcoming about how important (or not) safety is for OpenAI. https://www.newyorker.com/magazine/2026/04/13/sam-altman-may... (https://archive.is/a2vqW)
Could be they do it intentionally to encourage more tool calls/searches, or for tuning reasons.
Easiest Turing test ever...
A better test is something like "what is the latest version of NumPy?"
The proper way to figure out the real cutoff date is to ask the model about things that did not exist or did not happen before the date in question.
A few quick tests suggest 5.5's general knowledge cutoff is still around early 2025.
Where I live, for example, a lot of doctors are using ChatGPT both to search for diagnoses and to communicate with non-English-speaking patients.
Even you, when you want to learn about a disease, real-world threats, statistics, self-defense techniques, etc.
Otherwise it's like blocking Wikipedia on the grounds that the knowledge could be used to do harmful things, or that you might read something that changes your mind.
Freedom to read about things is good.
If I had a choice between a doctor that used AI and one that didn't, I would much prefer the one that did...