Evaluating the Model - Search News

Has Gemini surpassed ChatGPT? We put the AI models to the test.

For this test, we’re comparing the default models that both OpenAI and Google present to users who don’t pay for a regular ...

InfoWorld

Alibaba’s Qwen3-Max-Thinking expands enterprise AI model choices

The company claims the model demonstrates performance comparable to GPT-5.2-Thinking, Claude-Opus-4.5, and Gemini 3 Pro.

13h

Explore AI models more deeply with ChatPlayground AI for $79

AI tools push you into a single workflow or stop you mid-project with monthly limits. ChatPlayground AI is built for ...

Dark Reading

AI & the Death of Accuracy: What It Means for Zero-Trust

AI "model collapse," where LLMs over time train on more and more AI-generated data and become degraded, can introduce a host ...

23h (Opinion)

Constantly Re-Evaluating AI Progress

Sometimes it seems like that’s what it’s like to track AI progress in 2026. The year just started, but we’re in a different ...

Communications of the ACM (Opinion)

The Patterns of Research Excellence

The “one big breakthrough” pattern suggests that total citation counts can mislead. A researcher with one highly-cited paper ...

SpaceNews

NASA evaluation lauds quality of PlanetiQ radio occultation data

NASA’s assessment of PlanetiQ datasets lauded the precision of PlanetiQ’s total electron content observations as “best-in-class,” citing high signal-to-noise ratio and deep penetration in the lower ...

4d on MSN

How this 30-year-old Pokemon game is helping Google, OpenAI and Anthropic to evaluate AI models

Tech giants like Google, OpenAI, and Anthropic are leveraging 1990s Pokemon games to rigorously test their advanced AI models ...

EurekAlert!

Expert consensus outlines a standardized framework to evaluate clinical large language models

Large language models (LLMs) play a key role in advancing intelligent healthcare. While LLMs are increasingly applied in ...

Forbes

Augmenting The American Psychiatric Association App Evaluation Model To Include AI-Based Mental Health Apps

Forbes contributors publish independent expert analyses and insights. Dr. Lance B. Eliot is a world-renowned AI scientist and consultant. In today’s column, I examine an existing formalized evaluation ...

Tech Xplore on MSN

AI models tested on Dungeons & Dragons to assess long-term decision-making

Large Language Models, like ChatGPT, are learning to play Dungeons & Dragons. The reason? Simulating and playing the popular ...

12d

LG AI Research leads Korea AI model review; Naver Cloud out

South Korea released first-stage evaluation results for its "independent AI foundation model" project, with LG AI Research ...
