Why the US Government’s Claim About China’s AI Superiority Deserves Skepticism
The US government insists China’s best AI models trail behind American counterparts, but that claim rests on selective evidence and geopolitical self-interest. The National Institute of Standards and Technology (NIST) just evaluated China’s DeepSeek V4 Pro and declared it inferior, using a methodology that, according to Decrypt, critics say conveniently favored the US. When the stakes involve global tech dominance, leaning on opaque tests instead of open benchmarks undermines trust and clouds the real picture.
Objective, transparent benchmarking is the only way to cut through the spin. Anything less risks turning AI performance into a political talking point rather than a scientific measurement. With Washington and Beijing both racing for AI supremacy, the world deserves honest data, not selective validation.
How NIST’s CAISI Evaluation Methodology May Skew AI Model Comparisons
NIST’s new CAISI evaluation should have been a step toward clarity. Instead, it raises more questions than answers. The agency pitted DeepSeek V4 Pro against US models using private benchmarks—unpublished datasets and tasks that outsiders can’t scrutinize. Worse, NIST applied a cost-comparison filter that excluded every American model except OpenAI’s GPT-5.4 mini, a move that drastically narrows the competitive field.
Here’s why that matters: CAISI’s cost filter is supposed to ensure that the models being compared are similar in efficiency, but it also means the most powerful US models—like GPT-4o or Anthropic’s Claude 3—weren’t even considered. That’s like holding a footrace, inviting only one runner from a dominant team, and declaring victory if they finish ahead of a newcomer. The resulting headline—“China’s best lags behind”—sounds definitive, but the test is anything but.
Private benchmarks are another red flag. When results can’t be independently verified or replicated, trust evaporates. The AI community has rallied around open benchmarks—like MMLU or the Open LLM Leaderboard—precisely because they allow apples-to-apples comparisons. CAISI’s private tests may measure something, but no one outside NIST knows what, or whether those tests reflect real-world performance.
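To make the contrast concrete, here is a minimal sketch of what an open, MMLU-style multiple-choice evaluation looks like: the questions, answer key, and scoring code are all public, so anyone can re-run the comparison and get the same numbers. The two "models" are hypothetical stand-ins for real inference APIs, not any system named in this article.

```python
# Sketch of an open, reproducible multiple-choice benchmark.
# Everything needed to reproduce a score is published: the question
# set, the answer key, and the scoring function.

QUESTIONS = [
    {"q": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"q": "Capital of France?", "choices": ["Rome", "Madrid", "Paris", "Bonn"], "answer": "C"},
]

def accuracy(model, questions):
    """Exact-match accuracy over a fixed, published question set."""
    correct = sum(1 for item in questions if model(item) == item["answer"])
    return correct / len(questions)

# Hypothetical models: each maps a question dict to a choice letter.
model_a = lambda item: item["answer"]   # answers every question correctly
model_b = lambda item: "A"              # always guesses the first choice

print(accuracy(model_a, QUESTIONS))  # 1.0
print(accuracy(model_b, QUESTIONS))  # 0.0
```

With a private benchmark, the `QUESTIONS` list and the scoring rule are hidden, so a published accuracy number cannot be checked or compared across labs.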
The upshot: selective benchmarking can make solid models look weak and weak models look strong, depending on what gets measured and who gets excluded. If the goal is to inform policy and investment, that’s a dangerous game.
Expert Opinions Highlight the Complexity of Measuring AI Performance Across Borders
Top AI researchers aren’t buying the US government’s tidy narrative. Some point out that Chinese models often outperform Western ones on certain tasks—especially those involving Mandarin, local context, or government-approved datasets. Others highlight that technical “leadership” means little if models aren’t tested in the same environment, with the same constraints.
For example, training data access in China can be both broader and more restricted. Chinese firms sometimes enjoy access to state-sponsored data pools but face tighter censorship and regulatory red tape. US models often train on larger, more diverse corpora, but must navigate privacy lawsuits and content moderation debates. These differences make cross-border comparisons messy—and any ranking suspect.
Transparency is the only antidote to this confusion. OpenAI’s release of eval leaderboards, Stanford’s HELM project, and Hugging Face’s open evaluation suite all aim to standardize the rules of engagement. When models are tested on the same public datasets, with results published for all to see, the conversation shifts from national boasting to genuine progress. Until NIST and its counterparts adopt this approach, skepticism will remain the default.
Acknowledging the Counterargument: The US Perspective on AI Leadership and National Security
Washington has plenty of reason to trumpet American AI dominance. Leadership in AI isn’t just about bragging rights—it’s about military advantage, economic leverage, and the ability to set global standards. Public statements stressing US superiority serve a deterrent function, signaling to rivals that the US won’t cede technological ground.
There’s a logic to this posture. Downplaying US strengths could embolden adversaries or erode investor confidence. At the same time, the risk of underestimating foreign advances is real. Chinese models have improved at breakneck speed: DeepSeek’s predecessor barely registered on international leaderboards last year, but today’s version claims competitive performance on high-level reasoning and coding tasks. Dismissing these gains outright is shortsighted, especially when the stakes include national security and future economic growth.
Still, national security shouldn’t justify opaque or misleading evaluation practices. If the US is ahead, it should welcome fair competition—and substantiate its claims with evidence that stands up to scrutiny.
Demanding Greater Transparency and Balanced Evaluations to Foster Global AI Progress
The solution isn’t to stack the deck, but to level the playing field. Policymakers and research agencies on both sides of the Pacific should commit to transparent, standardized AI benchmarking. That means using open, audited datasets, publishing evaluation code, and inviting third-party audits.
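One simple auditability practice implied by "open, audited datasets" is to publish a cryptographic fingerprint of the exact evaluation set alongside the results, so third parties can verify that every lab tested on the same data. A minimal sketch, with a toy benchmark standing in for any real dataset:

```python
# Sketch: publish a SHA-256 fingerprint of the evaluation dataset
# with the results. Anyone holding the same data can recompute the
# hash and confirm the comparison used identical inputs.

import hashlib
import json

def dataset_fingerprint(records):
    """Deterministic SHA-256 over a canonical JSON serialization."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Toy benchmark records; a real release would hash the full dataset.
benchmark = [{"id": 1, "q": "2 + 2 = ?", "answer": "4"}]
print(dataset_fingerprint(benchmark))  # publish this hash with the scores
```

Canonical serialization (sorted keys, fixed separators) matters here: it guarantees the same records always produce the same hash, regardless of who computes it.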
Collaboration between international research groups would raise the bar—and lower the temperature. Shared benchmarks could expose weaknesses and drive improvement on both sides, while making it harder for any government to fudge the numbers for domestic audiences. The AI community has already built the tools; now, it’s up to regulators to use them.
The bigger risk isn’t that one side “wins” the AI race, but that both get trapped in a spiral of hype, secrecy, and mistrust. The world needs competition, not information warfare. If governments want to foster real progress, they should start by playing fair.
Impact Analysis
- Opaque benchmarking makes it hard to trust claims about AI leadership.
- Geopolitical interests may distort how AI performance is reported and compared.
- Transparent, open testing is crucial for meaningful global competition in AI.