Is OpenAI GPT-5.2 actually better than Google Gemini 3 Pro? If you strip away the extra "thinking" time used in the benchmarks, the gap disappears. We dug into

OpenAI GPT-5.2: The “Cheating” Controversy

2025/12/15 12:58

OpenAI recently released GPT-5.2, which posts superior benchmark results. However, online chatter suggests that OpenAI may have used more tokens and compute for the benchmark runs than is typical, which some consider “cheating” on the tests. If everything is equal, is GPT-5.2 actually on par with Gemini 3 Pro? Here we try to find out.

The "Cheating" Controversy: Compute & Tokens

The core of the controversy lies in inference-time compute. "Cheating" in this context refers to OpenAI using a configuration for benchmarks that is significantly more powerful (and expensive) than what is available to standard users or what is typical for a "fair" comparison.

  • "xhigh" vs. "Medium" Effort: Reports indicate that OpenAI's published benchmark results were generated using an "xhigh" reasoning effort setting. This mode allows the model to generate a massive number of internal "thought" tokens (reasoning steps) before producing an answer.
  • The Issue: Standard ChatGPT Plus users reportedly only have access to "medium" or "high" effort modes. The "xhigh" mode used for benchmarks consumes vastly more tokens and compute, effectively brute-forcing higher scores by allowing the model to "think" for much longer (sometimes 30-50 minutes for complex tasks) than a standard interaction allows.
  • Inference Scaling: This leverages a concept where allowing a model to generate more tokens during inference (test time) improves performance significantly. Critics argue that comparing GPT-5.2's "xhigh" scores against Gemini 3 Pro's standard outputs is misleading because it compares a "maximum compute" scenario against a "standard usage" scenario.
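To make the cost side of this argument concrete, here is a minimal Python sketch of how per-request cost grows with reasoning effort. Only the $1.75 per 1M input tokens figure comes from the sources cited in this article. The output price, the token counts per effort level, and the assumption that hidden reasoning tokens are billed at the output rate are hypothetical placeholders for illustration, not measured values.

```python
# Input price as cited in the references (USD per 1M input tokens).
INPUT_PRICE_PER_M = 1.75
# HYPOTHETICAL placeholder: the article does not state an output price.
OUTPUT_PRICE_PER_M = 14.00

# HYPOTHETICAL reasoning-token budgets per effort level, for illustration only.
HYPOTHETICAL_REASONING_TOKENS = {
    "medium": 2_000,
    "high": 8_000,
    "xhigh": 40_000,
}


def request_cost(input_tokens, reasoning_tokens, answer_tokens):
    """Estimate the USD cost of one request.

    Assumes hidden reasoning tokens are billed at the output rate,
    alongside the visible answer tokens.
    """
    billed_output = reasoning_tokens + answer_tokens
    total = input_tokens * INPUT_PRICE_PER_M + billed_output * OUTPUT_PRICE_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Same prompt and same answer length; only the reasoning effort changes.
    for effort, reasoning in HYPOTHETICAL_REASONING_TOKENS.items():
        cost = request_cost(input_tokens=1_000,
                            reasoning_tokens=reasoning,
                            answer_tokens=500)
        print(f"{effort}: ${cost:.4f} per request")
```

With the prompt and answer held fixed, the reasoning budget is the only term that changes, which is why a benchmark run at "xhigh" is not directly comparable in cost or latency to one run at a lower setting.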

Benchmark Comparison (GPT-5.2 vs. Gemini 3 Pro)

When the massive compute boost is factored in, GPT-5.2 does post higher scores, but the gap narrows or reverses when conditions are scrutinized.

| Benchmark | GPT-5.2 (Thinking/Pro) | Gemini 3 Pro | Context |
|----|----|----|----|
| ARC-AGI-2 | 52.9% | ~31.1% | Measures abstract reasoning. GPT-5.2's score is heavily reliant on the "Thinking" process. |
| GPQA Diamond | 92.4% | 91.9% | Graduate-level science. The scores are effectively tied (within margin of error). |
| SWE-Bench Pro | 55.6% | N/A | Real-world software engineering. GPT-5.2 sets a new SOTA here. |
| SWE-Bench Verified | 80.0% | 76.2% | A more established coding benchmark. The models are roughly comparable here. |
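The size of each gap is easier to judge in percentage points. The short Python sketch below computes GPT-5.2 minus Gemini 3 Pro for the rows above. The scores are copied from the table, the Gemini 3 Pro ARC-AGI-2 figure is approximate in the source, and SWE-Bench Pro is left out because no Gemini 3 Pro result is listed. The names `scores` and `gaps` are illustrative.

```python
# Scores in percent, copied from the comparison table: (GPT-5.2, Gemini 3 Pro).
# SWE-Bench Pro is omitted because the table lists no Gemini 3 Pro result.
scores = {
    "ARC-AGI-2": (52.9, 31.1),          # Gemini figure is approximate (~31.1%)
    "GPQA Diamond": (92.4, 91.9),
    "SWE-Bench Verified": (80.0, 76.2),
}


def gaps(table):
    """Return GPT-5.2 minus Gemini 3 Pro, in percentage points, per benchmark."""
    return {name: round(gpt - gemini, 1) for name, (gpt, gemini) in table.items()}


if __name__ == "__main__":
    for name, gap in gaps(scores).items():
        print(f"{name}: {gap:+.1f} points")
```

Only ARC-AGI-2 shows a double-digit gap (21.8 points). The GPQA Diamond and SWE-Bench Verified differences are 0.5 and 3.8 points respectively.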

  • Private Benchmarks: Some independent evaluations (e.g., restricted "private benchmarks" mentioned in discussions) suggest that Gemini 3 Pro actually outperforms GPT-5.2 in areas like creative writing, philosophy, and tool use when the "gaming" of public benchmarks is removed.

Are They "On Par"?

Yes, and Gemini 3 Pro may even be superior in "base" capability.

If "everything is equal" (meaning both models are restricted to the same amount of inference compute, or thinking time), the general consensus is that they are highly comparable, with different strengths:

  • Gemini 3 Pro Advantages:
      • Base Intelligence: Appears to have stronger fundamental capability in long-context understanding (massive context window), theoretical reasoning, and creative tasks without needing excessive "thinking" time.
      • Cost Efficiency: For many tasks, it achieves similar results with less compute (and thus lower cost/latency).
  • GPT-5.2 Advantages:
      • Agentic Workflow: With the "Thinking" mode enabled (high compute), it excels at complex, multi-step agents and coding tasks (SWE-Bench). It is "tuned" effectively to use extra compute to solve harder problems.

Conclusions

The claim that they are "on par" is accurate. If you strip away OpenAI's "xhigh" compute advantage used in benchmarks, Gemini 3 Pro is likely equal or slightly ahead in raw model intelligence. GPT-5.2's "superiority" in benchmarks largely comes from its ability to spend significantly more time and compute processing a single prompt.

The sources below cover the GPT-5.2 release, the Gemini 3 Pro comparison, and the associated benchmarking controversy.

References

1. Official Release Announcements

OpenAI – System Card Update

  • openai.com/index/gpt-5-system-card-update-gpt-5-2/

Google – The Gemini 3 Era

  • blog.google/products/gemini/gemini-3/

2. Benchmark Performance & Technical Analysis

R&D World – Comparative Analysis

  • Title: "How GPT-5.2 stacks up against Gemini 3.0 and Claude Opus 4.5"
  • Verified Details: Validates the 52.9% score on ARC-AGI-2 (Thinking mode) vs. Gemini 3 Pro's ~31.1%. Confirms GPT-5.2's lead in abstract reasoning is heavily tied to the "Thinking" process.
  • Source: rdworldonline.com/how-gpt-5-2-stacks-up

Vellum AI – Deep Dive

  • Title: "GPT-5.2 Benchmarks"
  • Verified Details: Verifies the 92.4% score on GPQA Diamond, noting that it is effectively tied with Gemini 3 Pro (91.9%, within the margin of error) but marketed as a "win" by OpenAI.
  • Source: vellum.ai/blog/gpt-5-2-benchmarks

Simon Willison’s Weblog

  • Title: "GPT-5.2"
  • Verified Details: Technical breakdown of the API pricing ($1.75/1M input) and the distinction between the "Instant" and "Thinking" API endpoints.
  • Source: simonwillison.net/2025/Dec/11/gpt-52/

3. The "Cheating" & Compute Controversy

Reddit (r/LocalLLaMA & r/Singularity)

  • Threads: "GPT-5.2 Thinking evals" & "OpenAI drops GPT-5.2 'Code Red' vibes"
  • Verified Details: These community discussions are the primary source of the "cheating" allegations. Users identified that OpenAI's benchmarks used "xhigh" (extra high) reasoning effort—a setting that uses significantly more tokens and time than the "Medium" or "High" settings available to standard users or used in Gemini's standard benchmarks.
  • Source: reddit.com/r/singularity/comments/1pk4t5z/gpt52thinkingevals/
  • Source: reddit.com/r/ChatGPTCoding/comments/1pkq4mc/

InfoQ News

  • Title: "OpenAI's New GPT-5.1 Models are Faster and More Conversational" (Contextual coverage including 5.2)
  • Verified Details: Discusses the introduction of the "xhigh" reasoning effort level and the trade-offs between benchmark scores and actual user latency/cost.
  • Source: infoq.com/news/2025/12/openai-gpt-51/
