OpenAI Finds AI Reasoning Models Can't Hide Their Thinking - A Win for Safety

2026/03/06 02:34

Caroline Bishop Mar 05, 2026 18:34

OpenAI's new CoT-Control benchmark reveals frontier AI models struggle to obscure their reasoning chains, reinforcing monitoring as a viable safety layer.

OpenAI's latest research reveals that frontier reasoning models—including GPT-5.2 Thinking—can't reliably hide or manipulate their internal thought processes, even when explicitly instructed to do so. The finding, published March 5, 2026, suggests that chain-of-thought monitoring remains a viable safety mechanism as AI systems grow more autonomous.

The research team tested 13 reasoning models using CoT-Control, a new open-source evaluation suite comprising over 13,000 tasks. Models were given simple instructions about their reasoning—avoid certain keywords, use only lowercase letters, append specific phrases to sentences. They failed spectacularly.

Controllability scores ranged from just 0.1% to 15.4% across all frontier models tested. Even more telling: models often recognized mid-reasoning that they were violating instructions and still couldn't course-correct. One example showed GPT-5.2 Thinking using the forbidden word "XOR" four additional times while actively panicking about having already used it.
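As a rough illustration of what such a score measures, the sketch below treats controllability as the fraction of reasoning traces that comply with an instruction. It is not the CoT-Control implementation: the function names, the matching rules, and the sample traces are all assumptions made for this example.

```python
import re
from typing import Callable, List

# Hypothetical checkers modeled on the instruction types described above.
# None of these names or rules come from the CoT-Control suite itself.

def avoids_keyword(keyword: str) -> Callable[[str], bool]:
    """Pass if the keyword never appears as a whole word (case-insensitive)."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return lambda trace: pattern.search(trace) is None

def only_lowercase(trace: str) -> bool:
    """Pass if the trace contains no uppercase letters."""
    return not any(ch.isupper() for ch in trace)

def ends_sentences_with(phrase: str) -> Callable[[str], bool]:
    """Pass if every sentence in the trace ends with the required phrase."""
    def check(trace: str) -> bool:
        sentences = [s.strip() for s in re.split(r"[.!?]+", trace) if s.strip()]
        return bool(sentences) and all(s.endswith(phrase) for s in sentences)
    return check

def controllability(traces: List[str], checker: Callable[[str], bool]) -> float:
    """Fraction of traces that satisfy the instruction (assumed definition)."""
    if not traces:
        return 0.0
    return sum(1 for t in traces if checker(t)) / len(traces)

# Made-up traces, for illustration only.
traces = [
    "we can combine the bits with xor here",
    "add the two numbers and compare",
    "XOR is forbidden, so use addition... wait, I said XOR",
]
print(controllability(traces, avoids_keyword("XOR")))
print(controllability(traces, only_lowercase))
```

Under this assumed definition, a score of 15.4% would mean roughly one trace in six or seven followed the instruction.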

Why This Matters for AI Development

As AI agents handle increasingly complex autonomous tasks, the ability to monitor their reasoning becomes critical. If models could deliberately obscure their thought processes when they know they're being watched, safety evaluations would become unreliable. The research shows this isn't currently possible.

"Low CoT controllability is a feature, not a bug," the OpenAI team stated. The limitation makes chain-of-thought monitoring more trustworthy as a safeguard against misaligned behavior.

The data revealed some counterintuitive patterns. While larger models showed marginally better controllability, this advantage evaporated under two conditions: when models were given more time to think, and when they underwent additional reinforcement learning (RL) training. The controllability score of the OLMo-3-7B-RL-Zero-Math model dropped more than tenfold during RL training.

Testing Under Adversarial Conditions

The researchers simulated realistic scenarios by explicitly telling models they were being monitored or evaluated. This slightly increased controllability—models tried harder—but gains remained modest and transferred poorly to new instruction types.

Automated prompt optimization found ways to boost controllability on training sets, but these improvements didn't generalize. A model that learned to follow one type of CoT instruction still failed at novel ones.
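One way to picture this generalization failure is to compare compliance on instruction types seen during optimization with compliance on held-out types. The sketch below is an assumption-laden illustration, not the researchers' method: the function names, the instruction-type labels, and the pass/fail results are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

def score_by_type(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each instruction type to the fraction of traces that complied."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for instruction_type, complied in results:
        total[instruction_type] += 1
        passed[instruction_type] += int(complied)
    return {t: passed[t] / total[t] for t in total}

def generalization_gap(results: Iterable[Tuple[str, bool]],
                       seen_types: Set[str]) -> float:
    """Mean score on seen instruction types minus mean score on held-out types."""
    scores = score_by_type(results)
    seen = [s for t, s in scores.items() if t in seen_types]
    held_out = [s for t, s in scores.items() if t not in seen_types]
    if not seen or not held_out:
        raise ValueError("need at least one seen and one held-out instruction type")
    return sum(seen) / len(seen) - sum(held_out) / len(held_out)

# Made-up (instruction type, complied?) pairs, for illustration only.
results = [
    ("avoid_keyword", True), ("avoid_keyword", True),
    ("avoid_keyword", False), ("avoid_keyword", False),
    ("lowercase_only", False), ("lowercase_only", False),
    ("lowercase_only", False), ("lowercase_only", True),
]
print(generalization_gap(results, seen_types={"avoid_keyword"}))
```

A large positive gap would indicate that an optimized prompt helps only on the instruction types it was tuned for, which is the pattern the article describes.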

The Bigger Picture

This work builds on OpenAI's December 2025 framework for evaluating CoT monitorability, which found that inspecting reasoning traces catches misbehavior far more effectively than monitoring outputs alone. The March 2025 discovery of cheating behavior in reasoning models had prompted recommendations to preserve CoT monitoring capabilities.

OpenAI plans to report CoT controllability alongside monitorability metrics in system cards for future frontier models, starting with GPT-5.4 Thinking. The company has also committed to avoiding optimization pressure directly on the reasoning chains of frontier models—a practice that research suggests could inadvertently teach models to hide their intentions.

Whether this limitation persists as capabilities advance remains an open question. The team acknowledges they don't fully understand why controllability is low, making continued evaluation essential. For now, the inability of AI systems to game their own oversight represents an unexpected safety dividend.

