Together AI Kernels Team Achieves 3.6x Performance Gains on NVIDIA Hardware

2026/04/02 03:17

Timothy Morano Apr 01, 2026 19:17

Together AI's kernel research team delivers major GPU optimization breakthroughs, cutting inference latency from 281ms to 77ms for enterprise AI deployments.

The team behind FlashAttention has quietly become one of the most consequential groups in AI infrastructure. Together AI's kernel research unit, now about 15 engineers strong, is solving a problem most people don't even know exists: the massive performance gap between AI models and the hardware running them.

Their latest win? Taking a voice AI company's time-to-first-token from 281ms down to 77ms—a 3.6x improvement that translated to 7.2x better unit economics.

The Hidden Bottleneck

Here's what most AI discourse misses: having great models and expensive GPUs doesn't guarantee performance. The bottleneck sits in between—the kernel layer that translates mathematical operations into actual silicon instructions.

"The gap between what researchers design and what actually runs fast on hardware is vast," explains Dan Fu, who leads a parallel research lab at UCSD. Get kernels right and you unlock hardware's full potential. Get them wrong and your expensive GPUs sit partially idle.

For companies building AI-native products, this isn't academic. When inference costs run 2x higher than necessary, or when latency breaks the user experience, kernel optimization becomes existential.

One Week Versus One Year

The team's capabilities showed clearly when NVIDIA's Blackwell GPUs arrived in March 2025. NVIDIA had spent a year with dozens of engineers optimizing kernels for the new architecture. Together AI had a week.

Their secret weapon: ThunderKittens, a library developed with Stanford researchers that reduces kernel code from 1,000+ lines of CUDA to roughly 100-200 lines. The abstraction layer is built around NVIDIA's tensor cores, the specialized matrix multiplication units on modern GPUs.

Within seven days of hardware access, the team had some of the fastest FP4 and FP8 GEMM kernels available for Blackwell, achieving up to 2x speedups over cuBLAS on H100s.

Real-World Impact

The voice AI case study illustrates what this means in production. The customer had a hard constraint: time-to-first-64-tokens above roughly 100ms breaks conversational flow. Their B200 deployment was hitting 281ms.

Together's team hand-optimized a "Megakernel" implementation—running an entire model in a single kernel, targeting the HBM bandwidth ceiling of NVIDIA H100s. Results on Llama-3.2-1B: 77ms. On Qwen 2.5 1.5B: 127ms, down from 292ms.
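The reported numbers and the "bandwidth ceiling" framing can be sanity-checked with some quick arithmetic. The sketch below computes the speedups from the figures quoted above, then estimates a rough lower bound on decode time under the assumption that every generated token has to stream all model weights from HBM once. The parameter count (about 1.24 billion for Llama-3.2-1B), the 2 bytes per weight (16-bit weights), and the 3.35 TB/s H100 SXM bandwidth are assumptions brought in for illustration, not figures from the article, and the estimate ignores KV-cache traffic, activations, and prefill.

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of old latency to new latency."""
    return before_ms / after_ms


def weight_streaming_floor_ms(params: float, bytes_per_param: float,
                              bandwidth_bytes_per_s: float, tokens: int) -> float:
    """Lower bound on decode time if each token reads every weight from HBM once."""
    bytes_per_token = params * bytes_per_param
    seconds = tokens * bytes_per_token / bandwidth_bytes_per_s
    return seconds * 1e3


# Latencies quoted in the article.
LLAMA_SPEEDUP = speedup(281, 77)    # Llama-3.2-1B
QWEN_SPEEDUP = speedup(292, 127)    # Qwen 2.5 1.5B

# Assumptions for illustration only (not stated in the article).
ASSUMED_PARAMS = 1.24e9             # approximate Llama-3.2-1B parameter count
ASSUMED_BYTES_PER_PARAM = 2         # 16-bit weights
ASSUMED_HBM_BANDWIDTH = 3.35e12     # bytes/s, H100 SXM peak HBM bandwidth

FLOOR_64_TOKENS_MS = weight_streaming_floor_ms(
    ASSUMED_PARAMS, ASSUMED_BYTES_PER_PARAM, ASSUMED_HBM_BANDWIDTH, tokens=64
)

if __name__ == "__main__":
    print(f"Llama-3.2-1B speedup: {LLAMA_SPEEDUP:.2f}x")
    print(f"Qwen 2.5 1.5B speedup: {QWEN_SPEEDUP:.2f}x")
    print(f"Weight-streaming floor for 64 tokens: {FLOOR_64_TOKENS_MS:.1f} ms")
```

Under these assumptions the floor for 64 tokens comes out at roughly 47 ms, so if the 77ms figure is read as time to first 64 tokens, it sits within about a factor of two of what memory bandwidth alone would allow. The Qwen result works out to roughly a 2.3x speedup.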

The approach traces back to FlashAttention's original insight. That Memorial Day 2022 paper proved the AI establishment wrong about attention being fully optimized. By applying database systems principles—data locality, memory hierarchies—to transformer attention, the team achieved 2-3x speedups where previous sparsity methods showed only 10% real gains.
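The data-locality idea can be illustrated with a toy example. The sketch below is plain Python, not the FlashAttention kernel: it computes attention for a single query by walking over keys and values one block at a time, keeping only a running maximum, a running normalizer, and a running weighted sum (the "online softmax" trick), so the full row of attention scores is never held at once. The function names and the block size are hypothetical choices for this illustration.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values):
    """Standard softmax attention for one query; materializes every score."""
    scores = [_dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_blockwise(q, keys, values, block_size=2):
    """Same result, computed one block of keys/values at a time."""
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax normalizer, relative to running_max
    acc = [0.0] * dim         # unnormalized weighted sum of values

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [_dot(q, k) for k in k_blk]

        # Rescale earlier partial results to the new maximum.
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= scale
        acc = [a * scale for a in acc]

        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max

    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [0.5, -1.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(attention_reference(q, keys, values))
    print(attention_blockwise(q, keys, values, block_size=2))
```

On a GPU, the point of working block by block is that each block can stay in fast on-chip memory while it is processed, which cuts traffic to the much slower HBM. The Python version only shows that the blockwise computation gives the same answer as the standard one.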

Academic-Industry Pipeline

The team operates through an unusual model. Dan Fu runs his UCSD lab on higher-risk fundamental research. Together AI co-founder Tri Dao is at Princeton. Simran Arora is at Caltech. Ideas get de-risked in academia, then productionized at Together AI. PhD students join the company. Interns work on longer-term research in academic labs.

This produces engineers who bridge theory and production—people who, as Fu puts it, "lose sleep over memory access patterns" and "find beauty in data flow diagrams."

The work isn't glamorous. No announcements when a kernel optimization lands. Just faster training times, lower costs, higher throughput. But these margins determine whether AI-native products feel instant or sluggish, whether unit economics work or don't, whether companies scale to millions of users or plateau at thousands.

For enterprise AI deployments where every millisecond matters—and every percentage point of efficiency translates to significant cost savings—this invisible infrastructure layer may be where the real competitive advantage lies.
