DeepSeek Cuts AI Inference Costs Up to 60 Percent with New Sparse Attention Technology

By CTOL Editors - Lang Wang

The Price War: DeepSeek’s Radical Bet on Cheaper AI

DeepSeek has again slashed AI inference costs, this time by up to 75% on output tokens, forcing rivals to rethink their strategies and possibly opening the door to wider access to advanced technology.


BEIJING — Building AI has always come with a painful trade-off: more power means more expense. But the new release from China’s DeepSeek has just broken that rule once again, and the ripple effects could reshape the entire industry.

Just before China’s national holiday—timing that has become an inside joke among engineers watching the company’s relentless release cycle—DeepSeek dropped its latest model, V3.2-Exp. Unlike past upgrades, this one doesn’t claim to outsmart its predecessor. Instead, it takes a different gamble: delivering roughly the same performance at a fraction of the cost.

The savings are dramatic. For a 128,000-token context—about the size of a full novel—the system now processes input for just $0.28 per million tokens on a cache miss, down from $0.56, and $0.028 on a cache hit, down from $0.07, a 50 to 60 percent price cut. Generating output, which usually costs far more, has fallen off a cliff: from $1.68 to just $0.42 per million tokens. The company's API pricing reflects this reality, with input costs cut in half and output costs reduced by three-quarters.
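To see what those numbers mean for a single long-context request, here is a minimal back-of-the-envelope calculation in Python. The prices come from the figures above; the request shape (128,000 input tokens, 2,000 output tokens, no prompt-cache hits) is an illustrative assumption.

```python
# Back-of-the-envelope cost comparison for one long-context request.
# Prices are USD per million tokens, taken from the article; the request
# shape and cache-hit ratio are illustrative assumptions.

OLD = {"input_miss": 0.56, "input_hit": 0.07, "output": 1.68}
NEW = {"input_miss": 0.28, "input_hit": 0.028, "output": 0.42}

def request_cost(prices, input_tokens=128_000, output_tokens=2_000, hit_ratio=0.0):
    """Dollar cost of one request with a given prompt-cache hit ratio."""
    hits = input_tokens * hit_ratio
    misses = input_tokens - hits
    return (misses * prices["input_miss"]
            + hits * prices["input_hit"]
            + output_tokens * prices["output"]) / 1_000_000

old_cost = request_cost(OLD)
new_cost = request_cost(NEW)
print(f"old: ${old_cost:.4f}  new: ${new_cost:.4f}  saving: {1 - new_cost / old_cost:.0%}")
```

Raising the hit_ratio in the sketch shows how prompt caching changes the blend, since cached input carries the deepest per-token discount.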

“This is price-butcher level,” one engineer joked. Others predicted the move would corner competitors and potentially wipe out weaker labs unable to match the economics.



The Clever Trick Behind It

At the heart of DeepSeek’s new system lies a deceptively simple idea: not every word in a giant document needs to pay attention to every other word.

Think of a student writing a term paper. If they had to re-read the entire textbook every time they wrote a sentence, the process would take forever. That’s how traditional LLMs behave—efficient enough for short text, but absurdly costly for the massive documents today’s applications demand.

DeepSeek’s fix, called DeepSeek Sparse Attention, introduces a clever shortcut. A “Lightning Indexer” quickly scans all previous tokens and scores their importance. Then, instead of crunching through the entire context, the model only applies full attention to the top 2,048 most relevant tokens.

The brilliance lies in how lightweight this indexer is. It uses simple math operations—things like ReLU instead of heavy exponential functions—and runs on low-precision FP8 arithmetic. The result: every token gets touched, but the heavy lifting is saved for the most useful ones.
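To make the mechanism concrete, here is a toy sketch in Python (NumPy), not DeepSeek's implementation: a cheap ReLU-based scorer stands in for the Lightning Indexer, ranks every past token, and full attention then runs only over the top-k survivors. The scorer weights, dimensions, and the use of regular floating point instead of FP8 are all simplifications.

```python
import numpy as np

def lightning_indexer_scores(query, keys, w):
    """Toy stand-in for the indexer: a cheap ReLU-based relevance score
    for every past token (the real indexer is a small learned module run in FP8)."""
    return np.maximum(keys @ w @ query, 0.0)  # ReLU instead of an exponential

def sparse_attention(query, keys, values, w, k=2048):
    """Run full attention only over the k highest-scoring past tokens."""
    scores = lightning_indexer_scores(query, keys, w)
    top = np.argsort(scores)[-min(k, len(keys)):]            # indices of the top-k tokens
    logits = keys[top] @ query / np.sqrt(query.shape[-1])    # attention scores, top-k only
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[top]

# Usage: 10,000 cached tokens, but only 2,048 receive full attention.
d = 64
rng = np.random.default_rng(0)
keys, values = rng.normal(size=(10_000, d)), rng.normal(size=(10_000, d))
query, w = rng.normal(size=d), rng.normal(size=(d, d))
print(sparse_attention(query, keys, values, w).shape)  # (64,)
```

The point of the design is that the indexer spends one cheap score per token, while the expensive softmax attention only ever touches 2,048 of them.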

“It’s like asking a librarian which chapters matter instead of reading the whole book cover to cover,” explained one researcher who studied the architecture. “The librarian isn’t perfect, but good enough to save a mountain of time.”

This change makes the expensive part of the computation grow roughly linearly with context length instead of quadratically. Suddenly, tasks that once seemed too costly—analyzing entire codebases, legal briefs, or scientific papers—look financially realistic.
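In rough numbers, ignoring the indexer's own linear-cost pass and all constant factors, the gap between the two scaling regimes looks like this:

```python
# Rough attention-cost comparison: dense attention scores every token pair,
# sparse attention scores only the top-k tokens per position (k = 2048 here).
K = 2_048
for context in (8_000, 32_000, 128_000):
    dense = context * context           # ~L^2 pairwise interactions
    sparse = context * min(K, context)  # ~L*k once the indexer has picked the top-k
    print(f"{context:>7} tokens: dense/sparse cost ratio ~ {dense / sparse:.0f}x")
```

At 128,000 tokens, the dense computation does roughly 60 times more pairwise work than the top-2,048 sparse version.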


Training the Model to Spot What Counts

DeepSeek didn’t just throw this indexer into the wild. They taught it what mattered.

First, they froze their existing model and ran it with full attention to generate “gold standard” importance scores. The indexer learned by imitating those scores, effectively borrowing wisdom from the heavyweight system. Only after it mastered the basics did DeepSeek activate the sparse attention setup and train everything together.
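In outline, that warm-up stage resembles a distillation step: the frozen model's full-attention weights act as targets, and the indexer is trained to match their distribution. The PyTorch sketch below illustrates the idea with a KL-divergence loss; the exact objective, shapes, and names are assumptions, not DeepSeek's published recipe.

```python
import torch
import torch.nn.functional as F

def indexer_warmup_loss(dense_attn, indexer_scores):
    """KL divergence between the frozen dense model's attention distribution
    (the 'gold standard' importance of each past token) and the lightweight
    indexer's predicted distribution. Only the indexer receives gradients."""
    target = dense_attn.detach()                      # frozen dense attention, no gradient
    log_pred = F.log_softmax(indexer_scores, dim=-1)  # indexer scores -> distribution
    return F.kl_div(log_pred, target, reduction="batchmean")

# Toy usage: a batch of 4 queries over 1,000 past tokens.
dense_attn = torch.softmax(torch.randn(4, 1000), dim=-1)    # stands in for full-attention weights
indexer_scores = torch.randn(4, 1000, requires_grad=True)   # stands in for indexer output
loss = indexer_warmup_loss(dense_attn, indexer_scores)
loss.backward()
print(loss.item())
```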

“You can’t build something like this from scratch easily,” an engineer wrote in one technical thread. “Using a dense model to ‘teach’ a sparse one works incredibly well—and others will copy this playbook.”


Does It Perform as Well?

DeepSeek insists the new system holds its ground. Benchmark tests across reasoning, coding, and multi-agent tasks show near parity, with minor back-and-forth wins.

Still, skeptics remain. One in-depth reviewer praised the cost savings but reported noticeable trade-offs: slightly weaker performance on multi-step reasoning, less reliable math, and a tendency to give up on hard problems by leaning on shortcuts.

“Ninety percent of the performance for 25 percent of the price,” the reviewer wrote. “If cost matters, that’s a great deal.”

This raises a bigger question: now that models already brush against human-level performance in many areas, is the next frontier raw capability—or efficiency? DeepSeek is clearly betting on the latter.


The Open Debate

The release sparked feverish discussion among engineers. Some argue that even lightweight scoring of every token still adds overhead. Others wonder why DeepSeek didn't mix sparse and full attention across layers, blending accuracy and efficiency.

There’s also the practical matter of how this approach plays with modern AI infrastructure—batching, GPU quirks, paged attention. DeepSeek tried to smooth the path by open-sourcing its prototyping code and high-performance CUDA kernels, and the community has already adapted it for Huawei’s Ascend NPUs and other domestic chips. The immediate multi-vendor support looks less like coincidence and more like strategy, especially as China seeks independence from foreign AI hardware.

For now, DeepSeek is keeping both versions live until October 15, 2025, giving developers a chance to A/B test for themselves.
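For developers who want to run that comparison, the API is OpenAI-compatible, so an A/B harness can stay tiny. The sketch below assumes the standard endpoint now serves V3.2-Exp and uses a placeholder URL for the temporary endpoint that keeps the older model live; check DeepSeek's release notes for the real address and model names.

```python
from openai import OpenAI  # the DeepSeek API is OpenAI-compatible

# Two clients: the default endpoint now serves V3.2-Exp; the older model stays
# reachable until October 15, 2025 at a temporary endpoint (placeholder URL below,
# substitute the one from DeepSeek's announcement).
v32 = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")
v31 = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com/OLD_ENDPOINT")  # placeholder

prompt = open("long_codebase_summary.txt").read()  # any long-context workload you care about

for name, client in [("V3.2-Exp", v32), ("V3.1", v31)]:
    reply = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
    )
    print(name, reply.usage.total_tokens, reply.choices[0].message.content[:200])
```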


Bigger Picture

The release isn’t happening in a vacuum. With Western export controls choking access to top-tier AI chips, Chinese firms must squeeze more performance out of what they have. DeepSeek’s model proves that clever algorithms can partly offset hardware disadvantages.

For businesses, this shift is huge. AI assistants that need to remember sprawling conversations, code helpers that must read entire repositories, and document analyzers for lengthy reports—all of these suddenly become affordable at scale.

“The agent era still needs more speed,” one developer commented, summing up the mood. Lower cost unlocks fresh possibilities, but it also shows how far the technology still has to go.

Observers describe DeepSeek as steady, understated, and genuinely innovative—far from flashy, but consistent in delivering breakthroughs. Their habit of launching major releases right before holidays has even become a running gag: engineers joke the company “ends vacations one release at a time.”


What’s Next?

For developers working with long contexts—32,000 to 128,000 tokens—the message is clear: test V3.2-Exp now. The potential savings are too big to ignore, even with a few rough edges.

For the industry, DeepSeek’s experiment may mark a turning point. If sparse attention proves competitive, other labs will face a tough choice: embrace the method or slash prices. Either way, the assumption that inference costs are fixed just got shattered.

And for the wider world, cheaper inference could matter as much as smarter models. If smaller companies and individual developers can finally afford to build at scale, the pace of innovation could quicken in unexpected directions.
