44 bookmarks. First posted by sachaa 7 days ago.


quantum_neural 
3 days ago by zephyr777
A useful overview of the Chinese "DeepSeek" AI model(s) just days after they hit the news.
ai  DeepSeek  china  via:lobsters 
4 days ago by mcherm
ai  china  economics  business  development  blog-posts 
5 days ago by mikael
DeepSeek FAQ
Monday, January 27, 2025

It's Monday, January 27. Why haven't you written about DeepSeek yet?

I did! I wrote about R1 last Tuesday.

I totally forgot about that.

I take responsibility. I stand by the post, including the two biggest takeaways that I highlighted (emergent chain-of-thought via pure reinforcement learning, and the power of distillation), and I mentioned the low cost (which I expanded on in Sharp Tech) and chip ban implications, but those observations were too localized to the current state of the art in AI. What I totally failed to anticipate were the broader implications this news would have for the overall meta-discussion, particularly in terms of the U.S. and China.

Is there precedent for such a miss?

There is. In September 2023 Huawei announced the Mate 60 Pro with a SMIC-manufactured 7nm chip. The existence of this chip wasn't a surprise for those paying close attention: SMIC had made a 7nm chip a year earlier (the existence of which I had noted even earlier than that), and TSMC had shipped 7nm chips in volume using nothing but DUV lithography (later iterations of 7nm were the first to use EUV). Intel had also made 10nm (TSMC 7nm equivalent) chips years earlier using nothing but DUV, but couldn't do so with profitable yields; the idea that SMIC could ship 7nm chips using their existing equipment, particularly if they didn't care about yields, wasn't remotely surprising, to me anyways.

What I totally failed to anticipate was the overwrought reaction in Washington D.C. The dramatic expansion in the chip ban that culminated in the Biden administration transforming chip sales to a permission-based structure was downstream from people not understanding the intricacies of chip production, and being totally blindsided by the Huawei Mate 60 Pro. I get the sense that something similar has happened over the last 72 hours: the details of what DeepSeek has accomplished, and what they have not, are less important than the reaction and what that reaction says about people's pre-existing assumptions.

So what did DeepSeek announce?

The most proximate announcement to this weekend's meltdown was R1, a reasoning model that is similar to OpenAI's o1. However, many of the revelations that contributed to the meltdown, including DeepSeek's training costs, actually accompanied the V3 announcement over Christmas. Moreover, many of the breakthroughs that undergirded V3 were actually revealed with the release of the V2 model last year.

Is this model naming convention the greatest crime that OpenAI has committed?

Second greatest; we'll get to the greatest momentarily.

Let's work backwards: what was the V2 model, and why was it important?

The DeepSeek-V2 model introduced two important breakthroughs: DeepSeekMoE and DeepSeekMLA. The "MoE" in DeepSeekMoE refers to "mixture of experts". Some models, like GPT-3.5, activate the entire model during both training and inference; it turns out, however, that not every part of the model is necessary for the topic at hand. MoE splits the model into multiple "experts" and only activates the ones that are necessary; GPT-4 was a MoE model that was believed to have 16 experts with approximately 110 billion parameters each.

DeepSeekMoE, as implemented in V2, introduced important innovations on this concept, including differentiating between more finely-grained specialized experts, and shared experts with more generalized capabilities. Critically, DeepSeekMoE also introduced new approaches to load-balancing and routing during training; traditionally MoE increased communications overhead in training in exchange for efficient inference, but DeepSeek's approach made training more efficient as well.
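
To make that concrete, here is a minimal sketch of a mixture-of-experts layer in PyTorch. It is illustrative only, not DeepSeek's implementation: the class name, expert count, expert width, and top-k value are all invented for the example. The point is simply that the shared experts run for every token while only a couple of routed experts run per token.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SketchMoE(nn.Module):
        def __init__(self, d_model=512, n_routed=16, n_shared=2, top_k=2):
            super().__init__()
            def expert():
                return nn.Sequential(
                    nn.Linear(d_model, 4 * d_model), nn.GELU(),
                    nn.Linear(4 * d_model, d_model))
            # Fine-grained specialized experts: only top_k of these run for a given token.
            self.routed = nn.ModuleList(expert() for _ in range(n_routed))
            # Shared generalist experts: run for every token.
            self.shared = nn.ModuleList(expert() for _ in range(n_shared))
            self.router = nn.Linear(d_model, n_routed)   # per-token scores over routed experts
            self.top_k = top_k

        def forward(self, x):                            # x: (num_tokens, d_model)
            scores = F.softmax(self.router(x), dim=-1)
            weights, idx = scores.topk(self.top_k, dim=-1)
            shared_out = sum(e(x) for e in self.shared)  # always-active shared experts
            routed_out = []
            for t in range(x.shape[0]):                  # per-token loop for clarity, not speed
                y = torch.zeros_like(x[t])
                for w, i in zip(weights[t].tolist(), idx[t].tolist()):
                    y = y + w * self.routed[i](x[t])     # only the selected experts do any work
                routed_out.append(y)
            return shared_out + torch.stack(routed_out)

With 16 routed experts and a top-k of 2, a token touches only an eighth of the routed-expert parameters. DeepSeek's contribution in V2 was making this kind of routing cheap to train at scale, not the basic mechanism sketched here.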

DeepSeekMLA was an even bigger breakthrough. One of the biggest limitations on inference is the sheer amount of memory required: you need to load both the model itself and the entire context window into memory. Context windows are particularly expensive in terms of memory, as every token requires both a key and corresponding value; DeepSeekMLA, or multi-head latent attention, makes it possible to compress the key-value store, dramatically decreasing memory usage during inference.
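
To see why that matters, here is a back-of-envelope comparison of caching full per-head keys and values versus caching one small latent per token, which is the idea behind multi-head latent attention. Every dimension below is a made-up round number for illustration, not DeepSeek's actual configuration.

    n_layers  = 60        # transformer layers
    n_heads   = 128       # attention heads per layer
    d_head    = 128       # dimension per head
    d_latent  = 512       # compressed per-token latent cached per layer (the MLA idea)
    seq_len   = 128_000   # context window, in tokens
    bytes_per = 2         # 16-bit cache entries

    # Standard attention: cache a key AND a value vector per head, per layer, per token.
    kv_standard = seq_len * n_layers * n_heads * d_head * 2 * bytes_per
    # Latent-attention style: cache one small latent per layer per token and
    # reconstruct keys and values from it on the fly.
    kv_latent = seq_len * n_layers * d_latent * bytes_per

    print(f"full KV cache:   {kv_standard / 1e9:.0f} GB")   # ~503 GB
    print(f"latent KV cache: {kv_latent / 1e9:.0f} GB")     # ~8 GB

The model weights are a one-time cost, but the cache grows with every token of context and every concurrent user, which is why this kind of compression translates directly into cheaper inference.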

I'm not sure I understood any of that.

The key implications of these breakthroughs (and the part you need to understand) only became apparent with V3, which added a new approach to load balancing (further reducing communications overhead) and multi-token prediction in training (further densifying each training step, again reducing overhead): V3 was shockingly cheap to train. DeepSeek claimed the model training took 2,788 thousand H800 GPU hours, which, at a cost of $2/GPU hour, comes out to a mere $5.576 million.

That seems impossibly low.

DeepSeek is clear that these costs are only for the final training run, and exclude all other expenses; from the V3 paper:

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
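
Those figures are at least internally consistent; a quick pass over the numbers quoted above reproduces each of them:

    gpus            = 2048
    hours_per_T_tok = 180_000   # H800 GPU-hours per trillion training tokens, per the paper
    tokens_T        = 14.8      # trillions of tokens in the training set

    pretrain_hours = hours_per_T_tok * tokens_T          # 2,664,000 GPU-hours ("2664K")
    days_per_T_tok = hours_per_T_tok / gpus / 24         # ~3.7 days per trillion tokens
    pretrain_days  = pretrain_hours / gpus / 24          # ~54 days, i.e. "less than two months"
    total_hours    = pretrain_hours + 119_000 + 5_000    # add context extension and post-training
    total_cost     = total_hours * 2                     # at $2 per GPU-hour

    print(f"{pretrain_hours:,.0f} pre-training GPU-hours, {days_per_T_tok:.1f} days per trillion tokens")
    print(f"{pretrain_days:.0f} days of pre-training, {total_hours:,.0f} total GPU-hours, ${total_cost:,.0f}")
    # 2,664,000 pre-training GPU-hours, 3.7 days per trillion tokens
    # 54 days of pre-training, 2,788,000 total GPU-hours, $5,576,000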

So no, you can't replicate DeepSeek the company for $5.576 million.

I still don't believe that number.

Actually, the burden of proof is on the doubters, at least once you understand the V3 architecture. Remember that bit about DeepSeekMoE: V3 has 671 billion parameters, but only 37 billion parameters in the active experts are computed per token; this equates to 333.3 billion FLOPs of compute per token. Here I should mention another DeepSeek innovation: while parameters were stored with BF16 or FP32 precision, they were reduced to FP8 precision for calculations; 2048 H800 GPUs have a capacity of 3.97 exaFLOPS, i.e. 3.97 billion billion floating-point operations per second. The training set, meanwhile, consisted of 14.8 trillion tokens; once you do all of the math it becomes apparent that 2.8 million H800 hours is sufficient for training V3. Again, this was just the final run, not the total cost, but it's a plausible number.
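
If you would rather do that math than take it on faith, here is the back-of-envelope version, using only the figures above. The one added assumption is that real-world utilization of the cluster's peak FP8 throughput is well below 100 percent, which is normal for large training runs.

    flops_per_token   = 333.3e9      # compute per token with 37B active parameters, per the estimate above
    tokens            = 14.8e12      # training set size
    cluster_flops     = 3.97e18      # 2048 H800s at FP8 (3.97 exaFLOPS)
    claimed_gpu_hours = 2.788e6      # DeepSeek's reported figure for the final run
    gpus              = 2048

    total_flops     = flops_per_token * tokens                    # ~4.9e24 FLOPs for the whole run
    ideal_gpu_hours = total_flops / cluster_flops / 3600 * gpus   # if the cluster ran at 100% of peak

    print(f"GPU-hours at 100% of peak: {ideal_gpu_hours / 1e6:.2f}M")                      # 0.71M
    print(f"utilization implied by the claim: {ideal_gpu_hours / claimed_gpu_hours:.0%}")  # 25%

In other words, the claimed 2.788 million GPU-hours only requires the cluster to average roughly a quarter of its theoretical peak, which is why the number is plausible rather than miraculous.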

Scale AI CEO Alexandr Wang said they have 50,000 H100s.

I don't know where Wang got his information; I'm guessing he's referring to this November 2024 tweet from Dylan Patel, which says that DeepSeek had "over 50k Hopper GPUs". H800s, however, are Hopper GPUs; they just have much more constrained chip-to-chip interconnect bandwidth than H100s because of U.S. sanctions.

Here's the thing: a huge number of the innovations I explained above are about overcoming the lack of interconnect bandwidth implied in using H800s instead of H100s. Moreover, if you actually did the math on the previous question, you would realize that DeepSeek actually had an excess of compute; that's because DeepSeek programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is actually impossible to do in CUDA. DeepSeek engineers had to drop down to PTX, a low-level instruction set for Nvidia GPUs that is basically like assembly language. This is an insane level of optimization that only makes sense if you are using H800s.

Meanwhile, DeepSeek also makes their models available for inference: that requires a whole bunch of GPUs above and beyond whatever was used for training.

So was this a violation of the chip ban?

Nope. H100s were prohibited by the chip ban, but not H800s. Everyone assumed that training leading edge models required more interchip bandwidth than the H800 offers, but that constraint is exactly what DeepSeek optimized both their model structure and infrastructure around.

Again, just to emphasize this point, all of the decisions DeepSeek made in the design of this model only make sense if you are constrained to the H800; if DeepSeek had access to H100s, they probably would have used a larger training cluster with far fewer optimizations specifically focused on overcoming the lack of bandwidth.

So V3 is a leading edge model?

It's definitely competitive with OpenAI's 4o and Anthropic's Sonnet-3.5, and appears to be better than Meta's biggest Llama model. What does seem likely is that DeepSeek was able to distill those models to give V3 high quality tokens to train on.

What is distillation?

Distillation is a means of extracting understanding from another model; you can send inputs to the teacher model and record the outputs, and use that to train the student model. This is how you get models like GPT-4 Turbo from GPT-4. Distillation is easier for a company to do on its own models, because it has full access, but you can still do distillation in a somewhat more unwieldy way via API, or even, if you get creative, via chat clients.

Distillation obviously violates the terms of service of various models, but the only way to stop it is to actually cut off access, via IP banning, rate limiting, etc. It's assumed to be widespread in terms of model training, and is why there are an ever-increasing number of models converging on GPT-4o quality. This doesn't mean that we know for a fact that DeepSeek distilled 4o or Claude, but frankly, it would be odd if they didn't.
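
To demystify the mechanics, here is what API-based distillation looks like in its simplest form: collect the teacher's answers to a pile of prompts, then use those pairs as supervised fine-tuning data for the student. This is a generic sketch, not DeepSeek's pipeline; call_teacher is a placeholder for whatever chat or completions API you have access to, and, as noted above, doing this against a commercial model typically violates its terms of service.

    import json

    def call_teacher(prompt: str) -> str:
        # Placeholder for a call to the teacher model's API; swap in a real client here.
        # The canned reply below just keeps the sketch runnable.
        return "teacher reply for: " + prompt

    prompts = [
        "Explain the difference between DUV and EUV lithography.",
        "Walk through how a mixture-of-experts model routes a token.",
    ]

    # Each line of the output file becomes one supervised training example for the student.
    with open("distillation_data.jsonl", "w") as f:
        for prompt in prompts:
            f.write(json.dumps({"prompt": prompt, "completion": call_teacher(prompt)}) + "\n")

Fine-tuning the student on that file is then ordinary supervised training; everything contentious is in whose model is answering the prompts.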

Distillation seems terrible for leading edge models.

It is! On the … [more]
AI_madness  weaponized_world_stories 
5 days ago by henryfarrell
from Daring Fireball

Chinese AI lab DeepSeek made waves last week when they dropped their new "open reasoning" LLM named R1. But over the weekend the full repercussions of their achievements (plural) began to sink in across the industry. My partner-in-Dithering Ben Thompson has written an extraordinarily helpful FAQ-style explanation at Stratechery today. If you, like me, were looking at the news today and thinking "Jeebus what the hell is going on with this DeepSeek thing?", read Thompson's piece first. Two choice excerpts, first regarding OpenAI:

R1 is notable, however, because o1 stood alone as the only
reasoning model on the market, and the clearest sign that OpenAI
was the market leader.

R1 undoes the o1 mythology in a couple of important ways. First,
there is the fact that it exists. OpenAI does not have some sort
of special sauce that can't be replicated. Second, R1, like all
of DeepSeek's models, has open weights (the problem with saying
"open source" is that we don't have the data that went into
creating it). This means that instead of paying OpenAI to get
reasoning, you can run R1 on the server of your choice, or even
locally, at dramatically lower cost.

Second, regarding DeepSeek's use of distillation (using existing LLMs to train new smaller ones):

Here again it seems plausible that DeepSeek benefited from
distillation, particularly in terms of training R1. That, though,
is itself an important takeaway: we have a situation where AI
models are teaching AI models, and where AI models are teaching
themselves. We are watching the assembly of an AI takeoff scenario
in realtime.

★
ifttt  daringfireball 
6 days ago by josephschmitt