44 bookmarks. First posted by sachaa 7 days ago.


quantum_neural 
3 days ago by zephyr777
A useful overview of the Chinese "DeepSeek" AI model(s) just days after they hit the news.
ai  DeepSeek  china  via:lobsters 
4 days ago by mcherm
ai  china  economics  business  development  blog-posts 
5 days ago by mikael
DeepSeek FAQ
Monday, January 27, 2025

It's Monday, January 27. Why haven't you written about DeepSeek yet?

I did! I wrote about R1 last Tuesday.

I totally forgot about that.

I take responsibility. I stand by the post, including the two biggest takeaways that I highlighted (emergent chain-of-thought via pure reinforcement learning, and the power of distillation), and I mentioned the low cost (which I expanded on in Sharp Tech) and chip ban implications, but those observations were too localized to the current state of the art in AI. What I totally failed to anticipate were the broader implications this news would have for the overall meta-discussion, particularly in terms of the U.S. and China.

Is there precedent for such a miss?

There is. In September 2023 Huawei announced the Mate 60 Pro with a SMIC-manufactured 7nm chip. The existence of this chip wasn't a surprise for those paying close attention: SMIC had made a 7nm chip a year earlier (the existence of which I had noted even earlier than that), and TSMC had shipped 7nm chips in volume using nothing but DUV lithography (later iterations of 7nm were the first to use EUV). Intel had also made 10nm (TSMC 7nm equivalent) chips years earlier using nothing but DUV, but couldn't do so with profitable yields; the idea that SMIC could ship 7nm chips using their existing equipment, particularly if they didn't care about yields, wasn't remotely surprising, to me anyways.

What I totally failed to anticipate was the overwrought reaction in Washington D.C. The dramatic expansion in the chip ban that culminated in the Biden administration transforming chip sales to a permission-based structure was downstream from people not understanding the intricacies of chip production, and being totally blindsided by the Huawei Mate 60 Pro. I get the sense that something similar has happened over the last 72 hours: the details of what DeepSeek has accomplished, and what they have not, are less important than the reaction and what that reaction says about people's pre-existing assumptions.

So what did DeepSeek announce?

The most proximate announcement to this weekend's meltdown was R1, a reasoning model that is similar to OpenAI's o1. However, many of the revelations that contributed to the meltdown, including DeepSeek's training costs, actually accompanied the V3 announcement over Christmas. Moreover, many of the breakthroughs that undergirded V3 were actually revealed with the release of the V2 model last year.

Is this model naming convention the greatest crime that OpenAI has committed?

Second greatest; we'll get to the greatest momentarily.

Let's work backwards: what was the V2 model, and why was it important?

The DeepSeek-V2 model introduced two important breakthroughs: DeepSeekMoE and DeepSeekMLA. The "MoE" in DeepSeekMoE refers to "mixture of experts". Some models, like GPT-3.5, activate the entire model during both training and inference; it turns out, however, that not every part of the model is necessary for the topic at hand. MoE splits the model into multiple "experts" and only activates the ones that are necessary; GPT-4 was a MoE model that was believed to have 16 experts with approximately 110 billion parameters each.

DeepSeekMoE, as implemented in V2, introduced important innovations on this concept, including differentiating between more finely-grained specialized experts, and shared experts with more generalized capabilities. Critically, DeepSeekMoE also introduced new approaches to load-balancing and routing during training; traditionally MoE increased communications overhead in training in exchange for efficient inference, but DeepSeek's approach made training more efficient as well.
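
To make that concrete, here is a minimal sketch of a mixture-of-experts layer in PyTorch. It is illustrative only, not DeepSeek's implementation: the class name, expert count, expert width, and top-k value are all invented for the example. The point is simply that the shared experts run for every token while only a couple of routed experts run per token.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SketchMoE(nn.Module):
        def __init__(self, d_model=512, n_routed=16, n_shared=2, top_k=2):
            super().__init__()
            def expert():
                return nn.Sequential(
                    nn.Linear(d_model, 4 * d_model), nn.GELU(),
                    nn.Linear(4 * d_model, d_model))
            # Fine-grained specialized experts: only top_k of these run for a given token.
            self.routed = nn.ModuleList(expert() for _ in range(n_routed))
            # Shared generalist experts: run for every token.
            self.shared = nn.ModuleList(expert() for _ in range(n_shared))
            self.router = nn.Linear(d_model, n_routed)   # per-token scores over routed experts
            self.top_k = top_k

        def forward(self, x):                            # x: (num_tokens, d_model)
            scores = F.softmax(self.router(x), dim=-1)
            weights, idx = scores.topk(self.top_k, dim=-1)
            shared_out = sum(e(x) for e in self.shared)  # always-active shared experts
            routed_out = []
            for t in range(x.shape[0]):                  # per-token loop for clarity, not speed
                y = torch.zeros_like(x[t])
                for w, i in zip(weights[t].tolist(), idx[t].tolist()):
                    y = y + w * self.routed[i](x[t])     # only the selected experts do any work
                routed_out.append(y)
            return shared_out + torch.stack(routed_out)

With 16 routed experts and a top-k of 2, a token touches only an eighth of the routed-expert parameters. DeepSeek's contribution in V2 was making this kind of routing cheap to train at scale, not the basic mechanism sketched here.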

DeepSeekMLA was an even bigger breakthrough. One of the biggest limitations on inference is the sheer amount of memory required: you need to load both the model itself and the entire context window into memory. Context windows are particularly expensive in terms of memory, as every token requires both a key and corresponding value; DeepSeekMLA, or multi-head latent attention, makes it possible to compress the key-value store, dramatically decreasing memory usage during inference.
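
To see why that matters, here is a back-of-envelope comparison of caching full per-head keys and values versus caching one small latent per token, which is the idea behind multi-head latent attention. Every dimension below is a made-up round number for illustration, not DeepSeek's actual configuration.

    n_layers  = 60        # transformer layers
    n_heads   = 128       # attention heads per layer
    d_head    = 128       # dimension per head
    d_latent  = 512       # compressed per-token latent cached per layer (the MLA idea)
    seq_len   = 128_000   # context window, in tokens
    bytes_per = 2         # 16-bit cache entries

    # Standard attention: cache a key AND a value vector per head, per layer, per token.
    kv_standard = seq_len * n_layers * n_heads * d_head * 2 * bytes_per
    # Latent-attention style: cache one small latent per layer per token and
    # reconstruct keys and values from it on the fly.
    kv_latent = seq_len * n_layers * d_latent * bytes_per

    print(f"full KV cache:   {kv_standard / 1e9:.0f} GB")   # ~503 GB
    print(f"latent KV cache: {kv_latent / 1e9:.0f} GB")     # ~8 GB

The model weights are a one-time cost, but the cache grows with every token of context and every concurrent user, which is why this kind of compression translates directly into cheaper inference.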

I'm not sure I understood any of that.

The key implications of these breakthroughs (and the part you need to understand) only became apparent with V3, which added a new approach to load balancing (further reducing communications overhead) and multi-token prediction in training (further densifying each training step, again reducing overhead): V3 was shockingly cheap to train. DeepSeek claimed the model training took 2,788 thousand H800 GPU hours, which, at a cost of $2/GPU hour, comes out to a mere $5.576 million.

That seems impossibly low.

DeepSeek is clear that these costs are only for the final training run, and exclude all other expenses; from the V3 paper:

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
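
Those figures are at least internally consistent; a quick pass over the numbers quoted above reproduces each of them:

    gpus            = 2048
    hours_per_T_tok = 180_000   # H800 GPU-hours per trillion training tokens, per the paper
    tokens_T        = 14.8      # trillions of tokens in the training set

    pretrain_hours = hours_per_T_tok * tokens_T          # 2,664,000 GPU-hours ("2664K")
    days_per_T_tok = hours_per_T_tok / gpus / 24         # ~3.7 days per trillion tokens
    pretrain_days  = pretrain_hours / gpus / 24          # ~54 days, i.e. "less than two months"
    total_hours    = pretrain_hours + 119_000 + 5_000    # add context extension and post-training
    total_cost     = total_hours * 2                     # at $2 per GPU-hour

    print(f"{pretrain_hours:,.0f} pre-training GPU-hours, {days_per_T_tok:.1f} days per trillion tokens")
    print(f"{pretrain_days:.0f} days of pre-training, {total_hours:,.0f} total GPU-hours, ${total_cost:,.0f}")
    # 2,664,000 pre-training GPU-hours, 3.7 days per trillion tokens
    # 54 days of pre-training, 2,788,000 total GPU-hours, $5,576,000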

So no, you can't replicate DeepSeek the company for $5.576 million.

I still don't believe that number.

Actually, the burden of proof is on the doubters, at least once you understand the V3 architecture. Remember that bit about DeepSeekMoE: V3 has 671 billion parameters, but only 37 billion parameters in the active experts are computed per token; this equates to 333.3 billion FLOPs of compute per token. Here I should mention another DeepSeek innovation: while parameters were stored with BF16 or FP32 precision, they were reduced to FP8 precision for calculations; 2048 H800 GPUs have a capacity of 3.97 exaFLOPS, i.e. 3.97 billion billion floating-point operations per second. The training set, meanwhile, consisted of 14.8 trillion tokens; once you do all of the math it becomes apparent that 2.8 million H800 hours is sufficient for training V3. Again, this was just the final run, not the total cost, but it's a plausible number.
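
If you would rather do that math than take it on faith, here is the back-of-envelope version, using only the figures above. The one added assumption is that real-world utilization of the cluster's peak FP8 throughput is well below 100 percent, which is normal for large training runs.

    flops_per_token   = 333.3e9      # compute per token with 37B active parameters, per the estimate above
    tokens            = 14.8e12      # training set size
    cluster_flops     = 3.97e18      # 2048 H800s at FP8 (3.97 exaFLOPS)
    claimed_gpu_hours = 2.788e6      # DeepSeek's reported figure for the final run
    gpus              = 2048

    total_flops     = flops_per_token * tokens                    # ~4.9e24 FLOPs for the whole run
    ideal_gpu_hours = total_flops / cluster_flops / 3600 * gpus   # if the cluster ran at 100% of peak

    print(f"GPU-hours at 100% of peak: {ideal_gpu_hours / 1e6:.2f}M")                      # 0.71M
    print(f"utilization implied by the claim: {ideal_gpu_hours / claimed_gpu_hours:.0%}")  # 25%

In other words, the claimed 2.788 million GPU-hours only requires the cluster to average roughly a quarter of its theoretical peak, which is why the number is plausible rather than miraculous.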

Scale AI CEO Alexandr Wang said they have 50,000 H100s.

I don't know where Wang got his information; I'm guessing he's referring to this November 2024 tweet from Dylan Patel, which says that DeepSeek had "over 50k Hopper GPUs". H800s, however, are Hopper GPUs; they just have much more constrained chip-to-chip interconnect bandwidth than H100s because of U.S. sanctions.

Here's the thing: a huge number of the innovations I explained above are about overcoming the lack of interconnect bandwidth implied in using H800s instead of H100s. Moreover, if you actually did the math on the previous question, you would realize that DeepSeek actually had an excess of compute; that's because DeepSeek programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is actually impossible to do in CUDA. DeepSeek engineers had to drop down to PTX, a low-level instruction set for Nvidia GPUs that is basically like assembly language. This is an insane level of optimization that only makes sense if you are using H800s.

Meanwhile, DeepSeek also makes their models available for inference: that requires a whole bunch of GPUs above and beyond whatever was used for training.

So was this a violation of the chip ban?

Nope. H100s were prohibited by the chip ban, but not H800s. Everyone assumed that training leading edge models required more interchip bandwidth than the H800 offers, but that constraint is exactly what DeepSeek optimized both their model structure and infrastructure around.

Again, just to emphasize this point, all of the decisions DeepSeek made in the design of this model only make sense if you are constrained to the H800; if DeepSeek had access to H100s, they probably would have used a larger training cluster with far fewer optimizations specifically focused on overcoming the lack of bandwidth.

So V3 is a leading edge model?

It's definitely competitive with OpenAI's 4o and Anthropic's Sonnet-3.5, and appears to be better than Meta's biggest Llama model. What does seem likely is that DeepSeek was able to distill those models to give V3 high quality tokens to train on.

What is distillation?

Distillation is a means of extracting understanding from another model; you can send inputs to the teacher model and record the outputs, and use that to train the student model. This is how you get models like GPT-4 Turbo from GPT-4. Distillation is easier for a company to do on its own models, because it has full access, but you can still do distillation in a somewhat more unwieldy way via API, or even, if you get creative, via chat clients.

Distillation obviously violates the terms of service of various models, but the only way to stop it is to actually cut off access, via IP banning, rate limiting, etc. It's assumed to be widespread in terms of model training, and is why there are an ever-increasing number of models converging on GPT-4o quality. This doesn't mean that we know for a fact that DeepSeek distilled 4o or Claude, but frankly, it would be odd if they didn't.
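
To demystify the mechanics, here is what API-based distillation looks like in its simplest form: collect the teacher's answers to a pile of prompts, then use those pairs as supervised fine-tuning data for the student. This is a generic sketch, not DeepSeek's pipeline; call_teacher is a placeholder for whatever chat or completions API you have access to, and, as noted above, doing this against a commercial model typically violates its terms of service.

    import json

    def call_teacher(prompt: str) -> str:
        # Placeholder for a call to the teacher model's API; swap in a real client here.
        # The canned reply below just keeps the sketch runnable.
        return "teacher reply for: " + prompt

    prompts = [
        "Explain the difference between DUV and EUV lithography.",
        "Walk through how a mixture-of-experts model routes a token.",
    ]

    # Each line of the output file becomes one supervised training example for the student.
    with open("distillation_data.jsonl", "w") as f:
        for prompt in prompts:
            f.write(json.dumps({"prompt": prompt, "completion": call_teacher(prompt)}) + "\n")

Fine-tuning the student on that file is then ordinary supervised training; everything contentious is in whose model is answering the prompts.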

Distillation seems terrible for leading edge models.

It is! On the … [more]
AI_madness  weaponized_world_stories 
5 days ago by henryfarrell
from Daring Fireball

Chinese AI lab DeepSeek made waves last week when they dropped their new "open reasoning" LLM named R1. But over the weekend the full repercussions of their achievements (plural) began to sink in across the industry. My partner-in-Dithering Ben Thompson has written an extraordinarily helpful FAQ-style explanation at Stratechery today. If you, like me, were looking at the news today and thinking "Jeebus what the hell is going on with this DeepSeek thing?", read Thompson's piece first. Two choice excerpts, first regarding OpenAI:

R1 is notable, however, because o1 stood alone as the only
reasoning model on the market, and the clearest sign that OpenAI
was the market leader.

R1 undoes the o1 mythology in a couple of important ways. First,
there is the fact that it exists. OpenAI does not have some sort
of special sauce that can't be replicated. Second, R1, like all
of DeepSeek's models, has open weights (the problem with saying
"open source" is that we don't have the data that went into
creating it). This means that instead of paying OpenAI to get
reasoning, you can run R1 on the server of your choice, or even
locally, at dramatically lower cost.

Second, regarding DeepSeek's use of distillation (using existing LLMs to train new smaller ones):

Here again it seems plausible that DeepSeek benefited from
distillation, particularly in terms of training R1. That, though,
is itself an important takeaway: we have a situation where AI
models are teaching AI models, and where AI models are teaching
themselves. We are watching the assembly of an AI takeoff scenario
in realtime.

★
ifttt  daringfireball 
6 days ago by josephschmitt