Q: How did DeepSeek get around export restrictions? A: They didn’t. They tinkered with the chips they could get, writing low-level code to handle memory as efficiently as possible. They lucked out: once that code was optimized, chip capacity wasn’t actually the bottleneck.

Q: How did DeepSeek train so much more efficiently? A: Mixture-of-experts routing. They used gating formulas (shown in the linked thread) to predict which parameters the model would activate for each token, then trained only those. They need 95% fewer GPUs than Meta because, for each token, they only trained about 5% of their parameters. A minimal sketch of this kind of sparse routing follows.
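Here is a minimal sketch of sparse mixture-of-experts routing, assuming a tiny made-up gate and expert bank (the sizes, top-k value, and weights below are hypothetical, not DeepSeek's actual formulas or configuration): a gate scores all experts for each token, only the top-k experts run and receive gradients, and the rest of the layer's parameters are never touched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not DeepSeek's real configuration.
D_MODEL = 16        # token embedding width
N_EXPERTS = 32      # total experts in the layer
TOP_K = 2           # experts actually used per token (2 of 32)

# One tiny feed-forward "expert" is just a single weight matrix here.
experts = rng.normal(size=(N_EXPERTS, D_MODEL, D_MODEL)) * 0.02
gate_w = rng.normal(size=(D_MODEL, N_EXPERTS)) * 0.02   # router / gating weights


def moe_forward(token: np.ndarray):
    """Route one token through only its top-k experts."""
    # The router scores every expert; softmax turns scores into mixing weights.
    logits = token @ gate_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Keep only the top-k experts. The rest are never evaluated, so their
    # parameters see no compute and no gradient for this token.
    top = np.argsort(probs)[-TOP_K:]
    weights = probs[top] / probs[top].sum()

    out = np.zeros_like(token)
    for w, e in zip(weights, top):
        out += w * (token @ experts[e])
    return out, top


token = rng.normal(size=D_MODEL)
out, used = moe_forward(token)
print(f"token used experts {sorted(used.tolist())} "
      f"({TOP_K}/{N_EXPERTS} = {TOP_K / N_EXPERTS:.1%} of expert parameters)")
```

The saving is exactly that ratio: if the active experts cover roughly 5% of the expert parameters, the other ~95% of the layer does no work for that token.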

Q: How is DeepSeek’s inference so much cheaper? A: They compressed the KV cache via multi-head latent attention. (This was a breakthrough they made a while ago.) A sketch of the compress-then-reconstruct idea follows.
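The general idea behind KV-cache compression, sketched below with made-up dimensions (this is a generic down-project/up-project illustration, not DeepSeek's exact multi-head latent attention equations): at cache-write time each token stores one small latent vector instead of full per-head keys and values, and the keys and values are reconstructed from that latent only when attention actually runs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not DeepSeek's real configuration.
D_MODEL = 64      # hidden size
N_HEADS = 8
D_HEAD = 16       # full K+V per token = 2 * N_HEADS * D_HEAD = 256 floats
D_LATENT = 32     # compressed latent stored in the cache instead (8x smaller)

# Down-projection used at cache-write time, up-projections used at attention time.
w_down = rng.normal(size=(D_MODEL, D_LATENT)) * 0.02
w_up_k = rng.normal(size=(D_LATENT, N_HEADS * D_HEAD)) * 0.02
w_up_v = rng.normal(size=(D_LATENT, N_HEADS * D_HEAD)) * 0.02

kv_cache = []  # stores one small latent vector per past token


def cache_token(hidden: np.ndarray):
    """Store only a compressed latent for this token, not full keys/values."""
    kv_cache.append(hidden @ w_down)          # (D_LATENT,) per token


def attend(query_hidden: np.ndarray):
    """Reconstruct keys/values from the latents when attention actually runs."""
    latents = np.stack(kv_cache)              # (T, D_LATENT)
    k = (latents @ w_up_k).reshape(-1, N_HEADS, D_HEAD)
    v = (latents @ w_up_v).reshape(-1, N_HEADS, D_HEAD)
    q = query_hidden.reshape(N_HEADS, D_HEAD)

    scores = np.einsum("hd,thd->ht", q, k) / np.sqrt(D_HEAD)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("ht,thd->hd", weights, v).reshape(-1)


for _ in range(10):                            # pretend we've generated 10 tokens
    cache_token(rng.normal(size=D_MODEL))

out = attend(rng.normal(size=N_HEADS * D_HEAD))
full = 10 * 2 * N_HEADS * D_HEAD
compressed = 10 * D_LATENT
print(f"cache floats: {compressed} vs {full} uncompressed ({full / compressed:.0f}x smaller)")
```

Since the cache is what dominates memory at long context and large batch sizes, shrinking it this way directly cuts serving cost per token.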

Q: How did they replicate o1? A: Reinforcement learning. Take complicated questions whose answers can be verified automatically (math problems or code), have the model attempt them, and update the model when it gets them right. A toy version of that loop is sketched below.
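Below is a toy version of that loop, with a stand-in "policy" over a handful of candidate answers instead of a real language model. The question, candidate answers, learning rate, and REINFORCE-style update are all illustrative assumptions, not DeepSeek's actual algorithm; the point is only the shape of the loop: sample an answer, check it with an automatic verifier, and nudge the policy toward answers that verify.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "verifiable" task: answer a fixed arithmetic question. A real pipeline
# would sample full solutions from an LLM and verify the final answer (math)
# or run unit tests (code); everything here is a stand-in for the reward loop.
QUESTION = "What is 17 * 24?"
CORRECT = 408
CANDIDATES = np.array([398, 402, 408, 414, 428])   # possible final answers

logits = np.zeros(len(CANDIDATES))                  # the "policy" we update
LR = 1.0


def verify(answer: int) -> float:
    """Binary, automatically checkable reward (no human labeling needed)."""
    return 1.0 if answer == CORRECT else 0.0


for step in range(200):
    # Sample an answer from the current policy.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    i = rng.choice(len(CANDIDATES), p=probs)

    # Reward comes from the verifier, not from a human or a reward model.
    reward = verify(CANDIDATES[i])

    # REINFORCE-style update: push probability toward rewarded answers.
    grad = -probs
    grad[i] += 1.0
    logits += LR * reward * grad

probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(f"{QUESTION} -> most likely answer: {CANDIDATES[probs.argmax()]} "
      f"(p = {probs.max():.2f})")
```

The key property is that the reward signal is cheap and objective: correctness can be checked by a program, so the loop scales without human graders.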

Reference: https://x.com/wordgrammer/status/1883712727073607859