DeepSeek-R1 - A quick dive and short insights
Image credit: Economic Times

DeepSeek-R1, which was announced last week, is an open-source large language model (LLM). What started as a side project at a Chinese hedge fund before being spun out is now reportedly rivaling OpenAI's top offerings, sending shockwaves through the industry and generating much excitement in the tech world. Even the US tech giants are lauding it. Nvidia called it an "excellent AI advancement" and praised DeepSeek for its innovative approach and cost-efficiency, marking a significant milestone in AI development. It's worth noting that the AI chipmaker has been one of the biggest losers in the market since DeepSeek was announced. And it is not only Nvidia: Satya Nadella, CEO of Microsoft (one of the major investors in OpenAI), says it is a big win for tech and that DeepSeek is super impressive.

So why is it a big deal?

The most impressive thing about DeepSeek is its low-cost training and fast inference.

DeepSeek said it trained the model on a cluster of some 2,000 Nvidia H800 GPUs at a cost of about $5.5M. Compared with US-trained models, that is just peanuts. Its accuracy is on par with GPT-class models, as the benchmark results below show. And it is not only the training that is fast; inference is also faster, meaning the time it takes to produce an answer is shorter.

Diving into more detail: DeepSeek's training is reportedly about 45x more efficient.

The main reasons are the following (with technical justification):

  • Use 8-bit (FP8) instead of 32-bit (FP32) floating-point numbers, which gives massive memory savings. DeepSeek cracked this problem by developing a clever system that breaks activations into small tiles and weights into blocks, each with its own scale, and strategically uses high-precision calculations at key points in the network. There is a trade-off in precision compared with FP32, but the accuracy is still good enough for many AI workloads (a block-scaling sketch follows this list).

  • Compress the key-value (KV) cache, which eats up much of the VRAM; they report roughly 93% compression. This is their most innovative development, called Multi-head Latent Attention (MLA). It is a breakthrough in how they handle the KV cache, which is basically how the keys and values of individual tokens are stored within the transformer architecture (a latent-cache sketch follows this list).

  • Do multi-token prediction instead of single-token prediction, which effectively doubles inference speed. This is another major breakthrough: other LLMs do inference by predicting the next token one at a time, while DeepSeek figured out how to predict multiple tokens per step with a reported 85-90% acceptance accuracy. The clever part is that they maintain the complete causal chain of predictions, so the model isn't just guessing; it is making structured, contextual predictions that are verified before being kept (a draft-and-verify sketch follows this list).

  • A Mixture-of-Experts (MoE) design decomposes the big model into smaller expert networks that can run on consumer-grade GPUs, with only a few experts running for any given token. Even though DeepSeek-R1 has 671B parameters (bigger than even the Llama 3 models), only about 37B parameters are active at any given time, enough to fit in the VRAM of two consumer-grade Nvidia RTX 4090 GPUs (under $2,000 each) rather than requiring one or more H100 GPUs, which cost something like $40K each (a top-k routing sketch follows this list).
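To make the block-scaling idea concrete, here is a minimal numpy sketch, my own toy illustration rather than DeepSeek's kernels: the 128x128 tile size and the coarse rounding used as a stand-in for the actual FP8 cast are assumptions.

```python
import numpy as np

FP8_MAX = 448.0  # largest magnitude representable in FP8 (E4M3 format)

def quantize_blockwise(w, block=128):
    """Give every 128x128 tile of the weight matrix its own scale factor, so an
    outlier in one tile does not wreck the precision of every other tile."""
    h, wd = w.shape
    q = np.empty_like(w)
    scales = np.empty((h // block, wd // block), dtype=np.float32)
    for i in range(0, h, block):
        for j in range(0, wd, block):
            tile = w[i:i + block, j:j + block]
            s = max(np.abs(tile).max(), 1e-12) / FP8_MAX
            scales[i // block, j // block] = s
            # Crude stand-in for the FP8 cast: round in the scaled domain.
            # Real kernels store genuine FP8 values plus the per-block scale.
            q[i:i + block, j:j + block] = np.round(tile / s * 16) / 16 * s
    return q, scales

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scales = quantize_blockwise(w)
print("mean absolute quantization error:", float(np.abs(w - q).mean()))
```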
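Next, a minimal sketch of the low-rank "latent" cache idea behind MLA. The dimensions are toy values and the projection matrices are random placeholders; the real MLA also handles rotary embeddings and per-head details that are omitted here.

```python
import numpy as np

# Toy dimensions chosen for illustration only; the real model's sizes differ.
D_MODEL, N_HEADS, D_HEAD, D_LATENT, SEQ = 4096, 32, 128, 512, 2048
rng = np.random.default_rng(0)

# Hypothetical learned projections: one to compress, two to re-expand.
W_down = rng.normal(size=(D_MODEL, D_LATENT)).astype(np.float32) * 0.02
W_up_k = rng.normal(size=(D_LATENT, N_HEADS * D_HEAD)).astype(np.float32) * 0.02
W_up_v = rng.normal(size=(D_LATENT, N_HEADS * D_HEAD)).astype(np.float32) * 0.02

x = rng.normal(size=(SEQ, D_MODEL)).astype(np.float32)

# Standard attention caches full per-head K and V; here we cache only the small
# latent vector and reconstruct K and V from it on the fly.
latent_cache = x @ W_down            # (SEQ, D_LATENT)  <- what actually gets stored
k = latent_cache @ W_up_k            # reconstructed keys
v = latent_cache @ W_up_v            # reconstructed values

full_kv_floats = 2 * SEQ * N_HEADS * D_HEAD   # floats a standard KV cache would hold
mla_floats = SEQ * D_LATENT                   # floats the latent cache holds
print(f"latent cache is {mla_floats / full_kv_floats:.1%} of a full KV cache")
```

With these toy sizes the latent cache is about 6% of a full KV cache, roughly the 93% saving the article mentions.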
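Here is also a toy sketch of the draft-then-verify flavour of multi-token prediction. The "models" below are random stand-ins and the 90% agreement rate is hard-coded; DeepSeek's actual MTP module is a trained extra head described in their papers.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, TARGET_LEN = 100, 20

def main_model(context):
    """Stand-in for the full model's greedy next-token prediction."""
    h = np.random.default_rng(hash(tuple(context)) % (2**32))
    return int(h.integers(VOCAB))

def mtp_head(context):
    """Stand-in for a lightweight extra head that drafts the token *after* next.
    It agrees with the full model most of the time (simulated with noise)."""
    draft = main_model(context + [main_model(context)])
    return draft if rng.random() < 0.9 else int(rng.integers(VOCAB))

context, drafted, accepted = [1, 2, 3], 0, 0
while len(context) < TARGET_LEN:
    t1 = main_model(context)       # token t+1 from the normal head
    t2 = mtp_head(context)         # token t+2 drafted in the same step
    context.append(t1)
    drafted += 1
    # Preserve the causal chain: keep the drafted token only if the full model,
    # now conditioned on t1, predicts the same thing; otherwise discard it.
    if main_model(context) == t2:
        context.append(t2)
        accepted += 1

print(f"accepted {accepted}/{drafted} drafted second tokens")
```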
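Finally, a minimal sketch of top-k expert routing, the mechanism that lets only a fraction of the parameters be active per token. The layer sizes and expert count are toy values, not DeepSeek's actual configuration.

```python
import numpy as np

D, N_EXPERTS, TOP_K = 64, 16, 2   # toy sizes; DeepSeek routes among far more experts
rng = np.random.default_rng(0)

# Each expert is a tiny two-layer MLP; only TOP_K of them run per token.
experts = [(rng.normal(size=(D, 4 * D)) * 0.02, rng.normal(size=(4 * D, D)) * 0.02)
           for _ in range(N_EXPERTS)]
router = rng.normal(size=(D, N_EXPERTS)) * 0.02

def moe_layer(x):
    """Route one token to its top-k experts and mix their outputs by gate weight."""
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]                  # chosen expert indices
    gates = np.exp(logits[top]); gates /= gates.sum()  # softmax over chosen experts
    out = np.zeros(D)
    for g, idx in zip(gates, top):
        w1, w2 = experts[idx]
        out += g * (np.maximum(x @ w1, 0.0) @ w2)      # ReLU MLP expert
    return out

y = moe_layer(rng.normal(size=D))
print(f"expert parameters touched for this token: {TOP_K / N_EXPERTS:.0%}")
```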

These improvements translate into tremendous real-world performance, and the proof is in the benchmark results below:

Benchmark performance of DeepSeek-R1

The above information, along with the full model summary, approach, and experiments, is openly available for anyone to download and read from their GitHub page.

https://github.com/deepseek-ai/DeepSeek-R1/tree/main

There are quite a few tweets and articles from Silicon Valley about DeepSeek, but the one from Pat Gelsinger (former Intel CEO) is important for understanding that we are at a very important point. (As many call it, an AI Sputnik moment!)

Market Reaction

Let's talk about Wall Street's reaction to the DeepSeek announcement, which clearly shows the impact it has created. Until DeepSeek hit the news, the US was talking about the Stargate project, an investment of about $500B to build "a country of geniuses in a data center". Then came the twist in the plot, almost the very next day: ironically, this small Chinese startup wiped out close to $600B of Nvidia's market cap alone, more than the entire Stargate commitment, with overall estimates putting around $1 trillion of market cap wiped out. What is incredible is that all this was achieved not at the hardware level but at the software level, with more efficient algorithms.

Data safety

There is a lot of concern about data safety and protection, but it is actually pretty straightforward.

As long as you use the website https://www.deepseek.com/, the computation obviously happens on servers located in China, and so the data will be stored and used in China.

But if you want to protect your data, there are options: download the model locally using Ollama and run it offline, or use GroqCloud, a US-based service that hosts DeepSeek-R1, in which case the data stays in Groq's cloud. Perplexity is also hosting the DeepSeek-R1 model. None of the data then goes to China, if that is your concern!
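For example, assuming you have already pulled one of the distilled R1 variants with Ollama (e.g. `ollama pull deepseek-r1:7b`) and the local Ollama server is running on its default port, a short Python call keeps everything on your machine:

```python
import requests

# Assumes a local Ollama server (default port 11434) with the deepseek-r1:7b
# distilled model already pulled; nothing leaves your machine.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:7b",
        "prompt": "Explain mixture-of-experts in two sentences.",
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])
```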

Skepticism

There is a lot of criticism, and one popular opinion is that:

"Deepseek rides on first generation LLMs like OpenAI to run its ‘reasoning’ engine. So in a way Open AI and other LLMs laid the nice foundational bed of ‘learning’ from all the world’s data (which has been a resource intensive exercise) to create the knowledge reservoir for a reasoning model like Deepseek to tap into. So, it is natural that the reasoning model is less resource intensive and cheaper to run than the knowledge reservoir. "

But the following facts bust this opinion:

  1. Distillation, which means using one LLM's output data to train another LLM, is banned by OpenAI's terms of service.

  2. Both ChatGPT and DeepSeek have been trained on data from the internet. The internet today contains a lot of ChatGPT output, which has inevitably made its way into DeepSeek's training data.

  3. DeepSeek's efficiency doesn't come from the available training data set but from the "how" of the training. DeepSeek uses reinforcement learning and MoE instead of RLHF with hand-labelled data, the latter being far more labour-intensive.

  4. Even assuming that ChatGPT data was stolen and used for training, it would only explain the lower development cost, not the lower 'running' costs!

But there is one fair point, as mentioned by Sam Altman, that we need to take in here:

"Its easier to innovate on something that you know works compared to creating something new".

But that does not take any credit away from the brilliance that DeepSeek is.

Also quoting Aravind Srinivas (CEO of Perplexity) here:

"There’s a lot of misconception that China “just cloned” the outputs of openai. This is far from true and reflects incomplete understanding of how these models are trained in the first place. DeepSeek R1 has figured out RL finetuning. They wrote a whole paper on this topic called DeepSeek R1 Zero, where no SFT was used. And then combined it with some SFT to add domain knowledge with good rejection sampling (aka filtering). The main reason it’s so good is it learned reasoning from scratch rather than imitating other humans or models."

For all the techies interested in more details, let's dive a bit deeper to understand what he says. The following is the training flow diagram (thanks to @bookwormengr):

DeepSeek Training flow

DeepSeek-R1 follows a very intelligent combination of reasoning-oriented RL (reinforcement learning) and SFT (supervised fine-tuning). They also followed a very intelligent approach, rejection sampling, to generate roughly 800K high-quality data pairs before the final SFT (a sketch of that filtering step follows below). With that quantity of high-quality data, they don't have to steal OpenAI's or anyone else's data.
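To illustrate the rejection-sampling (filtering) step in spirit: sample many candidate answers, keep only those that pass a verifiable check, and use the survivors as SFT pairs. The verifier and the `generate` stub below are my own placeholders, not DeepSeek's actual pipeline.

```python
import re

def verifier(answer: str, reference: str) -> bool:
    """Toy verifiable check: keep a sample only if its final number matches the
    reference (real pipelines also filter on format, language, readability, etc.)."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", answer)
    return bool(nums) and nums[-1] == reference

def rejection_sample(prompt, generate, reference, n=16):
    """Draw n candidates from the RL-tuned model and keep only the verified ones;
    the surviving (prompt, response) pairs become SFT training data."""
    kept = []
    for _ in range(n):
        answer = generate(prompt)
        if verifier(answer, reference):
            kept.append({"prompt": prompt, "response": answer})
    return kept

# 'generate' would be the reasoning model's sampler; a trivial stub is used here.
pairs = rejection_sample("What is 12 * 12?", lambda p: "12 * 12 = 144", "144", n=4)
print(len(pairs), "pairs kept")
```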

The one piece of skepticism that I agree with is the censorship. Chinese politics and sensitive content related to China are completely censored. But on the flip side, it is currently the most neutral model when it comes to American politics. :)

What does the future hold from here

Democratization of AI: AI models had started becoming closed inside labs like OpenAI's. Llama is open but not as good as OpenAI's models. But along comes DeepSeek, with very good performance, now available as an open model.

The result is that adoption is going to be faster. Smaller companies can use it at very low cost (of course, the cloud bill is still there), and it reduces the overall cost of building an AI solution significantly.

Sustainability: Power consumption has always been a problem with AI and Gen-AI technologies, which have drawn heavy criticism for it. Given that DeepSeek shows such vast improvements in compute efficiency, going forward we should see a significant decrease in power consumption. This is against a backdrop in which the American counterparts are discussing nuclear energy as a power source! It also implies that CO2 emissions will be reduced, a great positive for the planet's wellbeing.

For individual engineers: Companies will adopt faster, which means more Gen-AI developers and engineers will be needed. So, engineers reading this, please upskill yourselves and gear up for the opportunities that will arise.

Geopolitical momentum: It is very clear that China is gaining momentum in AI against the US, with more models coming from Qwen (Alibaba), MiniMax, Kimi, and Doubao (ByteDance), all from China. Let's understand that DeepSeek is not unique; its competition is close behind (not far behind), and not from outside but from within its own country. What all this means is that China is far, far ahead of the rest.

So now, coming to the inevitable question: why is India not able to create a DeepSeek?

Firstly, AI is not as hard as one might presume. Unlike building an OS or database software, it is easier to build, because machines learn by themselves provided you give them enough data and compute. Also, everyone trains on the same data: internet archives, books, and GitHub code, for the first stage called "pre-training". The LLM science part is actually quite easy: all these models are decoder-only Transformers, an architecture invented in late 2017. Minor improvements have been made since then, but they are all open source and easy to implement.

So what is the hard part then?

It is the parallel and distributed computing needed to run AI training jobs across thousands of GPUs that is hard. DeepSeek did a lot of innovation here to save on FLOPs and network calls. They used the Mixture-of-Experts architecture and a newer RL approach called GRPO with verifiable rewards, both of which entered the open domain through 2024 (a sketch of the GRPO idea follows below).
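For the curious, here is a minimal sketch of the core GRPO idea: group-relative advantages computed from verifiable rewards, with no learned critic. The numbers are dummies, the loss is shown per sequence rather than per token, and the KL-to-reference term is omitted.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled answer is scored against the other
    answers to the *same* prompt, so no separate value/critic model is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 6 sampled answers to one math prompt, scored by a verifiable reward
# (1.0 if the final answer matches the reference solution, else 0.0).
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
print(adv)   # correct answers get positive advantage, wrong ones negative

def clipped_surrogate(ratio, adv, eps=0.2):
    """PPO-style clipped objective using the group-relative advantage (shown per
    sequence for brevity; the real loss averages over tokens and adds a KL
    penalty towards a reference policy)."""
    return float(np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv).mean())

# ratio = pi_new(answer | prompt) / pi_old(answer | prompt); dummy values here.
print(clipped_surrogate(np.array([1.1, 0.9, 1.3, 0.8, 1.0, 1.05]), adv))
```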

Then why does India not have foundation models?

There is no protected market in which to practice your craft in the early days; you will get replaced by American service providers, who are cheaper and better, every single time. That is not the case with the Chinese players. They have a protected market and leadership that treats this skillset as existential due to geopolitics. So even if Chinese models are not good in the early days, they will continue to get funding from their conglomerates as well as provincial governments.

DeepSeek took two years to get here without much revenue. They were funded by their parent company, and most of their engineers are not even PhDs.

What do we need? We need a national fund that will back such teams, where the only expected output is benchmark performance, with the benchmarks becoming harder every 6 months, and no revenue needed to survive for the first 3 years. Only then are we going to see an AI model being born here.

Karan Sinha

Good article there Kamal! One additional thought that has been lingering with me for some time: over the years, we've seen value creation move from companies playing at the infra layer to those at the application layer, until a new technology breakthrough emerged. This was true with the PC (Intel was the king to start with; Microsoft took over the baton), the Internet (all telco infra in the early days; SaaS apps later), and then Mobile (Apple and Google with their OEM/OS models; then apps!). DeepSeek is another signal that this value shift away from foundational models and chip manufacturers is extremely ripe to move to inference and its applications in the AI era. The world is still figuring it out, and (unlike previously) there won't be one true way these new-era applications will solve customer problems. The amazing Indian tech ecosystem, in my view, has been playing more heavily at the application layer over the years. There's nothing wrong in that, because value creation organically reaches that layer. But that's where the conundrum lies: should the ecosystem stay in that value spectrum and solve real problems with contextual AI applications, or broaden the value spectrum this time and play big at the infra layer too?
