Tuesday, January 28, 2025

DeepSeek low-cost AI hardware design

 28 January 2025, by ChatGPT:

Summary: 

DeepSeek, a Chinese AI company, has set new performance benchmarks with their models, DeepSeek-V3 and DeepSeek-R1, despite U.S. export restrictions on powerful chips.

  • Used H800 GPUs (less expensive than H100).
  • Innovative "Mixture of Experts" approach to reduce computation time and memory usage.
  • Training cost of DeepSeek-V3 was $5.576 million, much lower than GPT-4’s $100 million.

Their chatbot is now the top free app in Apple’s App Store rankings, challenging assumptions about the resources AI requires and impacting Nvidia’s stock as investors reconsider the demand for high-cost GPUs.

DeepSeek’s low-cost achievements highlight how innovation can overcome hardware limitations through creative software and model design.

Let’s break this down further:

Key Takeaways from DeepSeek’s Success

  1. Adaptation to Hardware Constraints:

    • By using the H800 chips (less powerful than the H100), DeepSeek demonstrates that state-of-the-art AI doesn’t always require the latest and most expensive GPUs. Their approach challenges the notion that cutting-edge AI depends solely on access to top-tier hardware.
  2. "Mixture of Experts" Technique:

    • The mixture of experts (MoE) approach is a highly efficient way to optimize AI models. Instead of activating all parameters for every query, only a subset of the model is utilized based on the specific task (a minimal code sketch follows this list). This reduces:
      • Memory usage.
      • Computation time.
      • Energy consumption.
    • It’s a smart way to maximize performance on hardware with limited resources.
  3. Cost Efficiency:

    • Training DeepSeek-V3 for $5.576 million compared to GPT-4’s reported $100 million shows a monumental improvement in cost-effectiveness. This could democratize AI development by making high-performance models accessible to companies with smaller budgets.
  4. Commercial Success:

    • DeepSeek's chatbot reaching the top of Apple's free-app rankings signals strong user adoption and satisfaction. This could be due to its performance, efficiency, and possibly lower operational costs passed on as savings to users.
  5. Market Impact:

    • DeepSeek's breakthrough is reshaping industry expectations, including those of investors. Nvidia’s stock reaction suggests that investors are reconsidering demand for ultra-high-end GPUs like the H100, especially if cost-efficient solutions like the H800 or custom hardware can deliver comparable performance.
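
To make the Mixture of Experts idea from point 2 concrete, here is a minimal, hypothetical sketch of top-k expert routing written in PyTorch. The layer sizes, number of experts, and top-k value are illustrative assumptions, not DeepSeek's actual configuration; the point is only that each input runs through a small subset of the experts while the rest of the parameters stay idle.

    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        """Illustrative top-k Mixture-of-Experts layer (not DeepSeek's design)."""
        def __init__(self, d_model=64, n_experts=8, top_k=2):
            super().__init__()
            self.experts = nn.ModuleList(
                [nn.Linear(d_model, d_model) for _ in range(n_experts)]
            )
            self.router = nn.Linear(d_model, n_experts)  # gating network
            self.top_k = top_k

        def forward(self, x):                        # x: (batch, d_model)
            scores = self.router(x)                  # (batch, n_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)
            weights = weights.softmax(dim=-1)        # normalize over chosen experts
            out = torch.zeros_like(x)
            # Only the selected experts run; the others do no work for this input.
            for b in range(x.size(0)):
                for slot in range(self.top_k):
                    e = idx[b, slot].item()
                    out[b] += weights[b, slot] * self.experts[e](x[b])
            return out

    moe = TinyMoE()
    y = moe(torch.randn(4, 64))   # each of the 4 inputs uses only 2 of the 8 experts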

Implications for the AI Industry

  • Resource Efficiency: DeepSeek’s success suggests that innovation in AI isn’t solely tied to hardware but can also come from algorithmic improvements and smarter model architectures.
  • Global Competition: Despite export restrictions, companies like DeepSeek are showing they can remain competitive, signaling a shift in how global AI capabilities are evaluated.
  • Broader Accessibility: With reduced costs for training and deployment, smaller players in the AI industry may be inspired to adopt similar approaches, fostering innovation at all levels.


Why Use GPUs?

Even with constraints, GPUs remain the primary hardware for AI because:

  1. Parallel Processing: GPUs excel at parallelism, crucial for matrix operations and large-scale computations in AI training and inference.
  2. Scalability: H800 GPUs, though less advanced, can be scaled across multiple units to handle large models.
  3. Optimization: Techniques like the "mixture of experts" used by DeepSeek allow better utilization of GPU resources, making them sufficient for high-performance AI.
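
As a rough illustration of the parallel-processing point above, the sketch below (assuming PyTorch and, optionally, a CUDA-capable GPU) times the same large matrix multiplication on CPU and GPU. The matrix size is arbitrary.

    import time
    import torch

    n = 4096
    a = torch.randn(n, n)
    b = torch.randn(n, n)

    # Matrix multiply on the CPU
    t0 = time.time()
    c_cpu = a @ b
    print(f"CPU: {time.time() - t0:.3f} s")

    # Same multiply on the GPU, if one is visible
    if torch.cuda.is_available():
        a_gpu, b_gpu = a.cuda(), b.cuda()
        torch.cuda.synchronize()
        t0 = time.time()
        c_gpu = a_gpu @ b_gpu
        torch.cuda.synchronize()   # wait for the asynchronous kernel to finish
        print(f"GPU: {time.time() - t0:.3f} s")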

Differences Between H800 and H100 GPUs

  • Performance: The H800 has lower memory bandwidth and reduced overall compute capacity compared to the H100.
  • Export Compliance: The H800 is specifically designed to comply with U.S. export restrictions, making it available for use in countries like China.
  • Energy Efficiency: The H800 may consume less power, which aligns well with DeepSeek’s focus on cost-effective training and deployment.

How DeepSeek Optimized Their GPU Usage

DeepSeek’s innovation isn’t about avoiding GPUs but rather maximizing efficiency:

  1. Mixture of Experts (MoE): Instead of using all parameters in their models, they dynamically activate only the parts needed for specific queries, reducing GPU load.
  2. Resource Management: By designing models to require fewer computations per task, they avoid pushing GPUs to their limits.
  3. Distributed Computing: DeepSeek likely used clusters of H800 GPUs to parallelize workloads effectively (a minimal sketch follows below).
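
The distributed-computing point can be illustrated with a generic data-parallel training loop using PyTorch's DistributedDataParallel. This is a standard pattern, not DeepSeek's actual training code; the model, batch size, and dimensions are placeholders.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # Launch with: torchrun --nproc_per_node=<num_gpus> train.py
        dist.init_process_group("nccl")
        rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(rank)

        model = torch.nn.Linear(1024, 1024).cuda(rank)   # placeholder model
        model = DDP(model, device_ids=[rank])            # syncs gradients across GPUs
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

        for step in range(10):                           # toy training loop
            x = torch.randn(32, 1024, device=f"cuda:{rank}")
            loss = model(x).pow(2).mean()
            loss.backward()                              # gradient all-reduce runs here
            opt.step()
            opt.zero_grad()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()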

Could They Use Something Other Than GPUs?

In the future, companies like DeepSeek might explore alternatives such as:

  • AI-Specific Chips: Custom ASICs (Application-Specific Integrated Circuits) or TPUs (Tensor Processing Units) optimized for their models.
  • RISC-V Architecture: Open-source hardware platforms allowing custom AI accelerators.

For now, GPUs like the H800 remain a cornerstone for AI workloads, and DeepSeek's innovations show how to make the most out of them. 



1. Basics of AI Hardware

AI hardware is optimized for matrix operations (e.g., multiplying large arrays of numbers) and parallel processing. The core components are:

  • Processing Units:

    • GPUs (Graphics Processing Units): The backbone of most AI hardware, designed for parallel computation. Examples: NVIDIA H100, H800.
    • TPUs (Tensor Processing Units): Custom chips designed by Google for tensor-heavy workloads.
    • ASICs (Application-Specific Integrated Circuits): Highly specialized chips for specific AI tasks.
    • FPGAs (Field-Programmable Gate Arrays): Flexible hardware that can be programmed for various AI tasks but requires more customization.
  • Memory (RAM):

    • High-bandwidth memory (HBM) is critical for storing and processing massive datasets. Faster memory reduces bottlenecks.
  • Storage:

    • NVMe SSDs: Used for high-speed access to training data.
    • Larger storage arrays are needed for large-scale AI models.
  • Interconnects:

    • High-speed connections between GPUs and other components. Examples: NVIDIA’s NVLink or PCIe.
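
To inspect some of these specifications on a real machine, the short sketch below (assuming PyTorch is installed) queries each visible GPU for its name, memory, and streaming-multiprocessor count.

    import torch

    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            p = torch.cuda.get_device_properties(i)
            print(f"GPU {i}: {p.name}, "
                  f"{p.total_memory / 1024**3:.1f} GiB memory, "
                  f"{p.multi_processor_count} SMs")
    else:
        print("No CUDA-capable GPU visible")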

2. Most Expensive Part of AI Hardware

The most expensive part is usually the GPUs or specialized processors (e.g., TPUs or ASICs), followed by memory. Here's why:

  1. GPUs:

    • High-end GPUs like NVIDIA’s H100 can cost $30,000–$40,000 per unit.
    • Large-scale AI systems use thousands of GPUs, which makes hardware costs skyrocket.
  2. Memory:

    • High-bandwidth memory (HBM) is very expensive but essential for AI workloads.
    • AI systems with large models require a lot of memory to store weights and activations.
  3. Energy Costs (Operational Expense):

    • Powering and cooling the GPUs is another major expense, especially in high-power data centers.
  4. Infrastructure:

    • Building and maintaining data centers with the necessary networking and storage adds to costs.
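
To get a feel for how quickly these line items add up, here is a back-of-the-envelope estimate. The cluster size, unit price, power draw, and electricity rate below are illustrative assumptions, not reported figures.

    # Illustrative cluster cost estimate -- every figure here is an assumption
    num_gpus       = 2_000       # hypothetical cluster size
    gpu_unit_price = 35_000      # USD, within the $30k-$40k range cited above
    gpu_power_kw   = 0.7         # per-GPU draw, ignoring cooling overhead
    hours_per_year = 24 * 365
    price_per_kwh  = 0.10        # USD

    capex  = num_gpus * gpu_unit_price
    energy = num_gpus * gpu_power_kw * hours_per_year * price_per_kwh

    print(f"GPU purchase cost:  ${capex:,.0f}")    # $70,000,000
    print(f"Yearly electricity: ${energy:,.0f}")   # ~$1,226,400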

3. How DeepSeek Reduced Costs

DeepSeek achieved cost savings by addressing three key areas: hardware usage, model efficiency, and operational costs.

A. Optimizing Hardware Usage

  • Using H800 GPUs:
    • DeepSeek used the less powerful and cheaper NVIDIA H800 GPUs, rather than the top-of-the-line H100 GPUs used by companies like OpenAI.
    • The H800 has lower peak performance but costs significantly less and consumes less energy.
  • Scalability:
    • Instead of relying on the raw power of individual GPUs, they likely scaled their workload across many GPUs, optimizing utilization through distributed training.

B. Efficient Model Design

  • Mixture of Experts (MoE):

    • DeepSeek used this innovative architecture where only parts of the model ("experts") are activated for each query.
    • This drastically reduces the number of computations per task, cutting down on the need for GPU power and memory.
    • MoE allows a smaller subset of the model to be trained or used at a time, saving time and energy.
  • Lighter Models:

    • By designing a model that uses fewer parameters for similar performance, they avoided the computational intensity of models like GPT-4.
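
A quick calculation makes the "lighter per query" point concrete. DeepSeek-V3 is widely reported to have roughly 671 billion total parameters while activating only about 37 billion per token through MoE routing; taking those reported figures as given:

    total_params  = 671e9   # reported total parameters of DeepSeek-V3
    active_params = 37e9    # reported parameters activated per token (MoE routing)

    print(f"Active per token: {active_params / total_params:.1%} of the model")  # ~5.5%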

C. Lower Operational Costs

  • Energy Efficiency:
    • By reducing computation, DeepSeek lowered the energy needed for both training and inference, reducing electricity and cooling costs.
  • Reduced Training Time:
    • Their efficient architecture likely allowed them to train models faster, cutting down on the expensive GPU hours required.
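
The headline training figure follows directly from GPU hours. DeepSeek's technical report is commonly cited as quoting about 2.788 million H800 GPU hours at roughly $2 per GPU hour; taking those reported numbers as given, the arithmetic is simply:

    gpu_hours     = 2_788_000   # reported H800 GPU hours for DeepSeek-V3 training
    cost_per_hour = 2.0         # USD per GPU hour, the commonly cited rental rate

    print(f"Estimated training cost: ${gpu_hours * cost_per_hour:,.0f}")  # $5,576,000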

D. Cost Awareness in Chip Selection

  • U.S. export limitations pushed DeepSeek toward the H800, but this cost-conscious chip choice ended up benefiting their budget without sacrificing performance.

4. How This Compares to OpenAI

Aspect           | DeepSeek                        | OpenAI (GPT-4)
Hardware         | H800 GPUs (cheaper)             | H100 GPUs (top-tier)
Training Cost    | ~$5.6 million                   | ~$100 million
Model Efficiency | Mixture of Experts (MoE)        | Fully dense model (all parameters active)
Energy Usage     | Optimized, reduced computation  | Higher energy consumption

Conclusion

DeepSeek reduced costs by:

  1. Using less expensive hardware (H800 GPUs).
  2. Employing a more efficient architecture (Mixture of Experts) to reduce computation.
  3. Lowering energy and operational expenses through optimized usage.

This approach shows that smart design can substitute for brute force when building AI models for certain applications.

However, performance remains the most challenging aspect of AI development. Even a 5% increase in performance can require significant resources, and cost reduction cannot replace the need for high-quality, reliable AI models for complex tasks. ChatGPT (another example is Merlin AI, which primarily uses models from OpenAI and Google*), with its superior performance in coding, NLP, and complex tasks, shows that higher costs are often justified by accuracy, reliability, and versatility.

While DeepSeek’s innovations demonstrate how software design can overcome hardware constraints, performance will always be the key driver in AI success.

*Note by the poster: I use the free latest versions of ChatGPT and Merlin. I appreciate ChatGPT’s coding capabilities (including Python, MATLAB, HTML, JavaScript, Java, CSS, C++, C, C#, React, Node.js, SQL, PHP, Ruby, R, Perl, Shell scripting, and more), as it maintains consistent performance and never disappoints. As of January 2025, I am particularly impressed by Merlin’s business model, which allowed me to upload and analyze three different large PDF files as a single input. This capability—processing multiple large files with different formats, such as JPG, HTML, and PlantUML as a unified dataset—is unique among AI platforms.

ChatGPT continues to excel in coding with stable performance and works as an all-in-one tool. Copilot runs locally on my PC and performs well, but due to its free-edition limitations, it cannot handle large text inputs or process PDF files for me. For me, essential performance is not about speed but about the quality of problem solving.

I tested DeepSeek in my shell (as shown in the screenshot below), and it answered a physics question perfectly. However, I have not used DeepSeek extensively enough to fully evaluate its performance. Based on the established performance data I reviewed, DeepSeek performs at least 10% lower than ChatGPT in essential fields and across many libraries. Its shortcomings in certain areas are, for me, an unforgivable flaw. My admiration for AI is not just about specialized models but rather about all-in-one capability.

I am unsure about DeepSeek's performance in tasks requiring creativity or working with unlabeled data. While it may excel in structured tasks, its effectiveness in open-ended, creative scenarios depends on the specific use case and data type.


Screenshot: DeepSeek answering prompts in the shell
