Running Large Language Models (LLMs) on Different Hardware -- Performance Comparison and Analysis

This article examines the performance differences when running large language models (LLMs) on various hardware, from a low-cost Raspberry Pi to a high-end AI workstation. By looking at actual test data, we can see how hardware choices affect LLM inference speed and overall usability.
Low-End Hardware: Raspberry Pi
- Running LLaMA 3.1 on Raspberry Pi 4 (8GB RAM) is feasible, but practicality is very limited.
- Because the Raspberry Pi has no GPU usable for this workload, the model runs entirely on the CPU, making model loading and inference extremely slow: only about one word per second.
- While running LLaMA 3.1, the Raspberry Pi 4's CPU usage sits at 100%, the temperature climbs, and memory usage is about 6GB.
- This performance clearly cannot support real-time interaction, so the user experience is very poor.
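To put roughly one word per second in perspective, a quick back-of-the-envelope calculation shows why this rules out real-time chat. This is only a sketch: the 200-word reply length and the 30 words/sec GPU figure are assumed examples, not numbers from the tests.

```python
def response_time_seconds(n_words: int, words_per_second: float) -> float:
    """Time to generate a reply of n_words at a given throughput."""
    return n_words / words_per_second

# Assumed example: a typical 200-word chat reply.
reply_words = 200

# Raspberry Pi 4, CPU-only: ~1 word/sec (from the test above).
pi_time = response_time_seconds(reply_words, 1.0)    # 200 s, over 3 minutes

# A GPU-backed setup at an illustrative 30 words/sec.
gpu_time = response_time_seconds(reply_words, 30.0)  # ~6.7 s

print(f"Pi 4: {pi_time:.0f} s, GPU: {gpu_time:.1f} s")
```

Waiting minutes for a single answer is what makes the Pi impractical for interactive use, even though the model technically runs.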
Mid-Range Hardware: Mini PC
- The Orion herk mini PC (Ryzen 9 7940HS, Radeon 780M GPU) provides a noticeably smoother experience.
- On the herk, LLaMA 3.1 inference speed is comparable to ChatGPT, which gives it real practical value.
- However, although the herk has a Radeon 780M, the GPU's 6GB VRAM limit means LLaMA 3.1 cannot be loaded onto it, so inference falls back to the CPU.
- Even the smaller LLaMA 3.2 model (about 2GB) could not be run on the GPU in this test.
- The lesson: even an integrated GPU needs enough VRAM to hold the model before it can accelerate inference.
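A rough way to reason about whether a model fits in VRAM is to estimate its weight footprint from parameter count and quantization width, plus some headroom for the KV cache and runtime buffers. This is a simplified sketch; the 20% overhead factor is an assumption, not a measured value, and real loaders differ.

```python
def model_fits_in_vram(n_params: float, bits_per_weight: int,
                       vram_gb: float, overhead: float = 0.20) -> bool:
    """Estimate whether quantized weights (plus assumed overhead for the
    KV cache and runtime buffers) fit in the given amount of VRAM."""
    weight_gb = n_params * bits_per_weight / 8 / 1e9
    return weight_gb * (1 + overhead) <= vram_gb

# LLaMA 3.1 8B at 8-bit quantization (~8GB of weights) vs 6GB of VRAM:
print(model_fits_in_vram(8e9, 8, 6.0))   # does not fit

# A ~2GB model leaves comfortable headroom in the same 6GB:
print(model_fits_in_vram(2e9, 8, 6.0))   # fits
```

The general point stands regardless of the exact overhead factor: when the weights alone approach the VRAM capacity, the runtime falls back to the CPU.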
High-End Hardware: Gaming PC and Workstation
- A desktop with an Nvidia RTX 4080 GPU (paired with a Threadripper 3970X) performs very well when running LLaMA 3.1.
- The RTX 4080's utilization reaches 75% to 100%, inference is noticeably faster than ChatGPT, and the experience is smooth.
- This shows the clear advantage discrete graphics cards have when running large LLMs.
- A Mac Pro with the M2 Ultra chip also performs strongly, with GPU utilization around 50% and very fast inference.
- This indicates Apple Silicon is also competitive for running LLMs.
Ultra High-End Hardware: AI Workstation
- A 96-core Threadripper workstation with an Nvidia RTX 6000 Ada graphics card and 512GB of RAM can run the largest LLaMA 3.1 model (405 billion parameters).
- Even on this hardware, though, such a massive model is still extremely slow, similar to the Raspberry Pi experience.
- This shows that model size can matter as much as the hardware itself.
- Running the much smaller and more efficient LLaMA 3.2 model (about 2GB) on the same workstation, inference becomes extremely fast.
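A common rule of thumb helps explain this pattern: single-stream LLM generation is usually memory-bandwidth bound, because every generated token requires streaming essentially all of the weights through memory. That gives a rough ceiling of (memory bandwidth) / (model size) tokens per second. The sketch below uses illustrative bandwidth figures that are assumptions, not measurements from the tests.

```python
def tokens_per_sec_ceiling(model_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on generation speed when each token requires
    reading all model weights from memory (memory-bound regime)."""
    return bandwidth_gb_s / model_gb

# A ~2GB model held entirely in fast GPU memory (~900 GB/s assumed):
print(tokens_per_sec_ceiling(2.0, 900.0))    # 450 tokens/s ceiling

# A 405B model at 4-bit (~200GB) spilling into system RAM (~100 GB/s assumed):
print(tokens_per_sec_ceiling(200.0, 100.0))  # 0.5 tokens/s ceiling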
Conclusion
- Choosing the right hardware for LLM is crucial as it directly affects model performance and usability.
- While low-end hardware can run small LLMs, a powerful GPU with ample VRAM is essential for large ones.
- Even with high-end hardware, model size significantly affects inference speed, so choosing models suitable for your needs is also crucial.
Hardware Performance Comparison
To more intuitively show performance differences across hardware platforms, we can create a simple table:
| Hardware Platform | CPU | GPU | Memory | LLaMA 3.1 Inference Speed | LLaMA 3.2 Inference Speed |
|---|---|---|---|---|---|
| Raspberry Pi 4 | 4-core | None | 8GB | Very slow (~1 word/sec) | Not tested |
| Orion herk | Ryzen 9 7940HS | Radeon 780M (6GB) | 32GB | Comparable to ChatGPT | Relatively fast |
| Threadripper 3970X | 32-core | Nvidia RTX 4080 | 128GB | Faster than ChatGPT | Very fast |
| Mac Pro | M2 Ultra | Integrated GPU | 128GB | Very fast | Not tested |
| Threadripper (96-core) | 96-core | Nvidia RTX 6000 Ada | 512GB | Very slow (405B parameter model) | Extremely fast |
Note: Inference speed descriptions in the table are relative; actual performance is affected by various factors including model version, software configuration, and test environment.
Recommendations for Future Hardware Choices
- If budget is limited and you only need to run small LLMs, a mini PC with integrated GPU is a good choice.
- If you need to run large LLMs or want higher performance, invest in a discrete GPU with ample VRAM and plenty of system memory.
- For professional use, AI workstations offer the highest performance and flexibility, but at higher cost.