Distributed AI Models: Training and Inference Across Multiple Nodes

Distributed Inference:

Imagine having a language model so large it can’t fit on a single computer. That’s where distributed inference comes in. It’s like dividing a big task among multiple workers. vLLM is a tool that helps break down these massive models and spread them across multiple GPUs or even entire machines, making it possible to work with them efficiently.

vLLM:

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8.
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
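
To make this concrete, here is a minimal sketch of offline inference with vLLM's Python API. The model id and the prompts are just illustrative placeholders; any model vLLM supports works the same way.

```python
# Minimal offline-inference sketch with vLLM; model id and prompts are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # loads the model onto the available GPU
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "Distributed inference is useful because",
    "PagedAttention manages KV-cache memory by",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```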


How to decide the distributed inference strategy?

Before going into the details of distributed inference and serving, let’s first clarify when to use distributed inference and what strategies are available. The common practice is:

  • Single-Node Multi-GPU (tensor parallel inference): If your model is too large to fit in a single GPU, but it can fit in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, you can set the tensor parallel size to 4.
  • Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference): If your model is too large to fit in a single node, you can use tensor parallelism together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2 (see the sketch after this list).
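
For the single-node case, the sketch below shows how the tensor parallel size is passed to vLLM's Python API via the tensor_parallel_size argument; the model id is a placeholder and the value 4 assumes four GPUs on the node.

```python
# Sketch: tensor-parallel inference across 4 GPUs on one node.
# The model id is illustrative; tensor_parallel_size should match the GPUs you want to use.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-13b-hf", tensor_parallel_size=4)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```

For the multi-node case, the same idea carries over to the serving command: something like `vllm serve <model> --tensor-parallel-size 8 --pipeline-parallel-size 2`, run on a Ray cluster that spans both nodes.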

Why do we need distributed training?

Deep learning is, at its core, large-scale linear algebra, and that is computationally expensive. The problem arises when we have a lot of training data, which is very common in deep learning; in such cases training can take months, or even years, on a single machine, even one with a powerful accelerator.
When we tackle complex problems involving images, audio, or text, we use models with complex architectures to get better results. During training, these models compute and store millions or billions of weight parameters, which can cause memory and storage issues. A single machine can also crash mid-training and lose all progress, a risk that grows the longer the training run lasts.
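
Distributed training is a bigger topic than this post can cover, but as a rough illustration of the simplest strategy (data parallelism), here is a minimal PyTorch DistributedDataParallel sketch. The model, batch, loss, and checkpoint path are all placeholders, and it assumes a torchrun launch with one process per GPU.

```python
# Minimal data-parallel training sketch, launched with e.g.
#   torchrun --nproc_per_node=4 train.py
# The model, data, and checkpoint path are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])             # gradients averaged across GPUs
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(1000):
        x = torch.randn(32, 1024, device=local_rank)        # placeholder batch
        loss = model(x).square().mean()                      # placeholder loss
        optimizer.zero_grad()
        loss.backward()                                       # gradient all-reduce happens here
        optimizer.step()

        # Periodic checkpoints (rank 0 only) protect against losing progress on a crash.
        if step % 200 == 0 and dist.get_rank() == 0:
            torch.save(model.module.state_dict(), "checkpoint.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```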
