基于vLLM本地部署企业级DeepSeek-R1

1.vLLM

vLLM是伯克利大学LMSYS组织开源的大语言模型高速推理框架,旨在极大地提升实时场景下的语言模型服务的吞吐与内存使用效率。vLLM是一个快速且易于使用的库,用于 LLM 推理和服务,可以和HuggingFace 无缝集成。vLLM利用了全新的注意力算法「PagedAttention」,有效地管理注意力键和值。

2.演示环境

../../_images/2025-03-11_144132.png

2.1 环境设置

2.1.1 install miniconda

Installing Miniconda - Anaconda

mkdir ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3

../../_images/2025-03-11_145436.png

2.1.1 激活miniconda

~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
source ~/.bashrc
source ~/.zshrc

../../_images/2025-03-11_152026.png

../../_images/2025-03-11_152207.png

2.1.2 修改镜像源

vim ~/miniconda3/.condarc

../../_images/2025-03-11_160314.png

show_channel_urls: true
channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch-lts: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud

2.1.3 创建conda虚拟环境

conda create --name vLLM python==3.10 -y
conda activate vLLM

../../_images/2025-03-11_164411.png

2.1.4 安装驱动

sudo apt update
sudo apt upgrade -y
sudo apt install -y build-essential dkms
sudo update-initramfs -u

../../_images/2025-03-11_165219.png

NVIDIA GeForce 驱动程序 - N 卡驱动 | NVIDIA

../../_images/2025-03-11_165713.png

../../_images/2025-03-11_170509.png

../../_images/2025-03-11_170652.png

reboot
conda activate vLLM
nvcc --version ## check the cuda version
nvidia-smi

../../_images/2025-03-11_171500.png

CUDA Toolkit Archive | NVIDIA Developer

../../_images/2025-03-11_171839.png

../../_images/2025-03-11_172002.png

../../_images/2025-03-11_172431.png

sudo vim ~/.bashrc

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64
export PATH=$PATH:/usr/local/cuda-12.4/bin
export CUDA_HOME=$CUDA_HOME:/usr/local/cuda-12.4

source ~/.bashrc

../../_images/2025-03-11_172906.png

../../_images/2025-03-11_173321.png

download.pytorch.org/whl/torch/

../../_images/2025-03-11_174452.png

../../_images/2025-03-11_175221.png

../../_images/2025-03-11_175411.png

../../_images/2025-03-11_175601.png

2.2 部署模型

2.2.1 下载模型方式1

deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B at main

../../_images/2025-03-12_091422.png

2.2.2 下载模型方式2

conda activate vLLM

pip install modelscope
sudo mkdir -p /data/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local_dir /data/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

2.2.3 运行

conda activate vLLM

CUDA_VISIBLE_DEVICES=0 vllm serve /data/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --tensor-parallel-size 1 --max-model-len 32768 --enforce-eager

参考:

【保姆级教程4】基于vLLM本地部署企业级DeepSee-R1,30分钟手把手教学,小白_码农皆宜!附 - 4_哔哩哔哩_bilibili