While trying to build a RAG pipeline for some documents, I had to solve the problem of running vLLM on the CPU.
By default, vLLM assumes that you run with a GPU, and there are prebuilt images (docs) at https://hub.docker.com/r/vllm/vllm-openai that make testing and deployment easy.
For running on the CPU, there are also prebuilt images (docs) at https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo. However, these images assume that avx512f is available.
Since my compute server runs on an i9-14900K, which has no support for AVX-512 instructions, the only solution here is to build an image that does NOT assume AVX-512 support, following the instructions for building the image from source.
Here, the Dockerfile is a little bit finicky; for example, building from a Git submodule doesn’t work 1, and the image must be built in a clone of the repository.
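As a rough sketch, the build looks like the following; note that the Dockerfile path and the VLLM_CPU_DISABLE_AVX512 build argument are assumptions based on what I understand recent vLLM versions ship, so check them against the version you actually clone:

```sh
# Build inside a full clone of the repository (building from a submodule does not work).
git clone https://github.com/vllm-project/vllm.git
cd vllm

# Force the AVX2 code path even if the build machine happens to support AVX-512.
# Older versions keep the Dockerfile at Dockerfile.cpu in the repository root.
docker build \
  -f docker/Dockerfile.cpu \
  --build-arg VLLM_CPU_DISABLE_AVX512=true \
  --shm-size=4g \
  -t vllm-cpu-avx2 \
  .
```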
To avoid having to build the image every time, I have automated the build in southball/vllm-avx2-docker, so the CPU-only image that does not assume AVX-512 can be used by pulling ghcr.io/southball/vllm-avx2-docker (if you trust it).
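Serving an embedding model from the image then looks roughly like this; the tag, port, and flags are only an example, and the exact arguments accepted depend on the vLLM version baked into the image:

```sh
# Pull the prebuilt CPU-only image and serve an embedding model on port 8000
# through vLLM's OpenAI-compatible API.
docker pull ghcr.io/southball/vllm-avx2-docker:latest
docker run --rm -p 8000:8000 \
  ghcr.io/southball/vllm-avx2-docker:latest \
  --model Qwen/Qwen3-Embedding-0.6B
```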
vLLM with only AVX2 performs reasonably well: a quick benchmark with

```sh
oha -z 10s 'http://localhost:8000/v1/embeddings' -d '{"model":"Qwen/Qwen3-Embedding-0.6B","input":"Test?"}' -H 'Content-Type: application/json' -m POST -c 32
```

reported about 80 requests/second handled 2, while performing the same test with Ollama on an RTX 3060 yielded about 120 requests/second.
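As a quick sanity check before running the benchmark, the same payload can be sent once with curl against the endpoint vLLM exposes:

```sh
# Single request against the OpenAI-compatible embeddings endpoint.
curl http://localhost:8000/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen/Qwen3-Embedding-0.6B", "input": "Test?"}'
```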
vLLM powered by OpenVINO (vllm-project/vllm-openvino) might perform better; however, it seems not to support running embedding models - when tested, it produces the same error as in vllm-project/vllm#13287.
Although it’s a bit hard to get working, vLLM runs perfectly fine on the CPU, so if you have to do so, hopefully this is a useful reference for you.
The i9-14900K in the server is throttled to 65W, so its performance should be lower than that of a regular i9-14900K. ↩