Overview
In this guide, we’ll use vLLM running inside a container to serve Granite models.
Prerequisites
- A container runtime such as Docker Desktop or Podman
- An NVIDIA GPU with drivers installed
1. Pull the vLLM container image
Pull the vllm/vllm-openai:v0.10.2 image. (Granite models require vLLM version 0.10.2 or later.)
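For example, with Docker (use podman in place of docker if you run Podman):

```bash
docker pull vllm/vllm-openai:v0.10.2
```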
2. Run Granite in the container
Run the container with your Hugging Face cache mounted. You can pre-download the model into ~/.cache/huggingface; if not pre-downloaded, the model will be fetched automatically when vLLM starts and cached in the mounted directory.
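A minimal sketch of the run command is shown below. It assumes the default vLLM port 8000, the standard Hugging Face cache location, and uses ibm-granite/granite-4.0-h-small as a placeholder model name; substitute the Granite model you intend to serve.

```bash
# Optional: pre-download the model into the local Hugging Face cache
# (the model name is a placeholder; substitute the Granite model you want to serve)
huggingface-cli download ibm-granite/granite-4.0-h-small

# Serve the model with vLLM; the host cache is mounted so downloads are reused
docker run --rm --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:v0.10.2 \
  --model ibm-granite/granite-4.0-h-small
```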
3. Run a sample request
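Once the server is up, you can send a request to its OpenAI-compatible endpoint. The sketch below assumes the server from the previous step is listening on localhost:8000 and uses the same placeholder model name:

```bash
# Send a chat completion request to the running vLLM server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ibm-granite/granite-4.0-h-small",
        "messages": [{"role": "user", "content": "What is the capital of France?"}]
      }'
```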
4. Enable tool calling and other extended capabilities
To run vLLM with the Granite 4.0 models and enable capabilities such as tool calling, add the parameters --tool-call-parser hermes and --enable-auto-tool-choice. Refer to the vLLM documentation for more details on these and other parameters.
Now run the container with these parameters appended to the docker run command shared in Section 2:
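Building on the sketch from Section 2 (same placeholder model name and port assumptions):

```bash
# Same run command as in Section 2, with the tool-calling parameters added
docker run --rm --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:v0.10.2 \
  --model ibm-granite/granite-4.0-h-small \
  --tool-call-parser hermes \
  --enable-auto-tool-choice
```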