Overview
In this guide, we’ll use vLLM running inside a container to serve Granite models.
Prerequisites
- A container runtime such as Docker Desktop or Podman
- An NVIDIA GPU with drivers installed
1. Pull the vLLM container image
Pull the vllm/vllm-openai:v0.10.2 image. (Granite models require vLLM version 0.10.2 or later.)
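For example, with Docker (use podman in place of docker if you run Podman):

```bash
docker pull vllm/vllm-openai:v0.10.2
```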
2. Run Granite in the container
Run the container with your Hugging Face cache mounted. You can pre-download the model into ~/.cache/huggingface; if not pre-downloaded, the model will be fetched automatically when vLLM starts and cached in the mounted directory.
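A minimal sketch of the run command is shown below. It assumes the default vLLM port 8000, the standard Hugging Face cache location, and uses ibm-granite/granite-4.0-h-small as a placeholder model name; substitute the Granite model you intend to serve.

```bash
# Optional: pre-download the model into the local Hugging Face cache
# (the model name is a placeholder; substitute the Granite model you want to serve)
huggingface-cli download ibm-granite/granite-4.0-h-small

# Serve the model with vLLM; the host cache is mounted so downloads are reused
docker run --rm --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:v0.10.2 \
  --model ibm-granite/granite-4.0-h-small
```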
3. Run a sample request
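Once the server is up, you can send a request to its OpenAI-compatible endpoint. The sketch below assumes the server from the previous step is listening on localhost:8000 and uses the same placeholder model name:

```bash
# Send a chat completion request to the running vLLM server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ibm-granite/granite-4.0-h-small",
        "messages": [{"role": "user", "content": "What is the capital of France?"}]
      }'
```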
4. Enable tool calling and other extended capabilities
To run vLLM with the Granite 4.0 models and enable capabilities such as tool calling, add the parameters --tool-call-parser hermes and --enable-auto-tool-choice. Refer to the vLLM documentation for more details on these and other parameters.
Now run the container with these parameters appended to the docker run command shared in Section 2:
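Building on the sketch from Section 2 (same placeholder model name and port assumptions):

```bash
# Same run command as in Section 2, with the tool-calling parameters added
docker run --rm --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:v0.10.2 \
  --model ibm-granite/granite-4.0-h-small \
  --tool-call-parser hermes \
  --enable-auto-tool-choice
```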