Overview

Granite 4.0 introduces a hybrid Mamba-2/transformer architecture, with a Mixture-of-Experts (MoE) strategy in select models, delivering more than 70% lower memory requirements and 2x faster inference compared to similar models, particularly in multi-session and long-context scenarios. The models deliver strong performance across benchmarks, with Granite 4.0 Small achieving industry-leading results in key agentic tasks such as instruction following and function calling. These efficiencies make the models well suited to a wide range of use cases, including retrieval-augmented generation (RAG), multi-agent workflows, and edge deployments. Granite 4.0 is released under Apache 2.0, cryptographically signed for authenticity, and the first open model family certified under ISO 42001. For more details, see the Granite 4.0 Release Blog.

| | Granite Small (Granite-H-Small) | Granite Tiny (Granite-H-Tiny) | Granite Micro (Granite-H-Micro) | Granite Micro (Granite-Micro) |
|---|---|---|---|---|
| Architecture Type | Hybrid, Mixture of Experts | Hybrid, Mixture of Experts | Hybrid, Dense | Traditional, Dense |
| Model Size | 32B total parameters, 9B activated parameters | 7B total parameters, 1B activated parameters | 3B total parameters | 3B total parameters |
| Intended Use | Workhorse model for key enterprise tasks like RAG and agents | Designed for low latency, edge, and local applications, and as a building block to perform key tasks (like function calling) quickly within agentic workflows | Designed for low latency, edge, and local applications, and as a building block to perform key tasks (like function calling) quickly within agentic workflows | Alternative option for users when Mamba-2 support is not yet optimized (e.g. llama.cpp, PEFT, etc.) |
| Memory Requirements (8-bit) | 33 GB | 8 GB | 4 GB | 9 GB |
| Example Hardware | NVIDIA L40S | RTX 3060 12GB | Raspberry Pi 8GB | RTX 3060 12GB |
Model deployment specifications: 8-bit, 128K context length, and batch=1.

Inference Examples

Basic Inference

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model_path = "ibm-granite/granite-4.0-h-tiny" 
tokenizer = AutoTokenizer.from_pretrained(model_path)

# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()

# change input text as desired
chat = [
{ "role": "user", "content": "What is the largest ocean on Earth?"},
]

chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# tokenize the text
input_tokens = tokenizer(chat, return_tensors="pt").to(device)
# generate output tokens
output = model.generate(**input_tokens, 
                    max_new_tokens=150, temperature=0)
# decode output tokens into text
output = tokenizer.batch_decode(output)

print(output[0])
The Granite 4.0 models work best with the temperature set to 0 for most inference tasks.
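In recent transformers releases the temperature flag only takes effect when sampling is enabled, so an equivalent, warning-free way to request the same deterministic behavior is to disable sampling explicitly; a minimal sketch:

# Sketch: greedy (deterministic) decoding. With sampling disabled, this has the
# same intent as temperature=0 but avoids the "temperature is ignored" warning.
output = model.generate(**input_tokens, max_new_tokens=150, do_sample=False)
print(tokenizer.batch_decode(output)[0])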

Tool calling

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
# model_path = ""
model_path = "ibm-granite/granite-4.0-micro"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()

chat=[
    {"role": "user", "content": "What's the current weather in New York?"},
    {"role": "assistant", 
     "content": "", 
     "tool_calls": [
         {
             "function": {
                 "name": "get_current_weather",
                 "arguments": {"city": "New York"}
             }
         }
     ]
    },
    {"role": "tool", "content": "New York is sunny with a temperature of 30°C."},
    {"role": "user", "content": "OK, Now tell me what's the weather like in Bengaluru at this moment?"}
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location" : {
                        "description": "The city and state, e.g. San Francisco, CA",
                        "type": "string",
                        },
                },
                "required" : ["location"]
            }
        } 
    },
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Retrieves the current stock price for a given ticker symbol. The ticker symbol must be a valid symbol for a publicly traded company on a major US stock exchange like NYSE or NASDAQ. The tool will return the latest trade price in USD. It should be used when the user asks about the current or most recent price of a specific stock. It will not provide any other information about the stock or company.",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticker" : {
                        "description": "The stock ticker symbol, e.g. AAPL for Apple Inc.",
                        "type": "string",
                        },
                }
            }
        } 
    }
]
    
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True, tools=tools)

# tokenize the text
input_tokens = tokenizer(chat, return_tensors="pt").to(device)
# generate output tokens
output = model.generate(**input_tokens, 
                        max_new_tokens=100, temperature=0)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# print output
print(output[0])
Consider the following points when building prompts for tool-calling tasks:
  • When a list of tools is supplied, the chat template automatically formats this list as a system prompt. See the Chat Template Design section for more information about system prompt design in tool-calling scenarios.
  • Please follow OpenAI’s function definition schema to define tools.
  • The Granite 4.0 chat template automatically formats tool calls between <tool_call> and </tool_call> tags within the assistant turn (see the parsing sketch after this list). Refer to the Chat Template Design section for examples.
  • The Granite 4.0 chat template converts content from the tool role into a user role, where tool responses appear between <tool_response> and </tool_response> tags. This conversion occurs automatically when using libraries that apply the chat template for you. However, users who build prompts directly in Jinja must ensure that tool responses are formatted according to the Granite 4.0 chat template. Please refer to the Chat Template Design section for examples.
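Since tool calls appear between <tool_call> and </tool_call> tags, a lightweight way to recover them is to extract the tagged spans from the generated text and parse each one as JSON. The snippet below is a sketch; it assumes each span holds a JSON object with "name" and "arguments" keys, which you should verify against the Chat Template Design section.

import json
import re

def extract_tool_calls(text):
    """Sketch: pull out <tool_call>...</tool_call> spans and parse them as JSON.
    Assumes each span holds an object such as
    {"name": "get_current_weather", "arguments": {"location": "Bengaluru"}}."""
    calls = []
    for span in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        try:
            calls.append(json.loads(span.strip()))
        except json.JSONDecodeError:
            pass  # ignore malformed spans
    return calls

# Decode only the newly generated tokens so tool calls from earlier assistant
# turns in the prompt are not picked up again.
generated_ids = model.generate(**input_tokens, max_new_tokens=100, do_sample=False)
new_text = tokenizer.decode(generated_ids[0, input_tokens["input_ids"].shape[1]:])
for call in extract_tool_calls(new_text):
    print(call.get("name"), call.get("arguments"))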

RAG

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model_path = "ibm-granite/granite-4.0-h-tiny" 

tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()

chat=[
    {"role": "user", "content": "Could you please tell me what the first Bridget Jones's movie is about?, please be brief in your response."}
]

documents=[
    {
        "doc_id": 1,
        "title": "Bridget Jones: The Edge of Reason (2004)",
        "text": "Bridget Jones: The Edge of Reason (2004) - Bridget is currently living a happy life with her lawyer boyfriend Mark Darcy, however not only does she start to become threatened and jealous of Mark's new young intern, she is angered by the fact Mark is a Conservative voter. With so many issues already at hand, things get worse for Bridget as her ex-lover, Daniel Cleaver, re-enters her life; the only help she has are her friends and her reliable diary.",
        "source": ""
    },
    {
        "doc_id": 2,
        "title": "Bridget Jones's Baby (2016)",
        "text": "Bridget Jones's Baby (2016) - Bridget Jones is struggling with her current state of life, including her break up with her love Mark Darcy. As she pushes forward and works hard to find fulfilment in her life seems to do wonders until she meets a dashing and handsome American named Jack Quant. Things from then on go great, until she discovers that she is pregnant but the biggest twist of all, she does not know if Mark or Jack is the father of her child.",
        "source": ""
    },
    {
        "doc_id": 3,
        "title": "Bridget Jones's Diary (2001)",
        "text": "Bridget Jones's Diary (2001) - Bridget Jones is a binge drinking and chain smoking thirty-something British woman trying to keep her love life in order while also dealing with her job as a publisher. When she attends a Christmas party with her parents, they try to set her up with their neighbours' son, Mark. After being snubbed by Mark, she starts to fall for her boss Daniel, a handsome man who begins to send her suggestive e-mails that leads to a dinner date. Daniel reveals that he and Mark attended college together, in that time Mark had an affair with his fiancée. Bridget decides to get a new job as a TV presenter after finding Daniel being frisky with a colleague. At a dinner party, she runs into Mark who expresses his affection for her, Daniel claims he wants Bridget back, the two fight over her and Bridget must make a decision who she wants to be with.",
        "source": ""
    },
]

chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True, documents=documents)

# tokenize the text
input_tokens = tokenizer(chat, return_tensors="pt").to(device)
# generate output tokens
output = model.generate(**input_tokens, 
                        max_new_tokens=800, temperature=0)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# print output
print(output[0])
Consider the following points when building prompts for RAG tasks:
  • The chat template lists documents as part of the system turn between <documents> and </documents> tags. Please refer to the Chat Template Design section for further details.
  • Documents are provided to the model as a list of dictionaries, with each dictionary representing a document. We recommend formatting documents as follows (a sketch for mapping retrieved passages into this format appears after the example):
    {
        "doc_id": 1,
        "title": "History Document Title",
        "text": "From the early 12th century, French builders developed the Gothic style, marked by the use of rib vaults, pointed arches, flying buttresses, and large stained glass windows. It was used mainly in churches and cathedrals, and continued in use until the 16th century in much of Europe. Classic examples of Gothic architecture include Chartres Cathedral and Reims Cathedral in France as well as Salisbury Cathedral in England. Stained glass became a crucial element in the design of churches, which continued to use extensive wall-paintings, now almost all lost.",
        "source": ""
    }
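If your retrieval step returns plain passages, a small helper can map them into this structure before calling apply_chat_template. The snippet below is a sketch: the retrieved_chunks list and its "title"/"content" fields are hypothetical placeholders for whatever your retriever returns.

# Sketch: map retrieved passages into the documents format shown above.
# `retrieved_chunks` and its "title"/"content" fields are hypothetical
# placeholders for your retriever's output.
retrieved_chunks = [
    {"title": "History Document Title", "content": "From the early 12th century, ..."},
]

documents = [
    {"doc_id": i + 1, "title": c["title"], "text": c["content"], "source": ""}
    for i, c in enumerate(retrieved_chunks)
]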

FIM

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
# model_path = ""
model_path = "ibm-granite/granite-4.0-micro"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()

prompt = """<|fim_prefix|>
def fibonacci(n):
    result =
<|fim_suffix|>
    return result
<|fim_middle|>"""

chat = [
    { "role": "user", "content": prompt},
]

chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# tokenize the text
input_tokens = tokenizer(chat, return_tensors="pt").to(device)
# generate output tokens
output = model.generate(**input_tokens, 
                        max_new_tokens=50, temperature=0)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# print output
print(output[0])
Consider the following points to build prompts for FIM coding tasks:
  • The tags supported for fill-in-the-middle (FIM) code completion are: <|fim_prefix|>, <|fim_middle|>, and <|fim_suffix|>. Make sure to use the correct FIM tags when using Granite 4.0 models for FIM code completions.
  • Place the <|fim_prefix|> tag before the code that precedes the missing section.
  • Place the <|fim_suffix|> tag before the code that follows the missing section.
  • End your prompt with the tag <|fim_middle|> to indicate to the model something is missing in the code snippet.
  • Completion of basic programming concepts (e.g., function, method, conditionals, loops) is covered for various programming languages (e.g., python, c/c++, go, java).
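To recover just the infilled code (rather than the prompt echoed back together with the completion), you can decode only the tokens generated after the prompt; a minimal sketch:

# Sketch: decode only the newly generated tokens so the result is just the
# model's fill-in for the missing middle, without the echoed prompt.
output_ids = model.generate(**input_tokens, max_new_tokens=50, do_sample=False)
prompt_len = input_tokens["input_ids"].shape[1]
middle = tokenizer.decode(output_ids[0, prompt_len:], skip_special_tokens=True)
print(middle)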

JSON output

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model_path = "ibm-granite/granite-4.0-h-tiny" 

tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()

chat = [
    {
      "role": "system",
      "content": "You are a helpful assistant that answers in JSON. Here's the json schema you must adhere to:\n<schema>\n{\"title\": \"LeisureActivityBooking\", \"type\": \"object\", \"properties\": {\"activityID\": {\"title\": \"Activity ID\", \"type\": \"string\"}, \"participantInfo\": {\"title\": \"Participant Info\", \"type\": \"object\", \"properties\": {\"participantName\": {\"title\": \"Participant Name\", \"type\": \"string\"}, \"age\": {\"title\": \"Age\", \"type\": \"integer\"}}, \"required\": [\"participantName\", \"age\"]}, \"activityDate\": {\"title\": \"Activity Date\", \"type\": \"string\", \"format\": \"date\"}, \"equipmentNeeded\": {\"title\": \"Equipment Needed\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"activityID\", \"participantInfo\", \"activityDate\"]}\n</schema>\n"
    },
    {
      "role": "user",
      "content": "I'm planning a kayaking trip for my friend's birthday on April 15th, and I need to book it through your leisure activity service. The activity will be for my friend, Jamie Patterson, who is 27 years old. We'll need two kayaks, paddles, and safety vests. The activity ID for this booking is 'KAYAK-0423'. The date we're looking to book the kayaking activity is specifically on the 15th of April, 2023. To summarize, the equipment needed for this adventure includes two kayaks, a pair of paddles, and two safety vests to ensure a safe and enjoyable experience on the water."
    }
  ]

chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# tokenize the text
input_tokens = tokenizer(chat, return_tensors="pt").to(device)
# generate output tokens
output = model.generate(**input_tokens, 
                        max_new_tokens=200, temperature=0)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# print output
print(output[0])
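To verify that the generated answer actually conforms to the schema given in the system prompt, you can parse and validate the completion. The sketch below assumes the third-party jsonschema package is installed, that completion_text holds only the generated JSON (the decoded output with the prompt portion stripped), and that booking_schema is the same schema dict embedded in the system prompt above.

import json

from jsonschema import ValidationError, validate  # assumes `pip install jsonschema`

def check_booking(completion_text, booking_schema):
    """Sketch: parse the generated completion and validate it against the schema."""
    try:
        booking = json.loads(completion_text)
        validate(instance=booking, schema=booking_schema)
        return booking
    except (json.JSONDecodeError, ValidationError) as err:
        print("Output did not match the schema:", err)
        return None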

Backward Compatibility

Prompt templates designed for basic inference tasks should remain compatible with Granite 4.0 models. However, more complex templates may require a migration process. To take full advantage of Granite 4.0 models, please refer to this prompt engineering guide to create new templates or migrate existing ones. Some points to consider while migrating templates:
  • The chat template in this version handles lists of tools and documents differently. This update is applied automatically when using the apply_chat_template function from the transformers library to construct prompts. However, users who craft Jinja templates manually must ensure they follow the updated chat template format. Please refer to the Chat Template Design section for more details.
  • Controls for length and originality have been deprecated and are no longer supported in this version.