- **View model cards**: View model details on Hugging Face
- **Try our Granite Agents**: Leverage Granite to search the web and write research reports
- **Experiment with Granite Locally**: Download and run Granite locally with Ollama
- **Get cooking with Granite!**: Check out these recipes to start using Granite
Overview
Granite 4.0 introduces a hybrid Mamba-2/transformer architecture, with a Mixture-of-Experts (MoE) strategy in select models, delivering more than 70% lower memory requirements and 2x faster inference compared to similar models, particularly in multi-session and long-context scenarios. The models deliver strong performance across benchmarks, with Granite 4.0 Small achieving industry-leading results in key agentic tasks like instruction following and function calling. These efficiencies make the models well-suited for a wide range of use cases such as retrieval-augmented generation (RAG), multi-agent workflows, and edge deployments. Granite 4.0 is released under Apache 2.0, cryptographically signed for authenticity, and is the first open model family certified under ISO 42001. See the Granite 4.0 Release Blog for details.

| | Granite-H-Small | Granite-H-Tiny | Granite-H-Micro | Granite-Micro |
|---|---|---|---|---|
| Architecture Type | Hybrid, Mixture of Experts | Hybrid, Mixture of Experts | Hybrid, Dense | Traditional, Dense |
| Model Size | 32B total parameters, 9B activated parameters | 7B total parameters, 1B activated parameters | 3B total parameters | 3B total parameters |
| Intended Use | Workhorse model for key enterprise tasks like RAG and agents | Designed for low latency, edge, and local applications, and as a building block to perform key tasks (like function calling) quickly within agentic workflows | Designed for low latency, edge, and local applications, and as a building block to perform key tasks (like function calling) quickly within agentic workflows | Alternative option for users when Mamba-2 support is not yet optimized (e.g., llama.cpp, PEFT) |
| Memory Requirements (8-bit) | 33 GB | 8 GB | 4 GB | 9 GB |
| Example Hardware | NVIDIA L40S | RTX 3060 12GB | Raspberry Pi 8GB | RTX 3060 12GB |
Model deployment specifications: 8-bit quantization, 128K context length, and batch=1.

Inference Examples
Basic Inference
The Granite 4.0 models work best with temperature set to 0 for most inference tasks.
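As a minimal sketch of basic inference, assuming the `ibm-granite/granite-4.0-h-micro` checkpoint ID on Hugging Face (swap in whichever Granite 4.0 variant you want to run); greedy decoding below corresponds to the temperature-0 recommendation:

```python
# Basic inference sketch with transformers; the model ID is an assumption --
# substitute the Granite 4.0 variant you want to run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-4.0-h-micro"  # assumed checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype=torch.bfloat16
)

chat = [{"role": "user", "content": "Name one advantage of a hybrid Mamba-2/transformer design."}]
input_ids = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# do_sample=False gives greedy decoding, matching the temperature-0 advice above.
output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```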
Tool calling
- When a list of tools is supplied, the chat template automatically formats this list as a system prompt. See the Chat Template Design section for more information about system prompt design in tool-calling scenarios.
- Please follow OpenAI's function definition schema to define tools (see the sketch after this list).
- The Granite 4.0 chat template automatically returns tool calls between `<tool_call>` and `</tool_call>` tags within the assistant turn. Refer to the Chat Template Design section for examples.
- The Granite 4.0 chat template converts content from the `tool` role into a user role, where tool responses appear between `<tool_response>` and `</tool_response>` tags. This conversion occurs automatically when using libraries that apply the chat template for you. However, users who build prompts directly in `Jinja` must ensure that tool responses are formatted according to the Granite 4.0 chat template. Please refer to the Chat Template Design section for examples.
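The sketch below shows one way to pass OpenAI-schema tool definitions through `apply_chat_template`; the `get_weather` function and the checkpoint ID are assumptions for illustration:

```python
# Tool-calling sketch; get_weather is a hypothetical tool and the model ID
# is an assumed Granite 4.0 checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-h-micro")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical function name
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        },
    }
]

chat = [{"role": "user", "content": "What is the weather in Madrid right now?"}]

# The chat template folds the tool list into the system prompt; any tool call
# the model emits will be wrapped in <tool_call>...</tool_call> tags.
prompt = tokenizer.apply_chat_template(
    chat, tools=tools, add_generation_prompt=True, tokenize=False
)
print(prompt)  # inspect the rendered prompt before generating
```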
RAG
- The chat template lists documents as part of the `system` turn between `<documents>` and `</documents>` tags. Please refer to the Chat Template Design section for further details.
- Documents are provided to the model as a list of dictionaries, with each dictionary representing a document. We recommend formatting documents as follows:
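A minimal sketch of that document format, assuming `title` and `text` field names (check the Chat Template Design section for the exact fields the template expects) and the `documents` keyword of `apply_chat_template`:

```python
# RAG sketch; the "title"/"text" field names and the model ID are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-h-micro")

documents = [
    {"title": "Granite 4.0 overview", "text": "Granite 4.0 pairs Mamba-2 layers with transformer layers..."},
    {"title": "Deployment notes", "text": "Granite-H-Micro runs in roughly 4 GB at 8-bit precision..."},
]

chat = [{"role": "user", "content": "How much memory does Granite-H-Micro need at 8-bit?"}]

# The template renders the list inside <documents>...</documents> in the system turn.
prompt = tokenizer.apply_chat_template(
    chat, documents=documents, add_generation_prompt=True, tokenize=False
)
print(prompt)
```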
FIM
- The tags supported for fill-in-the-middle (FIM) code completion are `<|fim_prefix|>`, `<|fim_middle|>`, and `<|fim_suffix|>`. Make sure to use the correct FIM tags when using Granite 4.0 models for FIM code completion.
- Prepend the code before the missing part with the tag `<|fim_prefix|>`.
- Prepend the code after the missing part with the tag `<|fim_suffix|>`.
- End your prompt with the tag `<|fim_middle|>` to indicate to the model that something is missing in the code snippet (see the sketch after this list).
- Completion of basic programming concepts (e.g., functions, methods, conditionals, loops) is covered for various programming languages (e.g., Python, C/C++, Go, Java).
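Putting those tags together, a minimal FIM sketch (the checkpoint ID is again an assumption):

```python
# FIM sketch: the model fills in the span between the prefix and the suffix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-4.0-h-micro"  # assumed checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype=torch.bfloat16
)

prefix = "def celsius_to_fahrenheit(c):\n"
suffix = "\n    return f\n"

# <|fim_prefix|> + code before the gap, <|fim_suffix|> + code after the gap,
# then <|fim_middle|> asks the model to produce the missing middle.
fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```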
JSON output
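As a minimal sketch, assuming a plain system instruction is sufficient to elicit JSON (greedy decoding keeps the output deterministic); it reuses the `tokenizer` and `model` loaded in the basic inference sketch above:

```python
# JSON-output sketch, assuming a system instruction is enough to steer the
# model; the requested schema here is illustrative.
chat = [
    {
        "role": "system",
        "content": 'Respond only with valid JSON of the form {"city": string, "country": string}.',
    },
    {"role": "user", "content": "Where is the Eiffel Tower located?"},
]
input_ids = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```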
Backward Compatibility
Prompt templates designed for basic inference tasks should remain compatible with Granite 4.0 models. However, more complex templates may require a migration process. To take full advantage of Granite 4.0 models, please refer to this prompt engineering guide to create new templates or migrate existing ones. Some points to consider while migrating templates:

- The chat template in this version handles lists of tools and documents differently. This update is applied automatically when using the `apply_chat_template` function from the transformers library to construct prompts. However, users who craft `Jinja` templates manually must ensure they follow the updated chat template format. Please refer to the Chat Template Design section for more details.
- Controls for `length` and `originality` have been deprecated and are no longer supported in this version.