- View model cards: view model details on Hugging Face
- Try our Granite agents: leverage Granite to search the web and write research reports
- Experiment with Granite locally: download and run Granite locally with Ollama
- Get cooking with Granite: check out these recipes to start using Granite
Overview
Granite 4.0 introduces a hybrid Mamba-2/transformer architecture, with a Mixture-of-Experts (MoE) strategy in select models, delivering more than 70% lower memory requirements and 2x faster inference compared to similar models, particularly in multi-session and long-context scenarios. The models deliver strong performance across benchmarks, with Granite 4.0 Small achieving industry-leading results in key agentic tasks like instruction following and function calling. These efficiencies make the models well suited to a wide range of use cases such as retrieval-augmented generation (RAG), multi-agent workflows, and edge deployments. Granite 4.0 is released under Apache 2.0, cryptographically signed for authenticity, and is the first open model family certified under ISO 42001.

- Granite 4.0 Release Blog
- Granite 4.0 Nano Release Blog
| Model | Architecture Type | Model Size | Intended Use |
|---|---|---|---|
| Granite-4.0-H-Small | Hybrid, Mixture of Experts | 32B total / 9B activated | Workhorse model for key enterprise tasks like RAG and agents |
| Granite-4.0-H-Tiny | Hybrid, Mixture of Experts | 7B total / 1B activated | Designed for low latency and local applications, particularly where the task has long prefill or other scenarios where an MoE model is desired |
| Granite-4.0-H-Micro | Hybrid, Dense | 3B total | Designed for low latency and local applications, and as a building block to perform key tasks (like function calling) quickly within agentic workflows |
| Granite-4.0-Micro | Traditional, Dense | 3B total | Alternative option for users when Mamba-2 support is not yet optimized (e.g., llama.cpp, PEFT) |
| Granite-4.0-H-1B | Hybrid, Dense | 1.5B total | Ideal for edge, on-device, and latency-sensitive use cases |
| Granite-4.0-1B | Traditional, Dense | 1B total | Alternative option for users when Mamba-2 support is not yet optimized (e.g., llama.cpp, PEFT) |
| Granite-4.0-H-350M | Hybrid, Dense | 350M total | Similar use cases to Granite-4.0-H-1B, but even smaller and cheaper to run |
| Granite-4.0-350M | Traditional, Dense | 350M total | Alternative option for users when Mamba-2 support is not yet optimized (e.g., llama.cpp, PEFT) |
Inference Examples
Basic Inference
The Granite 4.0 models work best with the temperature set to 0 for most inference tasks.
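A minimal inference sketch with Hugging Face transformers is below. The checkpoint name is an assumption for illustration; substitute the Granite 4.0 variant you actually use. Greedy decoding (`do_sample=False`) is the transformers equivalent of temperature 0.

```python
# Minimal sketch of basic inference with Hugging Face transformers.
# MODEL_ID is an assumed checkpoint name; substitute the variant you use.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ibm-granite/granite-4.0-micro"  # assumed model id

def generate_greedy(prompt: str, max_new_tokens: int = 256) -> str:
    """Run greedy (temperature-0) decoding, as recommended for most tasks."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # do_sample=False is deterministic decoding, equivalent to temperature 0.
    output = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
```

Call `generate_greedy("Summarize what RAG is in one sentence.")` to try it; the same `do_sample=False` setting applies to any Granite 4.0 variant.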
Tool calling
- When a list of tools is supplied, the chat template automatically formats this list as a system prompt. See the Chat Template Design section for more information about system prompt design in tool-calling scenarios.
- Follow OpenAI's function definition schema to define tools.
- The Granite 4.0 chat template returns tool calls between `<tool_call>` and `</tool_call>` tags within the assistant turn. Refer to the Chat Template Design section for examples.
- The Granite 4.0 chat template converts content from the `tool` role into a user role, with tool responses appearing between `<tool_response>` and `</tool_response>` tags. This conversion occurs automatically when using libraries that apply the chat template for you. However, users who build prompts directly in `Jinja` must ensure that tool responses are formatted according to the Granite 4.0 chat template. Refer to the Chat Template Design section for examples.
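The points above can be sketched as follows. The `get_weather` tool and the exact JSON payload inside the `<tool_call>` tags are illustrative assumptions; the tool definition itself follows OpenAI's function schema, and the list would be passed to `tokenizer.apply_chat_template(..., tools=tools)`.

```python
# Sketch: a tool in OpenAI's function definition schema, plus a parser for
# <tool_call> spans in the assistant's output. get_weather is hypothetical.
import json
import re

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool name
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]
# Pass tools=tools to tokenizer.apply_chat_template(...); the chat template
# folds the list into the system prompt automatically.

# Illustrative assistant output containing a tool call:
assistant_text = (
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Boston"}}</tool_call>'
)

def parse_tool_calls(text: str) -> list[dict]:
    """Extract JSON payloads from <tool_call>...</tool_call> spans."""
    spans = re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    return [json.loads(s) for s in spans]

calls = parse_tool_calls(assistant_text)
print(calls[0]["name"])  # -> get_weather
```

After parsing, the application would execute the named function and send the result back in a `tool` role message, which the chat template wraps in `<tool_response>` tags.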
RAG
- The chat template lists documents as part of the `system` turn between `<documents>` and `</documents>` tags. Refer to the Chat Template Design section for further details.
- Documents are provided to the model as a list of dictionaries, with each dictionary representing a document.
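A sketch of the document list is below. The field names (`doc_id`, `text`) are an assumption based on common Granite chat-template conventions; verify them against the model card before use.

```python
# Sketch: documents for RAG as a list of dictionaries, one dict per document.
# Field names ("doc_id", "text") are assumed; check the model card.
documents = [
    {"doc_id": "1", "text": "Mamba-2 is a state-space sequence model."},
    {"doc_id": "2", "text": "MoE layers activate only a subset of parameters per token."},
]

# With transformers, pass the list directly; the chat template wraps it in
# <documents> ... </documents> inside the system turn:
# prompt = tokenizer.apply_chat_template(
#     messages, documents=documents, add_generation_prompt=True, tokenize=False
# )
```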
FIM
- The tags supported for fill-in-the-middle (FIM) code completion are `<|fim_prefix|>`, `<|fim_middle|>`, and `<|fim_suffix|>`. Make sure to use the correct FIM tags when using Granite 4.0 models for FIM code completion.
- Prepend the code before the missing part with the `<|fim_prefix|>` tag.
- Prepend the code after the missing part with the `<|fim_suffix|>` tag.
- End your prompt with the `<|fim_middle|>` tag to indicate to the model that something is missing in the code snippet.
- Completion of basic programming concepts (e.g., functions, methods, conditionals, loops) is covered for various programming languages (e.g., Python, C/C++, Go, Java).
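Putting the steps above together, a FIM prompt can be assembled as plain string concatenation; the code snippet being completed is illustrative.

```python
# Sketch: assembling a fill-in-the-middle prompt with the Granite FIM tags.
prefix = "def add(a, b):\n    "   # code before the missing part
suffix = "\n    return result\n"  # code after the missing part

# Order per the docs: <|fim_prefix|> prefix <|fim_suffix|> suffix <|fim_middle|>
fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
print(fim_prompt)
```

The model is then expected to generate the missing middle (here, something like `result = a + b`).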
JSON output
Backward Compatibility
Prompt templates designed for basic inference tasks should remain compatible with Granite 4.0 models; more complex templates, however, may require migration. To take full advantage of Granite 4.0 models, refer to this prompt engineering guide when creating new templates or migrating existing ones. Points to consider while migrating templates:

- The chat template in this version handles lists of tools and documents differently. The update is applied automatically when using the `apply_chat_template` function from the transformers library to construct prompts. However, users who craft `Jinja` templates manually must ensure they follow the updated chat template format. Refer to the Chat Template Design section for more details.
- Controls for `length` and `originality` have been deprecated and are no longer supported in this version.
