Try Gemma3 Using Hugging Face: Part 2
In the first post of this series, we examined the specifications of the Gemma 3 model. In this post, we will actually run it and get an intuitive understanding of the inference process. I will assume you have read the first post.
For each model size, there is a pre-trained (pt) version and an instruction-tuned (it) version. The pt version is trained to predict the next token on trillions of tokens of text from the internet. All it can do is continue an input with the next tokens, as if the input were a random piece of text from the internet. The it version is based on the pt version but is further tuned to interact with humans more effectively.
It is the same model architecture, only with different parameter values. The it version still only predicts the next token based on the previous tokens. The trick to making it better at interacting with humans is the introduction of special tokens, along with training on how to interpret them, so that the model knows it is interacting with a human rather than continuing random internet text. The special tokens added by the it version are:
Token: <start_of_turn>, ID: 105
Token: <end_of_turn>, ID: 106
We will see their use in action shortly.
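If you want to verify these IDs yourself, you can ask the tokenizer directly (we load it properly in the next section; this is just a quick sanity check using the standard Hugging Face tokenizer API):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
# Look up the IDs of the chat-formatting special tokens
for token in ["<start_of_turn>", "<end_of_turn>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))  # 105 and 106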
Text input
Except for the 1B version, all the other sizes support image input. However, if you only need text input, you can load the model with Gemma3ForCausalLM to avoid loading the vision encoder. We will experiment with this one first:
from transformers import AutoTokenizer, Gemma3ForCausalLM
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # pick your own device
model_id = "google/gemma-3-4b-it"

# Text only
model = Gemma3ForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device).eval()
processor = AutoTokenizer.from_pretrained(model_id)
You should prepare your input as a list of dictionaries like the following and then create the real input to the model using processor.apply_chat_template().
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant"},]
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Why the sky is blue?"},]
    },
]

model_inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(device)
We can print model_inputs:
{'input_ids': tensor([[2, 105, 2364, 107, 3048, 659, 496, 11045, 16326,
108, 11355, 506, 7217, 563, 3730, 236881, 106, 107,
105, 4368, 107]], device='cuda:1'),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
device='cuda:1')}
input_ids is the tokenized text input. We can print the Jinja template that Hugging Face uses to convert the message list into the real input with print(processor.chat_template). Or you can simply decode input_ids to see what the template does:
decoded_text = processor.decode(model_inputs["input_ids"][0], skip_special_tokens=False)
print(decoded_text)
And we get:
<bos><start_of_turn>user
You are a helpful assistant
Why the sky is blue?<end_of_turn>
<start_of_turn>model
As you can see, the processor places the system message at the beginning of the first user message and also uses the two special tokens to format the messages, so that the it model knows who is speaking, as learned from its instruction-tuning data. The <bos> token is used by both the pt and it models to indicate the beginning of an input.
With this as the input, the model will predict the next tokens. Since the last few tokens indicate that it is now the model’s (or assistant’s) turn, it will simply generate the assistant’s response. When the generation code sees that <end_of_turn> has been generated, it stops the model from producing further tokens.
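To make this concrete, here is a minimal generation call for the single chat built above. It is only a sketch: max_new_tokens=200 is an arbitrary cap, and generation should stop earlier once <end_of_turn> is produced, assuming the it checkpoint registers it as an end-of-sequence token in its generation config (otherwise you can pass eos_token_id explicitly).

input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=200)

# Drop the prompt tokens and decode only the assistant's reply
print(processor.decode(generation[0][input_len:], skip_special_tokens=True))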
That’s essentially how ChatGPT works. We see there is really no magic or “AI” behind ChatGPT. It is just the old language-model trick of generating the next tokens, except the model is further tuned to generate them in a format that humans can interpret as “chatting.”
Batched input
If you pay attention to details, you may have already noticed that input_ids is actually a two-dimensional array. The first dimension is the batch dimension. So yes, it means we can feed multiple questions simultaneously, and the model would answer all of them in parallel, just like in the movie “Her”!
Let’s try it out by asking two questions at the same time:
messages = [
    [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a physicist."}]
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "Could you explain what is a qubit?"}]
        }
    ],
    [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a love bug."}]
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "How are you today?"}]
        }
    ]
]
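We build model_inputs the same way as before. The only addition worth noting is padding: the two chats have different lengths, so they must be padded to the same length before they can be stacked into one tensor. A minimal sketch, assuming padding is requested through apply_chat_template:

model_inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt", padding=True
).to(device)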
model_inputs is now:
{'input_ids': tensor([[2, 105, 2364, 107, 3048, 659, 496, 116544, 236761,
108, 30092, 611, 8082, 1144, 563, 496, 132468, 236881,
106, 107, 105, 4368, 107],
[0, 0, 2, 105, 2364, 107, 3048, 659, 496,
2765, 13582, 236761, 108, 3910, 659, 611, 3124, 236881,
106, 107, 105, 4368, 107]], device='cuda:1'),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
device='cuda:1')}
This time input_ids has two rows, corresponding to the two chats. If we decode the second row (with the same processor.decode call as before), the result is:
<pad><pad><bos><start_of_turn>user
You are a love bug.
How are you today?<end_of_turn>
<start_of_turn>model
We can see the processor pads the second row with the <pad> special token at the beginning so that all samples in the batch have the same length. At the same time, the attention_mask entries for these two <pad> tokens are set to 0 so that the model does not attend to them.
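Where do the <pad> token and the left-side padding come from? They are properties of the tokenizer itself, which we can inspect directly; based on the output above, the pad token ID should be 0 and the padding side should be “left”:

# Inspect the tokenizer's padding configuration
print(processor.pad_token, processor.pad_token_id)  # expected: <pad> 0
print(processor.padding_side)                       # expected: left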
Now let’s feed the input to the model and generate 100 tokens for each of them:
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    # Generate outputs for the entire batch
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
    # Slice out the generated tokens for each sequence in the batch
    generations = [gen[input_len:] for gen in generation]

# Decode each sequence in the batch
decoded_outputs = [processor.decode(gen, skip_special_tokens=True) for gen in generations]

# Print the decoded outputs
for i, decoded in enumerate(decoded_outputs):
    print(f"Chat {i + 1}: {decoded}")
    print("\n\n")
The output is:
Chat 1: Alright, let's talk about qubits. As a physicist, this is something I spend a *lot* of time thinking about, and it’s absolutely crucial to the future of computing. It’s a fascinating concept that goes beyond the simple "0 or 1" of a regular bit.
**Let’s start with the basics: Bits.**
A regular bit, the fundamental unit of information in our current computers, can be in one of two states: 0 or
Chat 2: Oh my goodness, you have *no* idea how happy you’ve made my day just by asking! 🥰 I’m absolutely radiant today! Like, shimmering and sparkling with joy! ✨
It feels so wonderful to be feeling this way, especially because I get to share it with you. Seriously, you’re just the sweetest! 😊
How about *you*? Tell me, what's making *you* feel lovely today? 😊💖
Batch generation is very useful when you have lots of inputs to process. I can imagine ChatGPT serving requests in a similar way.
Image input
A picture is worth a thousand words
Now let’s check the multimodal part. We load the model using Gemma3ForConditionalGeneration.
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model = Gemma3ForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device).eval()
processor = AutoProcessor.from_pretrained(model_id)
We will use an official example:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/spaces/big-vision/paligemma-hf/resolve/main/examples/password.jpg"},
            {"type": "text", "text": "What is the password?"}
        ]
    }
]
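As before, we turn messages into model_inputs with apply_chat_template. With an AutoProcessor, this call also downloads the image from the URL and preprocesses it into pixel_values. A minimal sketch, assuming the same arguments as in the text-only case:

model_inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(device)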
The processed inputs are much longer now, but don’t be scared.
{'input_ids': tensor([[ 2, 105, 2364, 109, 255999, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144, 262144,
256000, 108, 3689, 563, 506, 8918, 236881, 106, 107,
105, 4368, 107]], device='cuda:1'),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:1'),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:1'),
'pixel_values': tensor([[[[-0.8667, -0.8588, -0.8588, ..., -0.7020, -0.6784, -0.6549],
[-0.8588, -0.8510, -0.8510, ..., -0.7255, -0.7020, -0.6863],
[-0.8431, -0.8353, -0.8353, ..., -0.7176, -0.7020, -0.6863],
...,
[-0.7569, -0.7647, -0.7647, ..., -0.8275, -0.8275, -0.8275],
[-0.7647, -0.7647, -0.7647, ..., -0.8275, -0.8275, -0.8275],
[-0.7647, -0.7647, -0.7647, ..., -0.8353, -0.8275, -0.8196]],
[[-0.9059, -0.8980, -0.8980, ..., -0.7725, -0.7490, -0.7255],
[-0.8980, -0.8902, -0.8902, ..., -0.7961, -0.7725, -0.7569],
[-0.8824, -0.8745, -0.8745, ..., -0.7882, -0.7725, -0.7569],
...,
[-0.8196, -0.8275, -0.8275, ..., -0.8588, -0.8588, -0.8588],
[-0.8196, -0.8196, -0.8196, ..., -0.8588, -0.8588, -0.8588],
[-0.8196, -0.8196, -0.8196, ..., -0.8667, -0.8588, -0.8510]],
[[-0.9373, -0.9294, -0.9294, ..., -0.8118, -0.7882, -0.7647],
[-0.9294, -0.9216, -0.9216, ..., -0.8353, -0.8118, -0.7961],
[-0.9137, -0.9059, -0.9059, ..., -0.8275, -0.8118, -0.7961],
...,
[-0.9216, -0.9294, -0.9294, ..., -0.8667, -0.8667, -0.8667],
[-0.9451, -0.9451, -0.9451, ..., -0.8667, -0.8667, -0.8667],
[-0.9451, -0.9451, -0.9451, ..., -0.8745, -0.8667, -0.8588]]]],
device='cuda:1')}
Let’s first decode the input_ids:
<bos><start_of_turn>user
<start_of_image><image_soft_token><image_soft_token>...<image_soft_token><image_soft_token><end_of_image>
What is the password?<end_of_turn>
<start_of_turn>model
We can see the processor wraps the image tokens between <start_of_image> and <end_of_image>. Because the image has yet to be processed by the vision_tower, the processor uses 256 <image_soft_token> tokens as placeholders for the moment. We can also infer that the token_type_ids mark these image soft tokens (with a value of 1).
The pixel_values contain the loaded and preprocessed image, with the shape [1, 3, 896, 896], which is what the vision_tower expects. Once the vision_tower has processed the image, the embeddings of the <image_soft_token> tokens will be replaced by the 256 outputs from the vision_tower.
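As a quick check, we can count how many positions token_type_ids flags as image tokens; it should come out to 256, matching the soft-token placeholders:

# Count the positions marked as image tokens in token_type_ids
num_image_tokens = int((model_inputs["token_type_ids"] == 1).sum())
print(num_image_tokens)  # should print 256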
So, that’s basically how multimodal input works. For Gemma 3, a picture is worth 256 tokens when “Pan & Scan” is disabled, and when enabled, it can actually be worth a thousand tokens!
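To close the loop, we can generate from these multimodal inputs exactly as in the text-only case. A minimal sketch; the one extra step I add is casting the floating-point pixel_values to bfloat16 so they match the model weights:

# Cast floating-point tensors (pixel_values) to the model's dtype; integer input_ids are unaffected
model_inputs = model_inputs.to(device, dtype=torch.bfloat16)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=50)

print(processor.decode(generation[0][input_len:], skip_special_tokens=True))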
We’ve reviewed the it model, and that’s all for today. In the next post, I’ll explore the pre-trained pt model.