
Fine-Tuned Model Outputs Garbage? Template Mismatch Debug Guide

Your model trained fine but inference fails? Learn how to catch chat template mismatches before wasting GPU hours on debugging the wrong thing.

TL;DR: If your training data uses ChatML format but your inference setup applies Llama-2 formatting, the model outputs nonsense. Check template consistency before training.

When fine-tuning LLMs, training often succeeds but inference produces garbage output. The model might repeat phrases, ignore prompts, or generate random tokens. After checking datasets, hyperparameters, and checkpoints multiple times, the issue usually turns out to be a one-line config problem: a chat template mismatch.

The model learns specific conversation markers during training.

If inference uses different markers, the model doesn’t recognize them. This happens because the training data uses one format (ChatML, Llama-2, Llama-3) while the inference library applies a different default.

How Chat Templates Work

When fine-tuning a conversational model, training data needs structure to separate user messages from assistant responses. Different models use different markers.

For example, ChatML uses <|im_start|> and <|im_end|> tokens:

<|im_start|>user
What is RAG?<|im_end|>
<|im_start|>assistant
RAG is...<|im_end|>

Llama-2 uses [INST] and [/INST]:

<s>[INST] What is RAG? [/INST] RAG is...</s>

Llama-3 changed the format entirely:

<|start_header_id|>user<|end_header_id|>

What is RAG?<|eot_id|>

Mistral is similar to Llama-2 but simpler:

<s>[INST] What is RAG? [/INST]

On the other hand, Alpaca uses plain text markers:

### Instruction:
What is RAG?

### Response:

The model learns these specific patterns during training. If training data has one format and inference applies another, the model waits for markers that never appear or sees markers it never learned.
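
You can see the differences directly by rendering the same conversation through different tokenizers. A minimal sketch (the model IDs are just examples, and the Llama-2 repo is gated, so swap in models you have access to):

from transformers import AutoTokenizer

messages = [
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "RAG is..."},
]

# Each tokenizer renders the exact same messages with its own markers
for model_id in ["TinyLlama/TinyLlama-1.1B-Chat-v1.0", "meta-llama/Llama-2-7b-chat-hf"]:
    tok = AutoTokenizer.from_pretrained(model_id)
    print(f"--- {model_id} ---")
    print(tok.apply_chat_template(messages, tokenize=False))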

Common Scenarios

The most frequent case: you download a dataset from HuggingFace in ChatML format, load a Llama base model that defaults to Llama-2 format, and don’t explicitly set the chat template. Training runs fine because the model learns patterns from whatever data you provide. However, at inference, your library applies its default template and you get garbage output.

Another common issue is that the training setup and deployment setup use different tools. You train with the HuggingFace Trainer but deploy with vLLM or Ollama. Each tool has different defaults and assumptions about formatting.

Custom templates for specialized tasks create problems too. You create a custom format for your task, then deploy to standard inference servers that don’t know about it. The server applies whatever template it thinks is right based on the model name or config.

Template Mismatch During Training: Knowledge Loss

Template mismatch doesn’t just cause inference failures - it affects what the model learns during training.

When fine-tuning with the wrong template, the model can’t properly distinguish between conversation structure and actual content. It learns the pattern of your training data but may treat factual information as formatting noise.

This is why sometimes a fine-tuned model “sort of works” but gives wrong information. It learned conversation flow but lost the knowledge. The template confusion during training makes the model unable to separate structure from content.

HuggingFace’s research shows this affects learning effectiveness, not just output quality.

What Garbage Output Actually Looks Like

Let’s look at an example using a fine-tuned TinyLlama model trained on Kubernetes Q&A.

The model was trained using ChatML format but tested with Llama-2 format at inference to demonstrate the mismatch.
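
Here’s a rough sketch of how you could reproduce that comparison yourself, assuming the fine-tuned model is saved locally at ./fine-tuned-model (the path and generation settings are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./fine-tuned-model"  # trained on ChatML-formatted data
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

question = "What is a pod in Kubernetes?"

# Correct: let the tokenizer apply the template it was trained with
good_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

# Wrong: hand-rolled Llama-2 formatting the model never saw during training
bad_prompt = f"<s>[INST] {question} [/INST]"

for prompt in (good_prompt, bad_prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=80)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=False))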

With correct ChatML template (same as training):

Question: What is a pod in Kubernetes?
Response: A pod is a self-contained unit of application that runs on a node. 
It contains containers and a volume.

The model understands the question and responds coherently.

With wrong Llama-2 template:

Response: [INST] What is a pod in Kubernetes? [/INST] What is a container 
in Kubernetes? [/INST] What is a deployment? [/INST] What is a replica set? 
[/INST] What is a service? [/INST] What is a cluster? [/INST] What is a 
namespace? [/INST] What is a statefulset? [/INST] What is a secret? [/INST]...

The model just repeats the [/INST] template marker. It’s waiting for ChatML markers like <|im_start|> but sees Llama-2’s [INST] instead.

From a code perspective, inference runs successfully. The output is just useless because the template doesn’t match what the model learned during training.

So how can we ensure we use matching templates?

Role Handling and Multi-Turn Conversations

Chat templates don’t just format messages - they manage role identification. Different formats handle user, assistant, and system roles differently.

Let’s test this with the same TinyLlama model. First, system role handling with correct ChatML:

System: You are a Kubernetes expert. Be concise.
User: What is a pod?
Response: A pod is a Kubernetes container that runs a single application.

The model follows the system instruction and gives a concise answer.

Now the same prompt with wrong Llama-2 format:

<<SYS>>You are a Kubernetes expert. Be concise.<</SYS>>
[INST] What is a pod? [/INST]
Response: <<SYS>>You are a Kubernetes expert. Be concise.<</SYS>>
[INST] What is a pod? [/INST]

The model just echoes back the input. It doesn’t understand that <<SYS>> is a system role marker because it was trained with ChatML’s <|system|> format. The system instruction becomes noise.

Multi-turn conversations show the same problem. With correct ChatML format:

Turn 1: What is a pod?
Response: A pod is the smallest deployable unit in Kubernetes.
Turn 2: Can a pod have multiple containers?
Response: Yes, a pod can have multiple containers. Each container runs a separate process.

The model maintains context and answers the follow-up question correctly.

With wrong Llama-2 format:

[INST] What is a pod? [/INST] A pod is the smallest deployable unit...
[INST] Can a pod have multiple containers? [/INST]
Response: [INST] What is a pod? [/INST]... [INST] Can a pod have multiple 
containers? [/INST] No. Pods are isolated by namespaces. [INST] How do you 
check if a pod is running? [/INST] Use kubectl get pods...

The model loses track of the conversation structure. It starts inventing new questions and answers randomly because it can’t parse where one turn ends and another begins.
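
The fix in both cases is the same: build the conversation as role-tagged messages and let the tokenizer render them with the template it was trained on. A small sketch (some templates handle or reject system messages differently, so check yours):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")

# Multi-turn conversation with a system role, rendered through the
# tokenizer's own template so every marker matches what training used
messages = [
    {"role": "system", "content": "You are a Kubernetes expert. Be concise."},
    {"role": "user", "content": "What is a pod?"},
    {"role": "assistant", "content": "A pod is the smallest deployable unit in Kubernetes."},
    {"role": "user", "content": "Can a pod have multiple containers?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # every role should be wrapped in the markers seen during training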

How to Check Your Setup

Look at your training data format and compare it to what your inference library applies. For example, if you’re using HuggingFace, check the tokenizer’s chat template:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("your-model")
print(tokenizer.chat_template)

You can also verify by checking what the tokenizer actually produces:

messages = [
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "RAG is..."}
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)

Compare this output to your training data format. They should match exactly.
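
If your training file also stores the pre-formatted text, you can automate that comparison. A quick sketch, assuming a JSONL file where each record has a "messages" list and optionally a "text" field (adjust the keys to your schema):

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model")

# Compare the tokenizer's rendering of one training example against
# the formatted string stored in the training file
with open("data/train.jsonl") as f:
    sample = json.loads(f.readline())

rendered = tokenizer.apply_chat_template(sample["messages"], tokenize=False)
stored = sample.get("text")

if stored is not None and rendered.strip() != stored.strip():
    print("WARNING: tokenizer template does not match the training data formatting")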

How to Fix It

You can either convert your training data to match the tokenizer’s expected format, or override the tokenizer to use your training data’s format. Converting large datasets is time-consuming, so overriding the tokenizer is usually faster.

Set the chat template in your training script:

tokenizer.chat_template = "{% for message in messages %}..."

The template string is a Jinja2 template that tells the tokenizer how to format conversations. You can find examples in HuggingFace docs or model cards.

For ChatML format:

tokenizer.chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}"

For Llama-2 format:

tokenizer.chat_template = "{% for message in messages %}{% if message['role'] == 'user' %}{{ '<s>[INST] ' + message['content'] + ' [/INST]' }}{% else %}{{ message['content'] + '</s>' }}{% endif %}{% endfor %}"

Save the tokenizer with your model after setting the template so inference uses the same format:

tokenizer.save_pretrained("./fine-tuned-model")
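
As a sanity check, reload the saved tokenizer and confirm the template survived the round trip:

from transformers import AutoTokenizer

# save_pretrained writes the template into tokenizer_config.json
reloaded = AutoTokenizer.from_pretrained("./fine-tuned-model")
assert reloaded.chat_template == tokenizer.chat_template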

Deployment-Specific Template Issues

Different inference engines handle templates differently:

vLLM: Auto-detects template from model config. If you used a custom template during training, you need to override vLLM’s detection:

# In vLLM: the OpenAI-compatible server takes --chat-template; the offline API takes it per call
llm = LLM(model="your-model")
outputs = llm.chat(messages, chat_template=open("path/to/template.jinja").read())

Ollama: Uses the Modelfile’s TEMPLATE definition, which must explicitly match the training format:

TEMPLATE """{{- range .Messages }}
<|im_start|>{{ .Role }}
{{ .Content }}<|im_end|>
{{- end }}"""

Text Generation Inference (TGI): Reads from model’s tokenizer_config.json. If template isn’t saved there, TGI applies defaults based on model name - which might be wrong.
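
If you’re not sure whether the template actually made it into the saved files, a small check like this helps (the path is illustrative):

import json

# TGI reads the chat template from tokenizer_config.json in the model directory
with open("./fine-tuned-model/tokenizer_config.json") as f:
    config = json.load(f)
print("chat_template present:", "chat_template" in config)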

Each inference engine has different defaults and assumptions. Document your template choice in the model card and explicitly set it in deployment config.

The Evaluation Impact

This problem also affects evaluation. If you evaluate a fine-tuned model with the wrong chat template, your metrics will be significantly off. According to HuggingFace’s evaluation guide, incorrect templates can drop accuracy by 20-30%.

When evaluating:

  • Use the same template as training (see the sketch after this list)
  • Check template consistency across evaluation datasets
  • Verify template matches model’s expected format
  • Don’t trust metrics if templates don’t align
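
For the first point, here’s a minimal sketch of formatting evaluation prompts through the training-time template (the question list is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")

# Render every evaluation prompt with the tokenizer's saved chat template
# so the metrics measure the model, not a formatting mismatch
eval_questions = ["What is a pod?", "What is a deployment?"]
eval_prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}], tokenize=False, add_generation_prompt=True
    )
    for q in eval_questions
]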

How the Validator Detects Mismatches

The validator tool does pattern matching to detect template formats:

  • ChatML: <|im_start|> and <|im_end|> markers
  • Llama-2: [INST] and [/INST] tags, with optional <<SYS>>
  • Llama-3: <|start_header_id|> and <|end_header_id|>
  • Mistral: <s>[INST] format
  • Alpaca: ### Instruction: and ### Response:
  • Vicuna: USER: and ASSISTANT: markers

What does the validator do?

It does a few things (a minimal detection sketch follows the list):

  1. Scans your training file for these patterns
  2. Checks your config file for template specifications
  3. Compares detected formats
  4. Reports mismatches before training starts
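
The detection step is plain pattern matching. A minimal sketch of the idea (not the tool’s actual code):

import re

# Marker patterns for the template families listed above
TEMPLATE_MARKERS = {
    "chatml": [r"<\|im_start\|>", r"<\|im_end\|>"],
    "llama-2": [r"\[INST\]", r"\[/INST\]"],
    "llama-3": [r"<\|start_header_id\|>", r"<\|end_header_id\|>"],
    "alpaca": [r"### Instruction:", r"### Response:"],
    "vicuna": [r"\bUSER:", r"\bASSISTANT:"],
}

def detect_formats(path):
    """Return the set of template families whose markers appear in the file."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return {
        name
        for name, patterns in TEMPLATE_MARKERS.items()
        if any(re.search(p, text) for p in patterns)
    }

formats = detect_formats("data/train.jsonl")
if len(formats) != 1:
    print(f"Problem: expected exactly one format, found {formats or 'none'}")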

The web version at clarifyintel.com/validate works similarly: you can upload your training file and config for free, up to a few validations per day.

Building Template Validation Into Your Pipeline

Don’t just check before training - make it part of your workflow:

Pre-training validation:

# Check before starting training
python validate_templates.py --training data.jsonl --config config.yaml

After training, verify inference setup:

tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")
test_messages = [{"role": "user", "content": "test"}]
output = tokenizer.apply_chat_template(test_messages, tokenize=False)
print(output)


Output:

<|user|>
test</s>

The <|user|> marker shows that the tokenizer’s saved chat template is being applied (here, TinyLlama’s ChatML-style variant). If you see [INST] instead, there’s a mismatch.

Automating Validation in CI/CD

You can also add template validation to your CI pipeline to catch mismatches before training starts.

name: Validate Training Data
on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Install validator
        run: pip install chat-template-detector
      
      - name: Check template consistency
        run: |
          chat-template-detector validate \
            --training-file data/train.jsonl \
            --model meta-llama/Llama-2-7b-chat-hf

For GitLab CI / Jenkins:

# Add to your pipeline script
pip install chat-template-detector
chat-template-detector validate \
  --training-file train.jsonl \
  --inference-config config.yaml

# Exit with error if mismatch found
if [ $? -ne 0 ]; then
  echo "Template mismatch detected - blocking deployment"
  exit 1
fi

Key Takeaways

Template mismatches feel obvious once you know about them, but they’re not intuitive when starting with fine-tuning. The lack of error messages makes it worse because you debug everything else first.

  • Different models use completely incompatible chat formats - they’re not variations of a standard, they’re different systems
  • Training succeeds regardless of template mismatch because the model learns whatever patterns you provide
  • Inference fails silently - no errors, just garbage output or knowledge loss
  • Template mismatch can cause repeating output, token soup, or instruction-following failures
  • System role handling differs across formats and breaks if mismatched
  • Each inference engine (vLLM, Ollama, TGI) has different template defaults
  • Evaluation accuracy can drop 20-30% with the wrong template
  • Check template consistency before training, not after wasting GPU hours
  • Override the tokenizer’s chat template explicitly in your training script
  • Document your template choice in model cards and deployment configs
  • Build validation into your training and deployment pipeline

If you’re getting weird output from a fine-tuned model and everything else looks right, check the chat template first. It’s probably that.

For more on chat template implementation details, see HuggingFace’s chat template documentation.