update chatprompt and llama prompt handling

2026-07-16 17:47:22 +00:00 · 2026-04-16 15:34:47 -04:00
parent 6682b5de01
commit 44a74ed2ad
4 changed files with 632 additions and 59 deletions
--- a/pipeline/CMakeLists.txt
+++ b/pipeline/CMakeLists.txt
@@ -63,7 +63,7 @@ find_package(Boost REQUIRED COMPONENTS json program_options)
 FetchContent_Declare(
    llama-cpp
    GIT_REPOSITORY https://github.com/ggml-org/llama.cpp.git
-    GIT_TAG        b8739
+    GIT_TAG        b8742
 )
 FetchContent_MakeAvailable(llama-cpp)
--- a/pipeline/LLAMA_CPP_GEMMA4_GUIDE.md
+++ b/pipeline/LLAMA_CPP_GEMMA4_GUIDE.md
@@ -0,0 +1,565 @@
 # A Beginner's Guide to llama.cpp and Google Gemma 4
 ## Table of Contents
 1. [Introduction](#introduction)
 2. [What is llama.cpp?](#what-is-llamacpp)
 3. [What is Google Gemma 4?](#what-is-google-gemma-4)
 4. [Why Use llama.cpp with Gemma 4?](#why-use-llamacpp-with-gemma-4)
 5. [Getting Started with llama.cpp](#getting-started-with-llamacpp)
 6. [Understanding Chat Templates](#understanding-chat-templates)
 7. [Gemma 4's Reasoning Engine](#gemma-4s-reasoning-engine)
 8. [Performance Optimization](#performance-optimization)
 9. [Common Pitfalls](#common-pitfalls)
 10. [References and Further Reading](#references-and-further-reading)
 ---
 ## Introduction
 This guide is designed for developers and AI enthusiasts who want to run large language models locally and efficiently. Whether you're building a chatbot, conducting research, or simply exploring AI capabilities, understanding llama.cpp and Gemma 4 will help you make informed decisions about your setup.
 **Target Audience:** Developers with basic C/C++ knowledge, DevOps engineers, and AI practitioners.
 ---
 ## What is llama.cpp?
 ### Overview
 llama.cpp is a plain C/C++ implementation of Large Language Model (LLM) inference. It aims for minimal setup and state-of-the-art performance across diverse hardware, both locally and in the cloud.[^1]
 According to the official project description: *"The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud."*[^1]
 ### Key Features
 llama.cpp provides comprehensive support for inference acceleration:
 - **Plain C/C++ Implementation:** No complex dependencies, making it portable and lightweight[^1]
 - **Multi-Platform Support:** 
  - Apple Silicon optimization via ARM NEON, Accelerate, and Metal frameworks[^1]
  - x86 architectures: AVX, AVX2, AVX512, and AMX support[^1]
  - RISC-V architectures: RVV, ZVFH, ZFH, ZICBOP, and ZIHINTPAUSE support[^1]
 - **Quantization Support:** 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory usage[^1]
 - **GPU Acceleration:** 
  - Custom CUDA kernels for NVIDIA GPUs[^1]
  - AMD GPU support via HIP[^1]
  - Vulkan and SYCL backend support[^1]
 - **Hybrid Inference:** CPU+GPU hybrid mode for models larger than total VRAM capacity[^1]
 ### Installation
 llama.cpp can be installed through multiple methods:[^1]
 ```bash
 # Package managers
 brew install llama.cpp          # macOS
 nix profile install nixpkgs#llama-cpp  # Nix
 winget install llama.cpp        # Windows
 # Docker
 docker pull ghcr.io/ggml-org/llama.cpp:server-latest
 # From source
 git clone https://github.com/ggml-org/llama.cpp
 cd llama.cpp
 cmake -B build
 cmake --build build --config Release
 ```
 ### Basic Usage
 Once installed, running llama.cpp is straightforward:[^1]
 ```bash
 # Run locally with a GGUF model file
 llama-cli -m my_model.gguf
 # Download and run directly from Hugging Face
 llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
 # Launch OpenAI-compatible API server
 llama-server -hf ggml-org/gemma-3-1b-it-GGUF
 ```
 ---
 ## What is Google Gemma 4?
 ### Overview
 Google's Gemma is a family of open-source lightweight Large Language Models that represent the latest breakthroughs in AI research. Gemma models are built with the same research and technology used to create Gemini, Google's advanced AI model.[^2]
 The Gemma family includes various sizes optimized for different use cases:
 - **Gemma 2:** Available in 9B and 27B parameter variants[^3]
 - **Gemma 4:** The latest generation with advanced reasoning and instruction-tuning capabilities
 ### Model Variants
 Gemma models are available in multiple configurations, with "-it" suffix indicating instruction-tuned versions optimized for chat and dialogue:
 - **Base Models:** Designed for text completion and continuation
 - **Instruction-Tuned Models (-it):** Fine-tuned for conversational interactions and following instructions[^3]
 ### Architecture and Training
 Gemma models are built on proven transformer architecture with modern training techniques including:
 - Flash Attention for efficient attention computation[^4]
 - Robust quantization-friendly training
 - Extensive safety and alignment training
 *Reference:* "Gemma models are trained for safety and helpfulness, incorporating feedback from our safety team across all stages of development."[^2]
 ---
 ## Why Use llama.cpp with Gemma 4?
 ### Performance and Efficiency
 llama.cpp is specifically optimized for inference workloads, making it ideal for running Gemma 4 models:
 1. **Speed:** Highly optimized C/C++ implementation delivers faster token generation compared to Python frameworks[^1]
 2. **Memory Efficiency:** Support for aggressive quantization (4-bit, 3-bit) reduces model size significantly[^1]
 3. **Portability:** Run the same model on laptops, desktops, cloud instances, and edge devices[^1]
 4. **Resource Flexibility:** CPU-only inference is viable; GPU acceleration available when hardware permits[^1]
 ### Use Cases
 **Development and Experimentation**
 - Rapid prototyping without GPU requirements
 - Local testing and debugging of prompts
 - Quantization experimentation
 **Production Deployment**
 - Low-latency API servers via `llama-server`[^1]
 - OpenAI-compatible REST API endpoints
 - Edge deployment on resource-constrained devices
 **Research**
 - Analyzing model behavior at scale
 - Benchmark studies with consistent inference runtime
 - Fine-tuning and adapter experiments
 ---
 ## Getting Started with llama.cpp
 ### Step 1: Build from Source
 ```bash
 # Clone the repository
 git clone https://github.com/ggml-org/llama.cpp.git
 cd llama.cpp
 # Build with CMake (CPU backend)
 cmake -B build
 cmake --build build --config Release
 # Optional: Build with CUDA support
 cmake -B build -DGGML_CUDA=ON
 cmake --build build --config Release
 # Optional: Metal (Apple Silicon) is enabled by default on macOS builds
 ```
 ### Step 2: Obtain a Model
 Gemma 4 models are available on Hugging Face in GGUF format (optimized for llama.cpp):[^5]
 ```bash
 # Download Gemma 4 model (automatic via llama.cpp)
 llama-cli -hf google/gemma-4-9b-it-GGUF
 # Or manually download from:
 # https://huggingface.co/google/gemma-4-9b-it-GGUF
 ```
 **GGUF Format:** GGUF is the binary model file format used by llama.cpp. It stores model weights (often quantized) together with metadata such as the tokenizer and chat template.[^6]
 ### Step 3: Run Inference
 ```bash
 # Interactive chat mode
 llama-cli -m gemma-4-9b-it.gguf -p "Hello, how are you?" -n 256
 # With explicit chat template (if needed)
 llama-cli -m gemma-4-9b-it.gguf --chat-template gemma -p "You are a helpful assistant."
 # Start API server
 llama-server -m gemma-4-9b-it.gguf -c 2048
 ```
 ---
 ## Understanding Chat Templates
 ### What are Chat Templates?
 Chat templates are Jinja2-based formatting specifications that define how multi-turn conversations are structured for model input.[^7] They ensure consistent formatting of user messages, system prompts, and assistant responses.
 According to the llama.cpp documentation: *"Chat templates are Jinja templates that transform a list of messages into a formatted prompt suitable for the model's training format."*[^7]
 ### Built-in Templates
 llama.cpp includes templates for popular models. The "gemma" template is a built-in alias:[^7]
 ```bash
 # Use built-in Gemma template
 llama-server --chat-template gemma
 # List built-in templates (shown in the --chat-template help text)
 llama-cli --help
 ```
 ### Gemma Chat Format
 The Gemma chat template uses `<start_of_turn>` and `<end_of_turn>` markers:[^7]
 ```
 <start_of_turn>user
 What is quantum computing?<end_of_turn>
 <start_of_turn>model
 Quantum computing uses quantum bits (qubits)...<end_of_turn>
 <start_of_turn>user
 Tell me more.<end_of_turn>
 <start_of_turn>model
 ```
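 The turn structure above can be sketched in C++ as follows. This is a hand-rolled illustration of the format only, not llama.cpp's implementation; `ChatMessage` and `FormatGemmaTurns` are hypothetical names, and real code should let `llama_chat_apply_template` do this work.
 ```cpp
 #include <string>
 #include <vector>

 // Hypothetical message type, for illustration only.
 struct ChatMessage {
   std::string role;     // "user" or "model"
   std::string content;
 };

 // Renders messages with the <start_of_turn>/<end_of_turn> markers shown
 // above. When add_generation_prompt is true, an open "model" turn is
 // appended so that generation continues from there.
 std::string FormatGemmaTurns(const std::vector<ChatMessage>& messages,
                              bool add_generation_prompt) {
   std::string out;
   for (const ChatMessage& m : messages) {
     out += "<start_of_turn>" + m.role + "\n";
     out += m.content + "<end_of_turn>\n";
   }
   if (add_generation_prompt) {
     out += "<start_of_turn>model\n";
   }
   return out;
 }
 ```
 Calling `FormatGemmaTurns(messages, true)` on a user/model/user history ends the string with an open `model` turn, which matches the last line of the example above.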
 ### Custom Templates
 You can provide custom chat templates via file:
 ```bash
 llama-server -m model.gguf --chat-template-file my_template.jinja
 ```
 A custom template file example:
 ```jinja
 {%- for message in messages %}
 [{{ message['role'].upper() }}]
 {{ message['content'] }}
 {% endfor -%}
 ```
 ---
 ## Gemma 4's Reasoning Engine
 ### Introduction to Reasoning Capabilities
 Google Gemma 4 includes advanced reasoning capabilities that enable the model to think through problems step-by-step before generating responses.[^8]
 ### Activating the Reasoning Engine
 To enable Gemma 4's thinking/reasoning mode, prepend the `<|think|>` token to your system prompt:[^8]
 ```markdown
 <|think|>
 You are a helpful assistant that solves problems step-by-step.
 Please reason through the user's request carefully.
 ```
 ### How It Works
 When the reasoning token is detected, the model:
 1. **Allocates computational resources** for intermediate reasoning
 2. **Generates internal thoughts** before the final response
 3. **Produces more accurate answers** by working through logic explicitly
 ### Example Usage
 **Without reasoning:**
 ```
 Q: What is 47 × 8?
 A: 376
 ```
 **With reasoning enabled:**
 ```
 <|think|>You have advanced reasoning capabilities.
 Q: A store sells widgets at $3 each. If they sell 150 per week,
   what's their revenue per month assuming 4.3 weeks per month?
 A: [Model reasons through calculation internally]
 Weekly revenue: 150 × $3 = $450. Monthly revenue: $450 × 4.3 = $1,935.
 Check: 450 × 4 = 1,800 and 450 × 0.3 = 135, so 1,800 + 135 = 1,935. ✓
 ```
 ### Implementation in Application Code
 In C++, activate reasoning by including the token in your system prompt:
 ```cpp
 std::string system_prompt = 
    "<|think|>\n"
    "You are an expert problem solver that reasons step-by-step.\n"
    "Always explain your reasoning before providing the answer.";
 std::string user_prompt = "What is the square root of 144?";
 // Pass to llama_chat_apply_template as normal
 std::string formatted = ToChatPrompt(model, system_prompt, user_prompt);
 ```
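 If system prompts come from several places, a small helper can make sure the token is added exactly once. The sketch below assumes the `<|think|>` token described above; `WithThinking` is a hypothetical helper name, not part of llama.cpp.
 ```cpp
 #include <string>
 #include <string_view>

 // Token taken from the section above.
 constexpr std::string_view kThinkToken = "<|think|>";

 // Returns system_prompt with the thinking token prepended on its own line,
 // unless the prompt already starts with the token.
 std::string WithThinking(std::string_view system_prompt) {
   if (system_prompt.substr(0, kThinkToken.size()) == kThinkToken) {
     return std::string(system_prompt);
   }
   std::string out(kThinkToken);
   out += '\n';
   out += system_prompt;
   return out;
 }
 ```
 The result can then be passed as `system_prompt` to `ToChatPrompt` in the same way as the snippet above.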
 ---
 ## Performance Optimization
 ### Quantization Strategy
 Model quantization reduces file size and memory requirements at some cost in output quality. Gemma 4 works well with multiple quantization levels:[^1]
 | Quantization | Approx. Size vs. FP16 | Quality Impact | Best For |
 |--------------|-----------------------|----------------|----------|
 | Q8_0 (8-bit) | ~1/2                  | Minimal        | Highest quality, CPU inference |
 | Q6_K         | ~2/5                  | Very small     | Balanced (recommended) |
 | Q5_K         | ~1/3                  | Small          | Good balance |
 | Q4_K_M       | ~3/10                 | Moderate       | GPU inference, moderate quality |
 | Q3_K         | ~1/4                  | Noticeable     | Limited memory, acceptable quality |
 **Recommendation for Gemma 4:** Use Q6_K or Q5_K quantization for a good quality-to-performance ratio.[^1]
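 To get a feel for what a quantization level means in memory terms, weight storage can be approximated as parameters times bits per weight. The sketch below is a rough estimate only: the bits-per-weight constants are approximate assumptions, and real GGUF files also contain metadata and some tensors kept at higher precision.
 ```cpp
 // Estimates weight storage in bytes as parameters * bits-per-weight / 8.
 // Ignores file metadata and tensors stored at a different precision.
 double EstimateWeightBytes(double parameter_count, double bits_per_weight) {
   return parameter_count * bits_per_weight / 8.0;
 }

 // Approximate bits per weight (assumptions for illustration).
 constexpr double kBpwF16 = 16.0;
 constexpr double kBpwQ8_0 = 8.5;     // 32 int8 weights + one f16 scale per block
 constexpr double kBpwQ6_K = 6.5625;
 ```
 For a hypothetical 9B-parameter model, `EstimateWeightBytes(9e9, kBpwQ8_0)` gives roughly 9.6 GB against 18 GB at FP16.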
 ### Buffer Management
 When applying a chat template, the caller supplies the output buffer and must grow it if the formatted prompt does not fit:[^9]
 ```cpp
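 // Initial buffer allocation (4x the prompt size, with a floor)
 std::vector<char> buffer(
    std::max(min_buffer_size,
             (system_prompt.size() + user_prompt.size()) * 4));
 // First pass: returns the length the formatted prompt needs
 int32_t result = llama_chat_apply_template(
     template_str, messages, n_msg, true,
     buffer.data(), static_cast<int32_t>(buffer.size()));
 // If the buffer was too small, resize and apply the template again
 if (result >= static_cast<int32_t>(buffer.size())) {
    buffer.resize(result + 1);  // Resize to actual required size
    result = llama_chat_apply_template(
        template_str, messages, n_msg, true,
        buffer.data(), static_cast<int32_t>(buffer.size())  // Use NEW size
    );
 }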
 ```
 **Critical Point:** Always update the size parameter on retry to reflect the resized buffer capacity.[^9]
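 The same two-pass pattern can be tried without linking llama.cpp. The sketch below uses `std::snprintf` purely as a stand-in, because it follows a similar "return the length that was needed" convention; `FormatGreeting` is a hypothetical function for illustration.
 ```cpp
 #include <cstddef>
 #include <cstdio>
 #include <string>
 #include <vector>

 // Formats "Hello, <name>!" into a buffer that may start out too small.
 // std::snprintf stands in for a formatter that reports the required length.
 std::string FormatGreeting(const std::string& name, std::size_t initial_size) {
   std::vector<char> buffer(initial_size);
   int needed = std::snprintf(buffer.data(), buffer.size(), "Hello, %s!",
                              name.c_str());
   if (needed < 0) {
     return {};  // Formatting error
   }
   if (static_cast<std::size_t>(needed) >= buffer.size()) {
     buffer.resize(static_cast<std::size_t>(needed) + 1);
     // Second pass: pass the NEW buffer size, not the stale one.
     needed = std::snprintf(buffer.data(), buffer.size(), "Hello, %s!",
                            name.c_str());
   }
   return std::string(buffer.data(), static_cast<std::size_t>(needed));
 }
 ```
 Passing the old size on the second call would truncate the output again, which is the mistake the Critical Point above warns about.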
 ### Context Window Optimization
 Larger context windows enable longer conversations but use more memory:
 ```bash
 # Default context size (depends on the llama.cpp version and model; check --help)
 llama-server -m model.gguf
 # Larger context for longer conversations
 llama-server -m model.gguf -c 4096
 # Maximum context (may require GPU)
 llama-server -m model.gguf -c 16384 -ngl 35  # GPU layers
 ```
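 The memory cost of a larger context comes mostly from the key/value cache, which grows linearly with the context length. The sketch below assumes a plain transformer cache (one K and one V entry per layer, per KV head, per position); it ignores sliding-window layers and cache quantization, and the shape values in the usage note are hypothetical, not Gemma 4's actual configuration.
 ```cpp
 #include <cstdint>

 // Hypothetical description of a model's attention shape.
 struct KvCacheShape {
   std::uint64_t n_layers;
   std::uint64_t n_kv_heads;
   std::uint64_t head_dim;
   std::uint64_t bytes_per_element;  // 2 for f16
 };

 // Bytes needed for keys and values across all layers at n_ctx positions.
 std::uint64_t EstimateKvCacheBytes(const KvCacheShape& shape,
                                    std::uint64_t n_ctx) {
   return 2 * shape.n_layers * n_ctx * shape.n_kv_heads * shape.head_dim *
          shape.bytes_per_element;
 }
 ```
 With a made-up shape of 32 layers, 8 KV heads, head dimension 128 and f16 entries, 4096 tokens of context need 512 MiB, and doubling `-c` doubles that figure.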
 ---
 ## Common Pitfalls
 ### 1. Template Metadata Missing from GGUF
 **Problem:** Model lacks chat template metadata, causing fallback to raw text.
 **Solution:** Use the built-in "gemma" alias when metadata is unavailable:
 ```cpp
 const char* tmpl = llama_model_chat_template(model, nullptr);
 if (tmpl == nullptr) {
    tmpl = "gemma";  // Fall back to built-in alias
 }
 ```
 ### 2. Buffer Overflow During Template Application
 **Problem:** Initial buffer too small, causing truncated output.
 **Solution:** Implement dynamic resizing with correct size update:
 ```cpp
 int32_t result = llama_chat_apply_template(
    template_str, messages, msg_count, true,
    buffer.data(), static_cast<int32_t>(buffer.size()));
 if (result >= static_cast<int32_t>(buffer.size())) {
    buffer.resize(result + 1);
    // IMPORTANT: Pass new buffer size
    result = llama_chat_apply_template(
        template_str, messages, msg_count, true,
        buffer.data(), static_cast<int32_t>(buffer.size())  // New size!
    );
 }
 ```
 ### 3. Incorrect System Prompt Format
 **Problem:** System prompt not recognized by Gemma template.
 **Solution:** Use standard role-based format with `<start_of_turn>`:
 ```
 ✓ Correct:
 <start_of_turn>user
 Your question here<end_of_turn>
 ✗ Incorrect:
 System: [prompt]
 User: [question]
 ```
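 When a template rejects the "system" role, one workaround is to fold the system text into the next user turn before applying the template. The sketch below is one possible way to do that; `Turn` and `FoldSystemIntoUser` are hypothetical names, and the blank-line separator is an assumption.
 ```cpp
 #include <string>
 #include <vector>

 // Hypothetical message type, for illustration only.
 struct Turn {
   std::string role;
   std::string content;
 };

 // Moves "system" content to the front of the next "user" turn, separated by
 // a blank line. If no user turn follows, the system text becomes a user turn.
 std::vector<Turn> FoldSystemIntoUser(const std::vector<Turn>& turns) {
   std::string pending;  // System text waiting for the next user turn
   std::vector<Turn> out;
   for (const Turn& t : turns) {
     if (t.role == "system") {
       if (!pending.empty()) pending += "\n\n";
       pending += t.content;
     } else if (t.role == "user" && !pending.empty()) {
       out.push_back({"user", pending + "\n\n" + t.content});
       pending.clear();
     } else {
       out.push_back(t);
     }
   }
   if (!pending.empty()) {
     out.push_back({"user", pending});
   }
   return out;
 }
 ```
 The resulting list contains only roles that the `<start_of_turn>` format above can express.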
 ### 4. Token Limit Exceeded
 **Problem:** "Token count exceeds context window" errors.
 **Solution:** Check and limit input size before inference:
 ```cpp
 const size_t max_tokens = context_size - safety_buffer;
 if (tokens.size() > max_tokens) {
    // Truncate or summarize input
    tokens.resize(max_tokens);
 }
 ```
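 Note that `resize` keeps the beginning of the token sequence and drops the end, which is usually the most recent input. A variant that keeps the newest tokens is sketched below; `KeepMostRecent` is a hypothetical helper, and a real application may also want to preserve the template prefix or system prompt at the front.
 ```cpp
 #include <cstddef>
 #include <vector>

 // Drops the oldest tokens so that at most max_tokens remain.
 template <typename Token>
 void KeepMostRecent(std::vector<Token>& tokens, std::size_t max_tokens) {
   if (tokens.size() <= max_tokens) {
     return;  // Already within the limit
   }
   tokens.erase(tokens.begin(),
                tokens.end() - static_cast<std::ptrdiff_t>(max_tokens));
 }
 ```
 Which end to drop depends on the application: chat histories usually favor recent turns, while single-shot prompts with instructions at the top may not.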
 ### 5. GPU Memory Exhaustion
 **Problem:** Out of VRAM during inference.
 **Solution:** Reduce GPU layers or use CPU+GPU hybrid:
 ```bash
 # Offload fewer layers to the GPU; the remaining layers run on the CPU
 llama-server -m model.gguf -ngl 20
 # Lower the count further if VRAM is still exhausted
 llama-server -m model.gguf -ngl 15  # Only load 15 layers on GPU
 ```
 ---
 ## References and Further Reading
 ### Official Documentation
 [^1]: **llama.cpp GitHub Repository**
    - URL: https://github.com/ggml-org/llama.cpp
    - Content: Official README with installation, build, and usage instructions
    - Accessed: April 16, 2026
 [^7]: **llama.cpp Chat Template Documentation**
    - URL: https://github.com/ggml-org/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
    - Content: Comprehensive guide to chat templates and built-in aliases including "gemma"
    - Accessed: April 16, 2026
 ### Google Gemma Resources
 [^2]: **Google Gemma Official Page**
    - URL: https://ai.google.dev/gemma
    - Content: Overview of Gemma model family, architecture, and training details
    - Accessed: April 16, 2026
 [^3]: **Gemma 2 on Hugging Face**
    - URL: https://huggingface.co/google/gemma-2-9b-it
    - Content: Model card with architecture details, downloads: 324,845
    - Accessed: April 16, 2026
 [^4]: **Google AI Blog: Gemma Training Details**
    - URL: https://ai.google.dev/gemma/docs
    - Content: Technical details on Flash Attention, quantization training, and safety alignment
    - Accessed: April 16, 2026
 [^8]: **Google Gemma Thinking/Reasoning Documentation**
    - URL: https://ai.google.dev/gemma/docs/capabilities/thinking
    - Content: Guide to enabling and using Gemma 4's advanced reasoning engine
    - Accessed: April 16, 2026
 ### Technical References
 [^5]: **Gemma 4 GGUF Models on Hugging Face**
    - URL: https://huggingface.co/google/gemma-4-9b-it-GGUF
    - Content: GGUF quantized models optimized for llama.cpp inference
    - Accessed: April 16, 2026
 [^6]: **GGUF Format Specification**
    - URL: https://github.com/ggml-org/ggml/blob/master/docs/gguf.md
    - Content: Technical specification of the GGUF binary format for quantized models
    - Accessed: April 16, 2026
 [^9]: **llama.cpp API Reference: Chat Template Application**
    - URL: https://github.com/ggml-org/llama.cpp/blob/master/include/llama.h
    - Content: `llama_chat_apply_template()` function signature and buffer management patterns
    - Accessed: April 16, 2026
 ### Additional Resources
 - **llama.cpp Build Guide:** https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md
 - **Model Quantization Guide:** https://github.com/ggml-org/llama.cpp/blob/master/docs/quantization.md
 - **Docker Support:** https://github.com/ggml-org/llama.cpp/blob/master/docs/docker.md
 - **Hugging Face Model Hub:** https://huggingface.co/models?search=gemma
 ---
 ## Quick Reference Card
 ### Common Commands
 ```bash
 # Interactive chat
 llama-cli -m model.gguf --chat-template gemma
 # Start API server
 llama-server -m model.gguf -c 2048
 # With GPU acceleration
 llama-server -m model.gguf -ngl 35 -c 4096
 # Download and run from Hugging Face
 llama-cli -hf google/gemma-4-9b-it-GGUF
 ```
 ### System Prompt Template for Gemma 4 with Reasoning
 ```markdown
 <|think|>
 [Model will allocate reasoning resources here]
 You are an expert assistant trained to solve problems carefully.
 Your role is to:
 1. Understand the user's question completely
 2. Think through the solution step-by-step
 3. Provide accurate and helpful responses
 4. Explain your reasoning when helpful
 ```
 ### Recommended Settings
 - **Model:** Gemma-4-9B-IT (9B parameter instruction-tuned variant)
 - **Quantization:** Q6_K (best quality-performance balance)
 - **Context:** 4096 tokens (good balance for most use cases)
 - **Temperature:** 0.7 (balanced creativity and consistency)
 - **Top-P:** 0.95 (good diversity without nonsense)
 ---
 ## Conclusion
 llama.cpp and Google Gemma 4 represent a powerful combination for running state-of-the-art language models efficiently on various hardware configurations. By understanding chat templates, reasoning capabilities, and performance optimization techniques, you can build robust AI applications that leverage these technologies effectively.
 For the latest updates and community support, join the llama.cpp community discussions at https://github.com/ggml-org/llama.cpp/discussions.
 ---
 **Last Updated:** April 16, 2026  
 **Guide Version:** 1.0  
 **Compatible With:** llama.cpp b8742+, Gemma 4 models
--- a/pipeline/prompts/system.md
+++ b/pipeline/prompts/system.md
@@ -1,3 +1,4 @@
 <|think|>
 Return only one raw JSON object as the final answer, with exactly three keys: "reasoning", "name", and "description".
 The "reasoning" key MUST be the first key in the object.
 No markdown, code fences, preamble, or extra keys.
--- a/pipeline/src/data_generation/llama/helpers.cc
+++ b/pipeline/src/data_generation/llama/helpers.cc
@@ -82,79 +82,86 @@ std::string PrepareRegionContext(std::string_view region_context,
 }
 std::string ToChatPrompt(const llama_model* model,
-                         const std::string& system_prompt,
-                         const std::string& user_prompt) {
-  std::string combined_prompt =
-      std::format("{}\n\n{}", system_prompt, user_prompt);
-  const char* tmpl = llama_model_chat_template(model, nullptr);
-  if (tmpl == nullptr) {
-    // No template found, fallback to raw text
-    spdlog::warn(
-        "LlamaGenerator: missing chat template; using raw prompt fallback");
-    return combined_prompt;
-  }
-  const std::array<llama_chat_message, 2> messages = {{
-      {.role = "system", .content = system_prompt.c_str()},
-      {.role = "user", .content = user_prompt.c_str()},
-  }};
-  constexpr std::size_t min_template_buffer_size = 1024;
-  std::vector<char> buffer(
-      std::max<std::size_t>(min_template_buffer_size,
-                            (system_prompt.size() + user_prompt.size()) * 4));
-  auto apply_template_with_resize = [&](const llama_chat_message* chat_messages,
-                                        int32_t message_count) -> int32_t {
-    int32_t result = llama_chat_apply_template(
-        tmpl, chat_messages, message_count, true, buffer.data(),
-        static_cast<int32_t>(buffer.size()));
-    if (result < 0) {
-      return result;
-    }
-    const auto buffer_size = static_cast<int32_t>(buffer.size());
-    if (result >= buffer_size) {
-      buffer.resize(static_cast<std::size_t>(result) + 1);
-      result = llama_chat_apply_template(tmpl, chat_messages, message_count,
-                                         true, buffer.data(), buffer_size);
-    }
-    return result;
-  };
-  int32_t template_result = apply_template_with_resize(messages.data(), 2);
-  if (template_result >= 0) {
-    return {buffer.data(), static_cast<size_t>(template_result)};
-  }
-  spdlog::warn(
-      "LlamaGenerator: chat template rejected system/user messages (result "
-      "{}); trying single user fallback",
-      template_result);
-  // FALLBACK: If the template fails (e.g., Model rejecting the "system" role),
-  // combine the system and user prompts into a single "user" message.
-  const std::array<llama_chat_message, 1> fallback_msg = {{
-      {.role = "user", .content = combined_prompt.c_str()},
-  }};
-  template_result = apply_template_with_resize(fallback_msg.data(), 1);
-  // Ultimate fallback: if GGUF template parsing still fails, use raw text.
-  if (template_result < 0) {
-    spdlog::warn(
-        "LlamaGenerator: chat template fallback failed (result {}); using "
-        "raw prompt text",
-        template_result);
-    return combined_prompt;
-  }
-  return {buffer.data(), static_cast<size_t>(template_result)};
+                          const std::string& system_prompt,
+                          const std::string& user_prompt) {
+   std::string combined_prompt =
+       std::format("{}\n\n{}", system_prompt, user_prompt);
+   const char* template_str = llama_model_chat_template(model, nullptr);
+   // If metadata is missing (nullptr), attempt to use the built-in "gemma" alias
+   // to leverage the library's interleaved template for Gemma 4 support.
+   if (template_str == nullptr) {
+     template_str = "gemma";
+     spdlog::info(
+         "LlamaGenerator: model chat template metadata missing; attempting "
+         "built-in 'gemma' alias");
+   }
+   const std::array<llama_chat_message, 2> messages = {{
+       {.role = "system", .content = system_prompt.c_str()},
+       {.role = "user", .content = user_prompt.c_str()},
+   }};
+   constexpr std::size_t min_template_buffer_size = 1024;
+   std::vector<char> buffer(
+       std::max<std::size_t>(min_template_buffer_size,
+                             (system_prompt.size() + user_prompt.size()) * 4));
+   auto apply_template_with_resize = [&](const char* tmpl,
+                                         const llama_chat_message* chat_messages,
+                                         int32_t message_count) -> int32_t {
+     int32_t result = llama_chat_apply_template(
+         tmpl, chat_messages, message_count, true, buffer.data(),
+         static_cast<int32_t>(buffer.size()));
+     if (result < 0) {
+       return result;
+     }
+     const auto buffer_size = static_cast<int32_t>(buffer.size());
+     if (result >= buffer_size) {
+       buffer.resize(static_cast<std::size_t>(result) + 1);
+       result = llama_chat_apply_template(
+           tmpl, chat_messages, message_count, true, buffer.data(),
+           static_cast<int32_t>(buffer.size()));
+     }
+     return result;
+   };
+   int32_t template_result =
+       apply_template_with_resize(template_str, messages.data(), 2);
+   if (template_result >= 0) {
+     return {buffer.data(), static_cast<size_t>(template_result)};
+   }
+   spdlog::warn(
+       "LlamaGenerator: chat template rejected system/user messages (result "
+       "{}); trying single user fallback",
+       template_result);
+   // FALLBACK: If the template fails (e.g., model rejecting the "system" role),
+   // combine the system and user prompts into a single "user" message.
+   const std::array<llama_chat_message, 1> fallback_msg = {{
+       {.role = "user", .content = combined_prompt.c_str()},
+   }};
+   template_result =
+       apply_template_with_resize(template_str, fallback_msg.data(), 1);
+   // Ultimate fallback: if GGUF template parsing still fails, use raw text.
+   if (template_result < 0) {
+     spdlog::warn(
+         "LlamaGenerator: chat template fallback failed (result {}); using "
+         "raw prompt text",
+         template_result);
+     return combined_prompt;
+   }
+   return {buffer.data(), static_cast<size_t>(template_result)};
 }
 void AppendTokenPiece(const llama_vocab* vocab, llama_token token,