Begin work on Runpod docker config

Aaron Po
2026-05-03 23:32:08 -04:00
parent 26635ace84
commit 97b2ffeae4
16 changed files with 402 additions and 90 deletions


@@ -18,6 +18,7 @@ descriptions via a local GGUF model or a deterministic mock.
 - [Build](#build)
 - [Model](#model)
 - [Run](#run)
+- [Docker / RunPod](#docker--runpod)
 - [Architecture](#architecture)
 - [Pipeline Stages](#pipeline-stages)
 - [Key Components](#key-components)
@@ -51,7 +52,7 @@ step.
 ### Build
-Requirements: C++20 compiler, CMake 3.24+, libcurl, Boost (JSON and
+Requirements: C++20 compiler, CMake 3.31+, OpenSSL, Boost (JSON and
 ProgramOptions). SQLite is fetched from the upstream amalgamation, so no system
 SQLite package is required.
@@ -60,6 +61,16 @@ cmake -S . -B build
 cmake --build build
 ```
+CMake automatically detects whether a compatible llama.cpp installation is
+present on the system (`libllama`, `libggml`, `libggml-base`, and `llama.h`
+visible on the default search paths). If found, it links against those
+libraries and skips the FetchContent build. If not found, it fetches and builds
+llama.cpp from source at tag `b9012`. No additional flags are required in
+either case.
+Metal is enabled automatically on Apple Silicon. CUDA or HIP/ROCm is detected
+automatically on Linux when the relevant toolkit is present.
 ### Model
 > Skip this step if you only need `--mocked`.
@@ -74,33 +85,124 @@ curl -L \
 ### Run
 Run from `build/` so the copied `locations.json` and `prompts/` are available.
-Each run also writes a fresh dated SQLite file such as
-`biergarten_seed_2026-04-19T15-30-45.123456Z.sqlite` into the working directory.
+Each run writes a fresh dated SQLite file such as
+`biergarten_seed_2026-04-19T15-30-45.123456Z.sqlite` into the `--output`
+directory (default: `output`).
 ```bash
 ./biergarten-pipeline --mocked
-./biergarten-pipeline --model models/google_gemma-4-E4B-it-Q6_K.gguf --temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1
+./biergarten-pipeline \
+  --model ../models/google_gemma-4-E4B-it-Q6_K.gguf \
+  --prompt-dir prompts \
+  --temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1
 ```
 #### CLI Flags
 | Flag | Purpose |
-| --------------- | ------------------------------------------------------- |
+| --------------- | ---------------------------------------------------------------------------------------------------- |
 | `--mocked` | Deterministic mock generator, no model required. |
 | `--model, -m` | Path to a GGUF file. Required unless `--mocked` is set. |
+| `--prompt-dir` | Directory containing prompt files (e.g. `BREWERY_GENERATION.md`). Required unless `--mocked` is set. |
+| `--output, -o` | Directory for generated SQLite artifacts. Default: `output`. |
+| `--log-path` | Path for application logs. Default: `pipeline.log`. |
 | `--temperature` | Sampling temperature. Default: `1.0`. |
 | `--top-p` | Nucleus sampling. Default: `0.95`. |
 | `--top-k` | Top-k sampling. Default: `64`. |
 | `--n-ctx` | Context window size. Default: `8192`. |
 | `--seed` | Random seed. Default: `-1` (random at runtime). |
+| `--n-gpu-layers` | Number of model layers to offload to the GPU. Default: `0` (none). |
 | `--help, -h` | Print usage and exit. |
 `--mocked` and `--model` are mutually exclusive. Omitting both exits with an
 error before the pipeline starts. Sampling flags are ignored when `--mocked` is
 set.
 The post-build step copies `prompts/` into `build/prompts/`. Rebuild after
-editing `prompts/system.md`.
+editing any prompt file.
+---
+## Docker / RunPod
+The `tooling/pipeline/runpod/` directory contains a GPU-ready container
+configuration for running the pipeline on RunPod or any Docker host with an
+NVIDIA GPU.
+### How it works
+The container uses a two-stage build. The first stage pulls prebuilt
+`libllama`, `libggml`, and backend plugin libraries (including `libggml-cuda.so`
+and the CPU variant plugins) from `ghcr.io/ggml-org/llama.cpp:full-cuda`. The
+second stage copies those libraries into `/usr/local/lib` and runs `ldconfig` so
+the dynamic linker and `dlopen` calls from `ggml_backend_load_all()` can resolve
+the CUDA backend plugin at runtime. The llama.cpp and ggml headers are
+downloaded from the tag `b9012` source tarball and installed into
+`/usr/local/include`. CMake detects both and skips the FetchContent source
+build entirely, which keeps image build times short.
+After the build, the Dockerfile also copies whichever `libggml-*` backend
+plugins are present (CUDA, CPU variants, BLAS, RPC) into `build/` next to the
+executable, because `ggml_backend_load_all()` searches the executable's
+directory for plugins. The live-mode example below additionally points
+`GGML_BACKEND_PATH` at the CUDA plugin file so that it is loaded explicitly.
+### Build the image
+Run from the `tooling/pipeline/` directory (the CMake project root), not from
+inside `runpod/`, so the `COPY . .` step picks up the full project context.
+```bash
+docker build -t biergarten-pipeline:latest -f runpod/Dockerfile .
+```
+To monitor the full build output and confirm CMake selects the system llama.cpp:
+```bash
+docker build \
+  --progress=plain \
+  --no-cache \
+  -t biergarten-pipeline:latest \
+  -f runpod/Dockerfile \
+  . 2>&1 | tee build.log
+```
+Look for `[biergarten] Found system llama.cpp — skipping FetchContent` in the
+output to confirm the fast path was taken.
+### Run in mocked mode
+No model or GPU required. Useful for validating the pipeline logic and SQLite
+export path.
+```bash
+docker run --rm \
+  -e BIERGARTEN_MODE=mocked \
+  -v "$PWD/output:/workspace/output" \
+  -v "$PWD/logs:/workspace/logs" \
+  biergarten-pipeline:latest
+```
+### Run in live mode
+Mount your GGUF model before starting. The launcher checks that the model file
+exists and exits with an error, without starting the binary, if it does not.
+```bash
+docker run --rm \
+  --runtime=nvidia \
+  -e BIERGARTEN_MODE=live \
+  -e GGML_BACKEND_PATH="/usr/local/lib/libggml-cuda.so" \
+  -v "$PWD/models:/workspace/models" \
+  -v "$PWD/output:/workspace/output" \
+  -v "$PWD/logs:/workspace/logs" \
+  biergarten-pipeline:latest
+```
+The model must be present at `./models/google_gemma-4-E4B-it-Q6_K.gguf` on the
+host, unless `BIERGARTEN_MODEL_PATH` points the launcher at a different path
+inside the container. See [Model](#model) above for the download command.
+### RunPod deployment
+Use a GPU pod template. Mount persistent storage for `/workspace/models`,
+`/workspace/output`, and `/workspace/logs`. Set `BIERGARTEN_MODE=live` in the
+template environment. See `tooling/pipeline/runpod/pod-template.yaml` for a
+starter template.
 ---
@@ -197,16 +299,18 @@ code, latitude, and longitude for each entry.
 ## Tech Stack
 - C++20
-- CMake 3.24+
+- CMake 3.31+
 - Boost.JSON, Boost.ProgramOptions, Boost.DI
 - spdlog
-- libcurl
+- cpp-httplib (with OpenSSL)
 - SQLite amalgamation fetched and compiled via CMake FetchContent
-- llama.cpp
+- llama.cpp (auto-detected from system install or fetched via FetchContent)
+- Docker with NVIDIA CUDA 12.6 base image for GPU container builds
+- RunPod for cloud GPU inference
-The build fetches Boost.DI, spdlog, llama.cpp, and SQLite via CMake. Metal is
-enabled on Apple Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit
-is present.
+The build fetches Boost.DI, spdlog, and SQLite via CMake. llama.cpp is fetched
+only when a system installation is not detected. Metal is enabled on Apple
+Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
 > **Code Style:** Modern C++20 throughout — RAII for ownership,
 > `std::unique_ptr` for injected dependencies, `std::optional` for parse
@@ -218,7 +322,7 @@ is present.
 ## Tested Hardware
-### ARM macOS - M1 Pro
+### ARM macOS M1 Pro
 | | |
 | --------- | --------------------------------- |
@@ -229,7 +333,7 @@ is present.
 | Model | Gemma 4 E4B |
 | Inference | llama.cpp with Metal |
-### x86_64 Linux - NVIDIA RTX 2000
+### x86_64 Linux NVIDIA RTX 2000
 | | |
 | --------- | ------------------------------ |
@@ -240,6 +344,15 @@ is present.
 | Model | Gemma 4 E4B |
 | Inference | llama.cpp with CUDA 12.x |
+### x86_64 Linux — Docker / RunPod (NVIDIA CUDA)
+| | |
+| --------- | ------------------------------------------- |
+| Host | RunPod GPU pod |
+| Base | nvidia/cuda:12.6.3-devel-ubuntu24.04 |
+| Model | Gemma 4 E4B Q6_K |
+| Inference | llama.cpp prebuilt CUDA backends via dlopen |
 ---
 ## Fixture Strategy
@@ -260,8 +373,9 @@ is present.
 | `includes/` | Public headers and shared models. |
 | `src/` | Implementation files. |
 | `locations.json` | Curated city input copied into the build tree. |
-| `prompts/` | System prompt used by the model-backed path. |
+| `prompts/` | System prompts used by the model-backed path. |
 | `diagrams/` | Architecture and pipeline diagrams. |
+| `tooling/pipeline/runpod/` | Dockerfile, launcher, and RunPod pod template. |
 | `ETHICS-AND-KNOWN-ISSUES.md` | Ethics, bias, hallucination analysis, mitigations. |
 ---
@@ -276,6 +390,7 @@ is present.
 - `src/data_generation/llama/` — local inference, prompt loading, output
   validation.
 - `src/data_generation/mock/` — deterministic fallback.
+- `tooling/pipeline/runpod/` — container build and runtime launcher.
 ---
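The README's run notes say each run writes a dated SQLite file such as `biergarten_seed_2026-04-19T15-30-45.123456Z.sqlite`. Here is a minimal sketch of that naming scheme. It assumes UTC timestamps with microsecond precision and dashes in place of the time's colons. `MakeSeedFilename` is a hypothetical helper, not the pipeline's own implementation, and it relies on POSIX `gmtime_r`, which is in line with the project's Linux/macOS-only targets.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not the pipeline's code): formats a UTC time point as
// biergarten_seed_YYYY-MM-DDTHH-MM-SS.ffffffZ.sqlite. Dashes replace the
// ISO 8601 colons so the name is also valid on filesystems that forbid ':'.
std::string MakeSeedFilename(std::chrono::system_clock::time_point tp) {
  using namespace std::chrono;
  const auto since_epoch = duration_cast<microseconds>(tp.time_since_epoch());
  assert(since_epoch.count() >= 0 && "pre-1970 timestamps are not supported");
  const auto whole_seconds = duration_cast<seconds>(since_epoch);
  const auto micros = (since_epoch - whole_seconds).count();  // 0..999999

  const std::time_t t = static_cast<std::time_t>(whole_seconds.count());
  std::tm utc{};
  gmtime_r(&t, &utc);  // POSIX, thread-safe variant of std::gmtime

  std::ostringstream out;
  out << "biergarten_seed_" << std::put_time(&utc, "%Y-%m-%dT%H-%M-%S") << '.'
      << std::setw(6) << std::setfill('0') << micros << "Z.sqlite";
  return out.str();
}
```

Because the name carries microseconds, two runs started within the same second are very unlikely to write to the same file.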


@@ -29,7 +29,7 @@ if (Are arguments valid?) then (no)
 else (yes)
 endif
-:Init CurlGlobalState & LlamaBackendState;
+:Init OpenSSL global state & LlamaBackendState;
 :di::make_injector(...);
 :injector.create<std::unique_ptr<BiergartenDataGenerator>>();
 :BiergartenDataGenerator::Run();


@@ -52,7 +52,7 @@ interface WebClient <<interface>> {
   + UrlEncode(value : const std::string&) : std::string
 }
-class CURLWebClient {
+class HttpWebClient {
   + Get(url : const std::string&) : std::string
   + UrlEncode(value : const std::string&) : std::string
 }
@@ -130,7 +130,7 @@ BiergartenDataGenerator *-- IExportService : owns
 IEnrichmentService <|.. WikipediaService : implements
 WikipediaService *-- WebClient : owns
-WebClient <|.. CURLWebClient : implements
+WebClient <|.. HttpWebClient : implements
 DataGenerator <|.. MockGenerator : implements
 DataGenerator <|.. LlamaGenerator : implements


@@ -13,7 +13,7 @@ if (Invalid args?) then (yes)
   stop
 else (no)
 endif
-:Init CurlGlobalState & LlamaBackendState;
+:Init OpenSSL global state & LlamaBackendState;
 :Build DI injector;
 :Initialize SqliteExportService;


@@ -356,7 +356,7 @@ package "Infrastructure: Enrichment" {
     + UrlEncode(value : const std::string&) : std::string
   }
-  class CURLWebClient {
+  class HttpWebClient {
     + Get(url : const std::string&) : std::string
     + UrlEncode(value : const std::string&) : std::string
   }
@@ -520,7 +520,7 @@ CheckinDistributionStrategy <|.. RandomCheckinStrategy
 FollowGenerationStrategy <|.. RandomFollowStrategy
 FollowGenerationStrategy <|.. ActivityWeightedFollowStrategy
 EnrichmentService <|.. WikipediaService
-WebClient <|.. CURLWebClient
+WebClient <|.. HttpWebClient
 DataGenerator <|.. MockGenerator
 DataGenerator <|.. LlamaGenerator
 PromptFormatter <|.. Gemma4JinjaPromptFormatter


@@ -0,0 +1,9 @@
build/
cmake-build-debug/
.git/
.idea/
**/*.sqlite
**/*.log
**/*.sqlite3
**/*.db
models/
**/*.gguf
output/
logs/


@@ -1,41 +1,45 @@
 cmake_minimum_required(VERSION 3.31)
 project(biergarten-pipeline)
+# CMP0169 (CMake 3.30+) deprecates calling FetchContent_Populate() with only a
+# declared dependency name. Setting it to OLD keeps that form available so the
+# header-only Boost.DI can be populated without running its CMakeLists.txt.
+cmake_policy(SET CMP0169 OLD)
 # 1. Build Options
 option(BIERGARTEN_MOCK_ONLY "Build with mock data generators only — skips llama.cpp" OFF)
-if (BIERGARTEN_MOCK_ONLY)
+if(BIERGARTEN_MOCK_ONLY)
     message(STATUS "[biergarten] MOCK_ONLY build — llama.cpp will not be compiled.")
-endif ()
+endif()
 # 2. Platform & GPU Detection
-if (NOT UNIX)
+if(NOT UNIX)
     message(FATAL_ERROR "[biergarten] Windows is not supported. Please use Linux (Fedora 43) or macOS (M1 Pro).")
-endif ()
+endif()
-if (APPLE)
+if(APPLE)
-    if (CMAKE_SYSTEM_PROCESSOR MATCHES "arm64")
+    if(CMAKE_SYSTEM_PROCESSOR MATCHES "arm64")
         message(STATUS "[biergarten] Apple Silicon detected — enabling Metal acceleration.")
         set(GGML_METAL ON CACHE BOOL "Enable Metal for Apple Silicon" FORCE)
-    else ()
+    else()
         message(STATUS "[biergarten] Intel Mac detected — using CPU / Accelerate framework.")
         set(GGML_METAL OFF CACHE BOOL "Disable Metal for Intel Macs" FORCE)
-    endif ()
+    endif()
-else ()
+else()
     find_package(CUDAToolkit QUIET)
     find_package(hip CONFIG QUIET)
-    if (CUDAToolkit_FOUND)
+    if(CUDAToolkit_FOUND)
         message(STATUS "[biergarten] NVIDIA GPU detected — enabling CUDA acceleration.")
         set(GGML_CUDA ON CACHE BOOL "Enable CUDA for NVIDIA GPUs" FORCE)
         set(CMAKE_CUDA_ARCHITECTURES native)
-    elseif (hip_FOUND OR DEFINED ENV{ROCM_PATH} OR EXISTS "/opt/rocm")
+    elseif(hip_FOUND OR DEFINED ENV{ROCM_PATH} OR EXISTS "/opt/rocm")
         message(STATUS "[biergarten] AMD GPU detected — enabling HIP/ROCm acceleration.")
         set(GGML_HIPBLAS ON CACHE BOOL "Enable HIP for AMD GPUs" FORCE)
-    else ()
+    else()
         message(STATUS "[biergarten] No NVIDIA or AMD GPU found — falling back to CPU.")
-    endif ()
+    endif()
-endif ()
+endif()
 # 3. Project-wide Settings
 set(CMAKE_CXX_STANDARD 20)
@@ -51,16 +55,23 @@ include(FetchContent)
 find_package(Boost REQUIRED COMPONENTS json program_options)
 # Boost.DI (unofficial Boost extension, must declare separately from main Boost dependency)
+# Header-only library, so we only fetch without invoking its CMakeLists.txt
 FetchContent_Declare(
     boost-di
     GIT_REPOSITORY https://github.com/boost-ext/di.git
     GIT_TAG v1.3.0
+    GIT_SHALLOW TRUE
 )
-FetchContent_MakeAvailable(boost-di)
-if (TARGET Boost.DI AND NOT TARGET boost::di)
-    add_library(boost::di ALIAS Boost.DI)
-endif ()
+FetchContent_GetProperties(boost-di)
+if(NOT boost-di_POPULATED)
+    FetchContent_Populate(boost-di)
+endif()
+add_library(boost_di INTERFACE)
+add_library(boost::di ALIAS boost_di)
+target_include_directories(boost_di INTERFACE
+    $<BUILD_INTERFACE:${boost-di_SOURCE_DIR}/include>
+)
 # SQLite amalgamation
 FetchContent_Declare(
     sqlite_amalgamation
@@ -69,21 +80,38 @@ FetchContent_Declare(
     EXCLUDE_FROM_ALL
 )
 FetchContent_MakeAvailable(sqlite_amalgamation)
-if (NOT TARGET sqlite3)
+if(NOT TARGET sqlite3)
     add_library(sqlite3 STATIC ${sqlite_amalgamation_SOURCE_DIR}/sqlite3.c)
     target_include_directories(sqlite3 PUBLIC ${sqlite_amalgamation_SOURCE_DIR})
     target_compile_definitions(sqlite3 PUBLIC SQLITE_THREADSAFE=1)
-endif ()
+endif()
 # llama.cpp — skipped for mock-only builds
-if (NOT BIERGARTEN_MOCK_ONLY)
-    FetchContent_Declare(
-        llama-cpp
-        GIT_REPOSITORY https://github.com/ggml-org/llama.cpp.git
-        GIT_TAG b8742
-    )
-    FetchContent_MakeAvailable(llama-cpp)
-endif ()
+if(NOT BIERGARTEN_MOCK_ONLY)
+    find_library(LLAMA_LIB NAMES llama)
+    find_library(GGML_LIB NAMES ggml)
+    find_library(GGML_BASE_LIB NAMES ggml-base)
+    find_path(LLAMA_INC_DIR NAMES llama.h PATH_SUFFIXES include)
+
+    if(LLAMA_LIB AND GGML_LIB AND GGML_BASE_LIB AND LLAMA_INC_DIR)
+        message(STATUS "[biergarten] Found system llama.cpp — skipping FetchContent")
+        add_library(llama SHARED IMPORTED)
+        set_target_properties(llama PROPERTIES
+            IMPORTED_LOCATION "${LLAMA_LIB}"
+            INTERFACE_INCLUDE_DIRECTORIES "${LLAMA_INC_DIR}"
+            INTERFACE_LINK_LIBRARIES "${GGML_LIB};${GGML_BASE_LIB}"
+        )
+    else()
+        message(STATUS "[biergarten] System llama.cpp not found — fetching via FetchContent")
+        FetchContent_Declare(
+            llama-cpp
+            GIT_REPOSITORY https://github.com/ggml-org/llama.cpp.git
+            GIT_TAG b9012
+        )
+        FetchContent_MakeAvailable(llama-cpp)
+    endif()
+endif()
 # spdlog
 FetchContent_Declare(
@@ -153,16 +181,16 @@ target_sources(${PROJECT_NAME} PRIVATE
 )
 # --- data_generation: llama (skipped for mock-only builds) ---
-if (NOT BIERGARTEN_MOCK_ONLY)
+if(NOT BIERGARTEN_MOCK_ONLY)
     target_sources(${PROJECT_NAME} PRIVATE
         src/data_generation/llama/load.cc
         src/data_generation/llama/helpers.cc
         src/data_generation/llama/generate_brewery.cc
         src/data_generation/llama/infer.cc
         src/data_generation/llama/llama_generator.cc
         src/data_generation/llama/generate_user.cc
     )
-endif ()
+endif()
 # --- services: wikipedia ---
 target_sources(${PROJECT_NAME} PRIVATE
@@ -189,8 +217,6 @@ target_sources(${PROJECT_NAME} PRIVATE
 # 6. Include Directories, Link Libraries & Compile Definitions
 target_include_directories(${PROJECT_NAME} PRIVATE
     includes
-    $<$<NOT:$<BOOL:${BIERGARTEN_MOCK_ONLY}>>:${llama-cpp_SOURCE_DIR}/include>
-    $<$<NOT:$<BOOL:${BIERGARTEN_MOCK_ONLY}>>:${llama-cpp_SOURCE_DIR}/common>
 )
 target_link_libraries(${PROJECT_NAME} PRIVATE
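The CMake block above takes the system llama.cpp fast path only when all four artifacts are found. Otherwise it falls back to FetchContent. To help diagnose an unexpected fallback, here is a hypothetical helper, `MissingLlamaArtifacts`, that reports which of the four are absent from a set of file names visible on the search paths. It is an approximation. Real `find_library()` also matches static archives and honors `CMAKE_PREFIX_PATH` and other search hints, and this sketch ignores both.

```cpp
#include <cassert>
#include <initializer_list>
#include <set>
#include <string>
#include <vector>

// Hypothetical diagnostic helper approximating the CMake fast-path rule:
// the system llama.cpp is used only when libllama, libggml, libggml-base,
// and llama.h are all found. `found` holds file names visible on the
// library and include search paths. Only shared-library names (.so or
// .dylib) are considered here.
std::vector<std::string> MissingLlamaArtifacts(const std::set<std::string>& found) {
  std::vector<std::string> missing;
  for (const std::string lib : {"llama", "ggml", "ggml-base"}) {
    const bool has_shared = found.count("lib" + lib + ".so") > 0 ||
                            found.count("lib" + lib + ".dylib") > 0;
    if (!has_shared) missing.push_back("lib" + lib);
  }
  if (found.count("llama.h") == 0) missing.push_back("llama.h");
  return missing;  // empty means CMake would link the system libraries
}
```

An empty result corresponds to the `Found system llama.cpp` status message. Any non-empty result corresponds to the FetchContent fallback.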


@@ -14,10 +14,10 @@
 #include <string>
 #include <string_view>
+#include "../services/prompting/prompt_directory.h"
 #include "data_generation/data_generator.h"
 #include "data_generation/prompt_formatting/prompt_formatter.h"
 #include "data_model/models.h"
-#include "../services/prompting/prompt_directory.h"
 struct llama_model;
 struct llama_context;
@@ -129,6 +129,7 @@ class LlamaGenerator final : public DataGenerator {
   uint32_t sampling_top_k_ = kDefaultSamplingTopK;
   std::mt19937 rng_;
   uint32_t n_ctx_ = kDefaultContextSize;
+  int n_gpu_layers_ = 0;
   std::unique_ptr<IPromptFormatter> prompt_formatter_;
   std::unique_ptr<IPromptDirectory> prompt_directory_;
 };


@@ -3,7 +3,8 @@
 /**
  * @file data_model/models.h
- * @brief Core data models: locations, application configuration, and generation inputs.
+ * @brief Core data models: locations, application configuration, and generation
+ * inputs.
  */
 #include <boost/program_options.hpp>
@@ -94,6 +95,9 @@ struct GeneratorOptions {
   /// @brief Use mocked generator instead of actual LLM inference.
   bool use_mocked = false;
+  /// @brief Number of layers to offload to GPU.
+  int n_gpu_layers = 0;
   /// @brief Specific sampling parameters for this generator.
   /// If nullopt, the application should use global defaults.
   std::optional<SamplingOptions> sampling;


@@ -0,0 +1,67 @@
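# Stage 1: prebuilt llama.cpp shared libraries and ggml backend plugins.
# Note: `full-cuda` is a moving tag, while the headers fetched below are pinned
# to b9012; keep the two in sync to avoid header/library mismatches.
FROM ghcr.io/ggml-org/llama.cpp:full-cuda AS llama-bin
# Stage 2: build environment, also used as the runtime image
FROM nvidia/cuda:12.6.3-devel-ubuntu24.04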
ENV DEBIAN_FRONTEND=noninteractive \
CMAKE_GENERATOR=Ninja \
APP_ROOT=/workspace/app \
BUILD_DIR=/workspace/app/build
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
ca-certificates \
curl \
git \
libboost-json-dev \
libboost-program-options-dev \
libssl-dev \
ninja-build \
pkg-config \
zlib1g-dev \
&& rm -rf /var/lib/apt/lists/*
# Install CMake 3.31 from the upstream release: Ubuntu 24.04's apt package
# (3.28) is older than the 3.31 minimum required by CMakeLists.txt.
RUN curl -L https://github.com/Kitware/CMake/releases/download/v3.31.0/cmake-3.31.0-linux-x86_64.sh -o cmake.sh && \
sh cmake.sh --skip-license --prefix=/usr/local && rm cmake.sh
# Copy backends to /usr/local/lib and register with ldconfig so the
# runtime linker can resolve libllama.so, libggml.so, libggml-base.so etc.
COPY --from=llama-bin /app/lib*.so* /usr/local/lib/
RUN ldconfig
# llama.cpp and ggml headers for the C++ build, pinned to tag b9012
RUN curl -L https://github.com/ggml-org/llama.cpp/archive/refs/tags/b9012.tar.gz -o /tmp/llama-src.tar.gz && \
tar -xzf /tmp/llama-src.tar.gz -C /tmp && \
cp -r /tmp/llama.cpp-b9012/include/* /usr/local/include/ && \
cp -r /tmp/llama.cpp-b9012/ggml/include/* /usr/local/include/ && \
rm -rf /tmp/llama-src.tar.gz /tmp/llama.cpp-b9012
ENV LD_LIBRARY_PATH="/usr/local/lib:${LD_LIBRARY_PATH}"
WORKDIR /workspace/app
COPY . .
# Build the C++ pipeline
RUN cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && \
cmake --build build -j$(nproc)
# Co-locate GGML backend plugins with the executable.
# ggml_backend_load_all() includes the executable's directory in its plugin
# search path, so copying the libggml-*.so plugin files here lets the loader
# find them without extra configuration.
# libllama.so, libggml.so, and libggml-base.so are NOT copied here; those are
# regular shared libraries resolved via ldconfig/LD_LIBRARY_PATH.
RUN cp /usr/local/lib/libggml-cuda.so /workspace/app/build/ 2>/dev/null || true && \
cp /usr/local/lib/libggml-cpu*.so /workspace/app/build/ 2>/dev/null || true && \
cp /usr/local/lib/libggml-blas*.so /workspace/app/build/ 2>/dev/null || true && \
cp /usr/local/lib/libggml-rpc*.so /workspace/app/build/ 2>/dev/null || true
# Setup Start Script
COPY ./runpod/start.sh /usr/local/bin/biergarten-start
RUN chmod +x /usr/local/bin/biergarten-start
WORKDIR /workspace/app/build
ENTRYPOINT ["/usr/local/bin/biergarten-start"]
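The final `RUN` step above copies whichever `libggml-*` plugins exist next to the executable. Plugins for a backend are named either exactly `libggml-<backend>.so` or as a variant such as `libggml-cpu-haswell.so`. As a simplified illustration of that convention, the hypothetical `PluginCandidates` below filters a directory listing down to the plugin files for one backend. It is not ggml's loader. As far as we know, ggml additionally loads each CPU variant and keeps the one that reports the best score for the current machine.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper, not part of ggml: from a directory listing, keep the
// files that look like plugins for `backend`, i.e. "libggml-<backend>.so"
// or a variant "libggml-<backend>-<variant>.so". Results are sorted.
std::vector<std::string> PluginCandidates(const std::vector<std::string>& files,
                                          std::string_view backend) {
  assert(!backend.empty() && "backend name must not be empty");
  const std::string stem = "libggml-" + std::string(backend);
  const std::string ext = ".so";
  std::vector<std::string> out;
  for (const auto& f : files) {
    if (f.size() < stem.size() + ext.size()) continue;
    if (f.compare(0, stem.size(), stem) != 0) continue;                    // prefix
    if (f.compare(f.size() - ext.size(), ext.size(), ext) != 0) continue;  // suffix
    // Between the stem and ".so" there must be nothing or "-<variant>".
    const std::string middle =
        f.substr(stem.size(), f.size() - stem.size() - ext.size());
    if (middle.empty() || (middle.size() > 1 && middle.front() == '-')) {
      out.push_back(f);
    }
  }
  std::sort(out.begin(), out.end());
  return out;
}
```

The filter also explains why the Dockerfile leaves `libggml.so` and `libllama.so` alone: neither matches the `libggml-<backend>` plugin pattern for any backend.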


@@ -0,0 +1,8 @@
```bash
touch runpod/start.sh
docker build \
--progress=plain \
-t biergarten-pipeline:latest \
-f runpod/Dockerfile \
. 2>&1 | tee build.log
```


@@ -0,0 +1,22 @@
name: biergarten-pipeline-live
imageName: biergarten-pipeline:latest
category: NVIDIA
containerDiskInGb: 50
volumeInGb: 50
volumeMountPath: /workspace
dockerEntrypoint:
- /usr/local/bin/biergarten-start
dockerStartCmd: []
isPublic: false
isServerless: false
env:
BIERGARTEN_MODE: live
BIERGARTEN_MODEL_PATH: /workspace/models/google_gemma-4-E4B-it-Q6_K.gguf
BIERGARTEN_PROMPT_DIR: /workspace/app/build/prompts
BIERGARTEN_OUTPUT_DIR: /workspace/output
BIERGARTEN_LOG_PATH: /workspace/logs/pipeline.log
BIERGARTEN_TEMPERATURE: "1.0"
BIERGARTEN_TOP_P: "0.95"
BIERGARTEN_TOP_K: "64"
BIERGARTEN_N_CTX: "8192"
BIERGARTEN_SEED: "-1"


@@ -0,0 +1,49 @@
#!/bin/bash
set -e
# Configuration / Defaults (each value can be overridden via the environment)
MODE="${BIERGARTEN_MODE:-live}"
MODEL_PATH="${BIERGARTEN_MODEL_PATH:-/workspace/models/google_gemma-4-E4B-it-Q6_K.gguf}"
OUTPUT_DIR="${BIERGARTEN_OUTPUT_DIR:-/workspace/output}"
LOG_PATH="${BIERGARTEN_LOG_PATH:-/workspace/logs/pipeline.log}"
PROMPT_DIR="${BIERGARTEN_PROMPT_DIR:-/workspace/app/build/prompts}"
EXECUTABLE="/workspace/app/build/biergarten-pipeline"
echo "--- Starting Biergarten Pipeline Environment Check (mode: $MODE) ---"
# 1. Ensure volume mount directories exist
mkdir -p "$OUTPUT_DIR"
mkdir -p "$(dirname "$LOG_PATH")"
# 2. Build the command arguments shared by both modes
ARGS=(
  "--output" "$OUTPUT_DIR"
  "--log-path" "$LOG_PATH"
)
case "$MODE" in
  mocked)
    # Mocked mode needs neither a model file nor a prompt directory.
    ARGS+=("--mocked")
    ;;
  live)
    # 3. Check for the model file before launching a live run
    if [ ! -f "$MODEL_PATH" ]; then
      echo "ERROR: Model not found at $MODEL_PATH"
      echo "Current /workspace/models contents:"
      ls -lh /workspace/models 2>/dev/null || echo "(directory does not exist)"
      exit 1
    fi
    ARGS+=("--model" "$MODEL_PATH" "--prompt-dir" "$PROMPT_DIR")
    ;;
  *)
    echo "ERROR: BIERGARTEN_MODE must be 'live' or 'mocked' (got '$MODE')"
    exit 1
    ;;
esac
# Optional hyperparameters (sampling flags are ignored in mocked mode)
[[ -n "$BIERGARTEN_TEMPERATURE" ]] && ARGS+=("--temperature" "$BIERGARTEN_TEMPERATURE")
[[ -n "$BIERGARTEN_TOP_P" ]] && ARGS+=("--top-p" "$BIERGARTEN_TOP_P")
[[ -n "$BIERGARTEN_TOP_K" ]] && ARGS+=("--top-k" "$BIERGARTEN_TOP_K")
[[ -n "$BIERGARTEN_N_CTX" ]] && ARGS+=("--n-ctx" "$BIERGARTEN_N_CTX")
[[ -n "$BIERGARTEN_SEED" ]] && ARGS+=("--seed" "$BIERGARTEN_SEED")
[[ -n "$BIERGARTEN_GL_LAYERS" ]] && ARGS+=("--n-gpu-layers" "$BIERGARTEN_GL_LAYERS")
# Append any extra custom args. Intentionally unquoted so the value is split
# on whitespace into separate arguments.
if [[ -n "$BIERGARTEN_EXTRA_ARGS" ]]; then
  ARGS+=($BIERGARTEN_EXTRA_ARGS)
fi
echo "--- Executing: $EXECUTABLE ${ARGS[*]} ---"
# Execute the binary directly, replacing the shell process
exec "$EXECUTABLE" "${ARGS[@]}"
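The launcher forwards each optional `BIERGARTEN_*` variable as a CLI flag, but only when the variable is set to a non-empty value. It then splits `BIERGARTEN_EXTRA_ARGS` on whitespace. The mapping is restated in C++ below for clarity. This is a sketch only: `OptionalFlags` is a hypothetical name, and unlike bash it does no glob expansion or quote handling.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical C++ restatement of the launcher's optional-flag logic.
// A flag is emitted only when its variable is present and non-empty;
// BIERGARTEN_EXTRA_ARGS is split on whitespace, roughly like bash word
// splitting of an unquoted expansion.
std::vector<std::string> OptionalFlags(const std::map<std::string, std::string>& env) {
  static const std::vector<std::pair<std::string, std::string>> kMapping = {
      {"BIERGARTEN_TEMPERATURE", "--temperature"},
      {"BIERGARTEN_TOP_P", "--top-p"},
      {"BIERGARTEN_TOP_K", "--top-k"},
      {"BIERGARTEN_N_CTX", "--n-ctx"},
      {"BIERGARTEN_SEED", "--seed"},
      {"BIERGARTEN_GL_LAYERS", "--n-gpu-layers"},  // variable name as used by the launcher
  };
  std::vector<std::string> args;
  for (const auto& [var, flag] : kMapping) {
    const auto it = env.find(var);
    if (it != env.end() && !it->second.empty()) {
      args.push_back(flag);
      args.push_back(it->second);
    }
  }
  assert(args.size() % 2 == 0);  // every mapped flag carries exactly one value
  if (const auto it = env.find("BIERGARTEN_EXTRA_ARGS"); it != env.end()) {
    std::istringstream words(it->second);
    for (std::string word; words >> word;) args.push_back(word);
  }
  return args;
}
```

The launcher reads the GPU-offload setting from `BIERGARTEN_GL_LAYERS`, and the pod template does not set that variable. Live runs therefore use the parser default of `0` offloaded layers unless the variable is added.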


@@ -50,6 +50,8 @@ std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv) {
     opt("prompt-dir", prog_opts::value<std::string>()->default_value(""),
         "Directory containing named prompt files (e.g. BREWERY_GENERATION.md)."
         " Required when not using --mocked.");
+    opt("n-gpu-layers", prog_opts::value<int>()->default_value(0),
+        "Number of model layers to offload to the GPU (0 = none).");
   };
   add_sampling_options();
@@ -85,6 +87,7 @@ std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv) {
   const bool use_mocked = var_map["mocked"].as<bool>();
   const std::string model_path = var_map["model"].as<std::string>();
+  const int n_gpu_layers = var_map["n-gpu-layers"].as<int>();
   // Enforce mutual exclusivity before any further configuration is applied.
   if (use_mocked && !model_path.empty()) {
@@ -110,6 +113,7 @@ std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv) {
   options.generator.use_mocked = use_mocked;
   options.generator.model_path = model_path;
+  options.generator.n_gpu_layers = n_gpu_layers;
   // Only populate sampling config when the user explicitly overrides at
   // least one value. Leaving it as std::nullopt lets LlamaGenerator fall
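The parser enforces that `--mocked` and `--model` are mutually exclusive. The README adds two rules: omitting both is an error, and `--prompt-dir` is required unless `--mocked` is set. Below is a Boost-free sketch of those rules. `ValidateGeneratorFlags` is a hypothetical name that returns an error message or `std::nullopt`. The real checks live in `ParseArguments()` and may order and word their errors differently.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical, Boost-free restatement of the mode rules described in the
// README. Returns an error message, or std::nullopt if the flags are valid.
std::optional<std::string> ValidateGeneratorFlags(bool use_mocked,
                                                  const std::string& model_path,
                                                  const std::string& prompt_dir) {
  if (use_mocked && !model_path.empty()) {
    return "--mocked and --model are mutually exclusive";
  }
  if (!use_mocked && model_path.empty()) {
    return "either --mocked or --model is required";
  }
  if (!use_mocked && prompt_dir.empty()) {
    return "--prompt-dir is required when not using --mocked";
  }
  // In mocked mode the prompt directory and sampling flags are simply ignored.
  return std::nullopt;
}
```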


@@ -89,6 +89,7 @@ LlamaGenerator::LlamaGenerator(
   }
   n_ctx_ = sampling.n_ctx;
+  n_gpu_layers_ = options.generator.n_gpu_layers;
   this->Load(model_path);
 }


@@ -12,6 +12,7 @@
 #include <utility>
 #include "data_generation/llama_generator.h"
+#include "ggml-backend.h"
 #include "llama.h"
 // Maximum batch size for decode operations. Capping the batch prevents
@@ -22,7 +23,12 @@ void LlamaGenerator::Load(const std::string& model_path) {
   context_.reset();
   model_.reset();
-  const llama_model_params model_params = llama_model_default_params();
+  // Register ggml backends that ship as separate shared-library plugins
+  // (e.g. libggml-cuda.so) before loading the model, so layers can be
+  // offloaded to them.
+  ggml_backend_load_all();
+  llama_model_params model_params = llama_model_default_params();
+  model_params.n_gpu_layers = n_gpu_layers_;
   LlamaGenerator::ModelHandle loaded_model(
       llama_model_load_from_file(model_path.c_str(), model_params));
   if (!loaded_model) {