5 Commits

Author SHA1 Message Date
Aaron Po
271c6fa99f update docs 2026-05-01 18:30:38 -04:00
Aaron Po
316fda1775 codebase formatting 2026-05-01 17:40:37 -04:00
Aaron Po
91e18888fe readability updates: remove magic numbers, update comments 2026-05-01 17:38:16 -04:00
Aaron Po
9051f55114 add prompt dir app option 2026-05-01 12:25:05 -04:00
Aaron Po
01849062d5 Refactor ApplicationOptions to separate config concerns 2026-05-01 00:40:21 -04:00
71 changed files with 980 additions and 1393 deletions

2
.gitattributes vendored
View File

@@ -1 +1 @@
archive/** linguist-vendored archive/* linguist-vendored

View File

@@ -18,7 +18,6 @@ descriptions via a local GGUF model or a deterministic mock.
- [Build](#build) - [Build](#build)
- [Model](#model) - [Model](#model)
- [Run](#run) - [Run](#run)
- [Docker / RunPod](#docker--runpod)
- [Architecture](#architecture) - [Architecture](#architecture)
- [Pipeline Stages](#pipeline-stages) - [Pipeline Stages](#pipeline-stages)
- [Key Components](#key-components) - [Key Components](#key-components)
@@ -52,7 +51,7 @@ step.
### Build ### Build
Requirements: C++20 compiler, CMake 3.31+, OpenSSL, Boost (JSON and Requirements: C++20 compiler, CMake 3.24+, libcurl, Boost (JSON and
ProgramOptions). SQLite is fetched from the upstream amalgamation, so no system ProgramOptions). SQLite is fetched from the upstream amalgamation, so no system
SQLite package is required. SQLite package is required.
@@ -61,16 +60,6 @@ cmake -S . -B build
cmake --build build cmake --build build
``` ```
CMake automatically detects whether a compatible llama.cpp installation is
present on the system (`libllama`, `libggml`, `libggml-base`, and `llama.h`
visible on the default search paths). If found, it links against those
libraries and skips the FetchContent build. If not found, it fetches and builds
llama.cpp from source at tag `b9012`. No additional flags are required in
either case.
Metal is enabled automatically on Apple Silicon. CUDA or HIP/ROCm is detected
automatically on Linux when the relevant toolkit is present.
### Model ### Model
> Skip this step if you only need `--mocked`. > Skip this step if you only need `--mocked`.
@@ -85,124 +74,33 @@ curl -L \
### Run ### Run
Run from `build/` so the copied `locations.json` and `prompts/` are available. Run from `build/` so the copied `locations.json` and `prompts/` are available.
Each run writes a fresh dated SQLite file such as Each run also writes a fresh dated SQLite file such as
`biergarten_seed_2026-04-19T15-30-45.123456Z.sqlite` into the working directory. `biergarten_seed_2026-04-19T15-30-45.123456Z.sqlite` into the working directory.
```bash ```bash
./biergarten-pipeline --mocked ./biergarten-pipeline --mocked
./biergarten-pipeline --model models/google_gemma-4-E4B-it-Q6_K.gguf --temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1
./biergarten-pipeline \
--model ../models/google_gemma-4-E4B-it-Q6_K.gguf \
--prompt-dir prompts \
--temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1
``` ```
#### CLI Flags #### CLI Flags
| Flag | Purpose | | Flag | Purpose |
| --------------- | ---------------------------------------------------------------------------------------------------- | | --------------- | ------------------------------------------------------- |
| `--mocked` | Deterministic mock generator, no model required. | | `--mocked` | Deterministic mock generator, no model required. |
| `--model, -m` | Path to a GGUF file. Required unless `--mocked` is set. | | `--model, -m` | Path to a GGUF file. Required unless `--mocked` is set. |
| `--prompt-dir` | Directory containing prompt files (e.g. `BREWERY_GENERATION.md`). Required unless `--mocked` is set. | | `--temperature` | Sampling temperature. Default: `1.0`. |
| `--output, -o` | Directory for generated SQLite artifacts. Default: `output`. | | `--top-p` | Nucleus sampling. Default: `0.95`. |
| `--log-path` | Path for application logs. Default: `pipeline.log`. | | `--top-k` | Top-k sampling. Default: `64`. |
| `--temperature` | Sampling temperature. Default: `1.0`. | | `--n-ctx` | Context window size. Default: `8192`. |
| `--top-p` | Nucleus sampling. Default: `0.95`. | | `--seed` | Random seed. Default: `-1` (random at runtime). |
| `--top-k` | Top-k sampling. Default: `64`. | | `--help, -h` | Print usage and exit. |
| `--n-ctx` | Context window size. Default: `8192`. |
| `--seed` | Random seed. Default: `-1` (random at runtime). |
| `--help, -h` | Print usage and exit. |
`--mocked` and `--model` are mutually exclusive. Omitting both exits with an `--mocked` and `--model` are mutually exclusive. Omitting both exits with an
error before the pipeline starts. Sampling flags are ignored when `--mocked` is error before the pipeline starts. Sampling flags are ignored when `--mocked` is
set. set.
The post-build step copies `prompts/` into `build/prompts/`. Rebuild after The post-build step copies `prompts/` into `build/prompts/`. Rebuild after
editing any prompt file. editing `prompts/system.md`.
---
## Docker / RunPod
The `tooling/pipeline/runpod/` directory contains a GPU-ready container
configuration for running the pipeline on RunPod or any Docker host with an
NVIDIA GPU.
### How it works
The container uses a two-stage build. The first stage pulls prebuilt
`libllama`, `libggml`, and backend plugin libraries (including `libggml-cuda.so`
and the CPU variant plugins) from `ghcr.io/ggml-org/llama.cpp:full-cuda`. The
second stage copies those libraries into `/usr/local/lib` and runs `ldconfig` so
the dynamic linker and `dlopen` calls from `ggml_backend_load_all()` can resolve
the CUDA backend plugin at runtime. llama.cpp headers are cloned at the matching
tag and installed into `/usr/local/include`. CMake auto-detects both and skips
the FetchContent source build entirely, keeping image build times short.
`GGML_BACKEND_PATH` is set to `/usr/local/lib` so llama.cpp knows where to scan
for backend plugins.
### Build the image
Run from the `tooling/pipeline/` directory (the CMake project root), not from
inside `runpod/`, so the `COPY . .` step picks up the full project context.
```bash
docker build -t biergarten-pipeline:latest -f runpod/Dockerfile .
```
To monitor the full build output and confirm CMake selects the system llama.cpp:
```bash
docker build \
--progress=plain \
--no-cache \
-t biergarten-pipeline:latest \
-f runpod/Dockerfile \
. 2>&1 | tee build.log
```
Look for `[biergarten] Found system llama.cpp — skipping FetchContent` in the
output to confirm the fast path was taken.
### Run in mocked mode
No model or GPU required. Useful for validating the pipeline logic and SQLite
export path.
```bash
docker run --rm \
-e BIERGARTEN_MODE=mocked \
-v "$PWD/output:/workspace/output" \
-v "$PWD/logs:/workspace/logs" \
biergarten-pipeline:latest
```
### Run in live mode
Mount your GGUF model before starting. The container validates the model path
before launching the binary.
```bash
docker run --rm \
--runtime=nvidia \
-e BIERGARTEN_MODE=live \
-e GGML_BACKEND_PATH="/usr/local/lib/libggml-cuda.so" \
-v "$PWD/models:/workspace/models" \
-v "$PWD/output:/workspace/output" \
-v "$PWD/logs:/workspace/logs" \
biergarten-pipeline:latest
```
The model must be present at `./models/google_gemma-4-E4B-it-Q6_K.gguf` on the
host. See [Model](#model) above for the download command.
### RunPod deployment
Use a GPU pod template. Mount persistent storage for `/workspace/models`,
`/workspace/output`, and `/workspace/logs`. Set `BIERGARTEN_MODE=live` in the
template environment. See `tooling/pipeline/runpod/pod-template.yaml` for a
starter template.
--- ---
@@ -299,18 +197,16 @@ code, latitude, and longitude for each entry.
## Tech Stack ## Tech Stack
- C++20 - C++20
- CMake 3.31+ - CMake 3.24+
- Boost.JSON, Boost.ProgramOptions, Boost.DI - Boost.JSON, Boost.ProgramOptions, Boost.DI
- spdlog - spdlog
- cpp-httplib (with OpenSSL) - libcurl
- SQLite amalgamation fetched and compiled via CMake FetchContent - SQLite amalgamation fetched and compiled via CMake FetchContent
- llama.cpp (auto-detected from system install or fetched via FetchContent) - llama.cpp
- Docker with NVIDIA CUDA 12.6 base image for GPU container builds
- RunPod for cloud GPU inference
The build fetches Boost.DI, spdlog, and SQLite via CMake. llama.cpp is fetched The build fetches Boost.DI, spdlog, llama.cpp, and SQLite via CMake. Metal is
only when a system installation is not detected. Metal is enabled on Apple enabled on Apple Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit
Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present. is present.
> **Code Style:** Modern C++20 throughout — RAII for ownership, > **Code Style:** Modern C++20 throughout — RAII for ownership,
> `std::unique_ptr` for injected dependencies, `std::optional` for parse > `std::unique_ptr` for injected dependencies, `std::optional` for parse
@@ -322,7 +218,7 @@ Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
## Tested Hardware ## Tested Hardware
### ARM macOS M1 Pro ### ARM macOS - M1 Pro
| | | | | |
| --------- | --------------------------------- | | --------- | --------------------------------- |
@@ -333,7 +229,7 @@ Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
| Model | Gemma 4 E4B | | Model | Gemma 4 E4B |
| Inference | llama.cpp with Metal | | Inference | llama.cpp with Metal |
### x86_64 Linux NVIDIA RTX 2000 ### x86_64 Linux - NVIDIA RTX 2000
| | | | | |
| --------- | ------------------------------ | | --------- | ------------------------------ |
@@ -344,15 +240,6 @@ Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
| Model | Gemma 4 E4B | | Model | Gemma 4 E4B |
| Inference | llama.cpp with CUDA 12.x | | Inference | llama.cpp with CUDA 12.x |
### x86_64 Linux — Docker / RunPod (NVIDIA CUDA)
| | |
| --------- | ------------------------------------------- |
| Host | RunPod GPU pod |
| Base | nvidia/cuda:12.6.3-devel-ubuntu24.04 |
| Model | Gemma 4 E4B Q6_K |
| Inference | llama.cpp prebuilt CUDA backends via dlopen |
--- ---
## Fixture Strategy ## Fixture Strategy
@@ -373,9 +260,8 @@ Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
| `includes/` | Public headers and shared models. | | `includes/` | Public headers and shared models. |
| `src/` | Implementation files. | | `src/` | Implementation files. |
| `locations.json` | Curated city input copied into the build tree. | | `locations.json` | Curated city input copied into the build tree. |
| `prompts/` | System prompts used by the model-backed path. | | `prompts/` | System prompt used by the model-backed path. |
| `diagrams/` | Architecture and pipeline diagrams. | | `diagrams/` | Architecture and pipeline diagrams. |
| `tooling/pipeline/runpod/` | Dockerfile, launcher, and RunPod pod template. |
| `ETHICS-AND-KNOWN-ISSUES.md` | Ethics, bias, hallucination analysis, mitigations. | | `ETHICS-AND-KNOWN-ISSUES.md` | Ethics, bias, hallucination analysis, mitigations. |
--- ---
@@ -390,7 +276,6 @@ Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
- `src/data_generation/llama/` — local inference, prompt loading, output - `src/data_generation/llama/` — local inference, prompt loading, output
validation. validation.
- `src/data_generation/mock/` — deterministic fallback. - `src/data_generation/mock/` — deterministic fallback.
- `tooling/pipeline/runpod/` — container build and runtime launcher.
--- ---

View File

@@ -29,7 +29,7 @@ if (Are arguments valid?) then (no)
else (yes) else (yes)
endif endif
:Init OpenSSL global state & LlamaBackendState; :Init CurlGlobalState & LlamaBackendState;
:di::make_injector(...); :di::make_injector(...);
:injector.create<std::unique_ptr<BiergartenDataGenerator>>(); :injector.create<std::unique_ptr<BiergartenDataGenerator>>();
:BiergartenDataGenerator::Run(); :BiergartenDataGenerator::Run();

View File

@@ -52,7 +52,7 @@ interface WebClient <<interface>> {
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
class HttpWebClient { class CURLWebClient {
+ Get(url : const std::string&) : std::string + Get(url : const std::string&) : std::string
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
@@ -130,7 +130,7 @@ BiergartenDataGenerator *-- IExportService : owns
IEnrichmentService <|.. WikipediaService : implements IEnrichmentService <|.. WikipediaService : implements
WikipediaService *-- WebClient : owns WikipediaService *-- WebClient : owns
WebClient <|.. HttpWebClient : implements WebClient <|.. CURLWebClient : implements
DataGenerator <|.. MockGenerator : implements DataGenerator <|.. MockGenerator : implements
DataGenerator <|.. LlamaGenerator : implements DataGenerator <|.. LlamaGenerator : implements

View File

@@ -13,7 +13,7 @@ if (Invalid args?) then (yes)
stop stop
else (no) else (no)
endif endif
:Init OpenSSL global state & LlamaBackendState; :Init CurlGlobalState & LlamaBackendState;
:Build DI injector; :Build DI injector;
:Initialize SqliteExportService; :Initialize SqliteExportService;

View File

@@ -356,7 +356,7 @@ package "Infrastructure: Enrichment" {
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
class HttpWebClient { class CURLWebClient {
+ Get(url : const std::string&) : std::string + Get(url : const std::string&) : std::string
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
@@ -520,7 +520,7 @@ CheckinDistributionStrategy <|.. RandomCheckinStrategy
FollowGenerationStrategy <|.. RandomFollowStrategy FollowGenerationStrategy <|.. RandomFollowStrategy
FollowGenerationStrategy <|.. ActivityWeightedFollowStrategy FollowGenerationStrategy <|.. ActivityWeightedFollowStrategy
EnrichmentService <|.. WikipediaService EnrichmentService <|.. WikipediaService
WebClient <|.. HttpWebClient WebClient <|.. CURLWebClient
DataGenerator <|.. MockGenerator DataGenerator <|.. MockGenerator
DataGenerator <|.. LlamaGenerator DataGenerator <|.. LlamaGenerator
PromptFormatter <|.. Gemma4JinjaPromptFormatter PromptFormatter <|.. Gemma4JinjaPromptFormatter

View File

@@ -1,9 +0,0 @@
build/
cmake-build-debug/
.git/
.idea/
**/*.sqlite
**/*.log
**/*.sqlite3
**/*.db

View File

@@ -1,255 +1,178 @@
cmake_minimum_required(VERSION 3.31) cmake_minimum_required(VERSION 3.24)
project(biergarten-pipeline) project(biergarten-pipeline)
# Set policy to allow FetchContent_Populate for header-only libraries set(CMAKE_POLICY_VERSION_MINIMUM 3.5 CACHE STRING "" FORCE)
# that have outdated CMakeLists.txt files
cmake_policy(SET CMP0169 OLD)
# 1. Build Options # =============================================================================
# 1. Platform & GPU Detection
option(BIERGARTEN_MOCK_ONLY "Build with mock data generators only — skips llama.cpp" OFF) # =============================================================================
if(BIERGARTEN_MOCK_ONLY) if(WIN32)
message(STATUS "[biergarten] MOCK_ONLY build — llama.cpp will not be compiled.") message(FATAL_ERROR "[biergarten] Windows is currently not supported. Please use Linux (Fedora 43) or macOS (M1 Pro).")
endif()
# 2. Platform & GPU Detection
if(NOT UNIX)
message(FATAL_ERROR "[biergarten] Windows is not supported. Please use Linux (Fedora 43) or macOS (M1 Pro).")
endif() endif()
if(APPLE) if(APPLE)
if(CMAKE_SYSTEM_PROCESSOR MATCHES "arm64") if(CMAKE_SYSTEM_PROCESSOR MATCHES "arm64")
message(STATUS "[biergarten] Apple Silicon detected — enabling Metal acceleration.") message(STATUS "[biergarten] Apple Silicon detected — enabling Metal acceleration.")
set(GGML_METAL ON CACHE BOOL "Enable Metal for Apple Silicon" FORCE) set(GGML_METAL ON CACHE BOOL "Enable Metal for Apple Silicon" FORCE)
else() else()
message(STATUS "[biergarten] Intel Mac detected — using CPU / Accelerate framework.") message(STATUS "[biergarten] Intel Mac detected — using CPU / Accelerate framework.")
set(GGML_METAL OFF CACHE BOOL "Disable Metal for Intel Macs" FORCE) set(GGML_METAL OFF CACHE BOOL "Disable Metal for Intel Macs" FORCE)
endif() endif()
else() elseif(UNIX AND NOT APPLE)
find_package(CUDAToolkit QUIET) find_package(CUDAToolkit QUIET)
find_package(hip CONFIG QUIET) find_package(HIP QUIET)
if(CUDAToolkit_FOUND) if(CUDAToolkit_FOUND)
message(STATUS "[biergarten] NVIDIA GPU detected — enabling CUDA acceleration.") message(STATUS "[biergarten] NVIDIA GPU detected — enabling CUDA acceleration.")
set(GGML_CUDA ON CACHE BOOL "Enable CUDA for NVIDIA GPUs" FORCE) set(GGML_CUDA ON CACHE BOOL "Enable CUDA for NVIDIA GPUs" FORCE)
set(CMAKE_CUDA_ARCHITECTURES native) set(CMAKE_CUDA_ARCHITECTURES native)
elseif(hip_FOUND OR DEFINED ENV{ROCM_PATH} OR EXISTS "/opt/rocm") elseif(HIP_FOUND OR EXISTS "/opt/rocm")
message(STATUS "[biergarten] AMD GPU detected — enabling HIP/ROCm acceleration.") message(STATUS "[biergarten] AMD GPU detected — enabling HIP/ROCm acceleration.")
set(GGML_HIPBLAS ON CACHE BOOL "Enable HIP for AMD GPUs" FORCE) set(GGML_HIPBLAS ON CACHE BOOL "Enable HIP for AMD GPUs" FORCE)
else() else()
message(STATUS "[biergarten] No NVIDIA or AMD GPU found — falling back to CPU.") message(STATUS "[biergarten] No NVIDIA or AMD GPU found — falling back to CPU.")
endif() endif()
endif() endif()
# 3. Project-wide Settings # =============================================================================
# 2. Project-wide Settings (Standard & Optimization)
# =============================================================================
set(CMAKE_CXX_STANDARD 20) set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON) set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON) set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
# Release Build Optimization: Aggressive (-O3), Arch-specific, and LTO
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3 -march=native -flto") set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3 -march=native -flto")
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -Og -g") set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -Og -g")
# 4. Dependencies # =============================================================================
# 3. Dependencies
# =============================================================================
include(FetchContent) include(FetchContent)
# Boost (system install — via dnf/brew) find_package(CURL QUIET)
if(NOT CURL_FOUND)
message(FATAL_ERROR "[biergarten] libcurl not found. Install it (e.g. 'sudo dnf install libcurl-devel').")
endif()
# Require system Boost for JSON and Program Options to speed up build times
find_package(Boost REQUIRED COMPONENTS json program_options) find_package(Boost REQUIRED COMPONENTS json program_options)
# Boost.DI (unofficial Boost extension, must declare separately from main Boost dependency)
# Header-only library, so we only fetch without invoking its CMakeLists.txt
FetchContent_Declare( FetchContent_Declare(
boost-di sqlite_amalgamation
GIT_REPOSITORY https://github.com/boost-ext/di.git URL https://www.sqlite.org/2026/sqlite-amalgamation-3530000.zip
GIT_TAG v1.3.0 URL_HASH SHA3_256=c2325c53b3b41761469f91cfb078e96882ac5d85bac10c11b0bd8f253b031e5b
GIT_SHALLOW TRUE
) )
FetchContent_GetProperties(boost-di) FetchContent_GetProperties(sqlite_amalgamation)
if(NOT boost-di_POPULATED) if(NOT sqlite_amalgamation_POPULATED)
FetchContent_Populate(boost-di) FetchContent_Populate(sqlite_amalgamation)
endif() endif()
add_library(boost_di INTERFACE)
add_library(boost::di ALIAS boost_di)
target_include_directories(boost_di INTERFACE
$<BUILD_INTERFACE:${boost-di_SOURCE_DIR}/include>
)
# SQLite amalgamation
FetchContent_Declare(
sqlite_amalgamation
URL https://www.sqlite.org/2026/sqlite-amalgamation-3530000.zip
URL_HASH SHA3_256=c2325c53b3b41761469f91cfb078e96882ac5d85bac10c11b0bd8f253b031e5b
EXCLUDE_FROM_ALL
)
FetchContent_MakeAvailable(sqlite_amalgamation)
if(NOT TARGET sqlite3) if(NOT TARGET sqlite3)
add_library(sqlite3 STATIC ${sqlite_amalgamation_SOURCE_DIR}/sqlite3.c) add_library(sqlite3 STATIC
target_include_directories(sqlite3 PUBLIC ${sqlite_amalgamation_SOURCE_DIR}) ${sqlite_amalgamation_SOURCE_DIR}/sqlite3.c
target_compile_definitions(sqlite3 PUBLIC SQLITE_THREADSAFE=1) )
target_include_directories(sqlite3 PUBLIC
${sqlite_amalgamation_SOURCE_DIR}
)
target_compile_definitions(sqlite3 PUBLIC
SQLITE_THREADSAFE=1
)
endif() endif()
# llama.cpp — skipped for mock-only builds
if(NOT BIERGARTEN_MOCK_ONLY)
find_library(LLAMA_LIB NAMES llama)
find_library(GGML_LIB NAMES ggml)
find_library(GGML_BASE_LIB NAMES ggml-base)
find_path(LLAMA_INC_DIR NAMES llama.h PATH_SUFFIXES include)
if(LLAMA_LIB AND GGML_LIB AND GGML_BASE_LIB AND LLAMA_INC_DIR)
message(STATUS "[biergarten] Found system llama.cpp — skipping FetchContent")
add_library(llama SHARED IMPORTED)
set_target_properties(llama PROPERTIES
IMPORTED_LOCATION "${LLAMA_LIB}"
INTERFACE_INCLUDE_DIRECTORIES "${LLAMA_INC_DIR}"
INTERFACE_LINK_LIBRARIES "${GGML_LIB};${GGML_BASE_LIB}"
)
else()
message(STATUS "[biergarten] System llama.cpp not found — fetching via FetchContent")
FetchContent_Declare(
llama-cpp
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp.git
GIT_TAG b9012
)
FetchContent_MakeAvailable(llama-cpp)
endif()
endif()
# spdlog
FetchContent_Declare( FetchContent_Declare(
spdlog llama-cpp
GIT_REPOSITORY https://github.com/gabime/spdlog.git GIT_REPOSITORY https://github.com/ggml-org/llama.cpp.git
GIT_TAG v1.15.3 GIT_TAG b8742
)
FetchContent_MakeAvailable(llama-cpp)
FetchContent_Declare(
boost-di
GIT_REPOSITORY https://github.com/boost-ext/di.git
GIT_TAG v1.3.0
)
FetchContent_MakeAvailable(boost-di)
if(TARGET Boost.DI AND NOT TARGET boost::di)
add_library(boost::di ALIAS Boost.DI)
endif()
FetchContent_Declare(
spdlog
GIT_REPOSITORY https://github.com/gabime/spdlog.git
GIT_TAG v1.15.3
) )
FetchContent_MakeAvailable(spdlog) FetchContent_MakeAvailable(spdlog)
# cpp-httplib — header-only HTTP/HTTPS client replacing libcurl. # =============================================================================
# OpenSSL is required for HTTPS (Wikipedia API). find_package locates # 4. Sources
# libssl/libcrypto; HTTPLIB_REQUIRE_OPENSSL causes a hard build failure # =============================================================================
# if OpenSSL is absent rather than silently producing an HTTP-only binary. set(SOURCES
find_package(OpenSSL REQUIRED) src/main.cc
FetchContent_Declare( src/biergarten_data_generator/biergarten_data_generator.cc
cpp-httplib src/biergarten_data_generator/run.cc
GIT_REPOSITORY https://github.com/yhirose/cpp-httplib.git src/biergarten_data_generator/query_cities_with_countries.cc
GIT_TAG v0.43.2 src/biergarten_data_generator/generate_breweries.cc
GIT_SHALLOW TRUE src/biergarten_data_generator/log_results.cc
SYSTEM src/services/wikipedia/wikipedia_service.cc
) src/services/wikipedia/get_summary.cc
set(HTTPLIB_REQUIRE_OPENSSL ON CACHE BOOL "Require OpenSSL for cpp-httplib" FORCE) src/services/wikipedia/fetch_extract.cc
FetchContent_MakeAvailable(cpp-httplib) src/services/sqlite/sqlite_export_service.cc
src/services/sqlite/build_database_path.cc
# 5. Executable & Sources src/services/sqlite/process_record.cc
add_executable(${PROJECT_NAME} src/services/sqlite/initialize.cc
includes/services/enrichment/mock_enrichment.h) src/services/sqlite/finalize.cc
src/web_client/curl_global_state.cc
# --- Entry point --- src/web_client/curl_web_client_get.cc
target_sources(${PROJECT_NAME} PRIVATE src/web_client/curl_web_client_url_encode.cc
src/main.cc src/data_generation/llama/llama_generator.cc
src/data_generation/llama/generate_brewery.cc
src/data_generation/llama/generate_user.cc
src/data_generation/llama/helpers.cc
src/data_generation/llama/infer.cc
src/data_generation/llama/load.cc
src/services/prompt_directory.cc
src/data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.cc
src/data_generation/mock/deterministic_hash.cc
src/data_generation/mock/generate_brewery.cc
src/data_generation/mock/generate_user.cc
src/json_handling/json_loader.cc
src/services/sqlite/helpers/sqlite_connection_helpers.cpp
src/services/sqlite/helpers/sqlite_statement_helpers.cpp
) )
# --- json_handling --- # =============================================================================
target_sources(${PROJECT_NAME} PRIVATE # 5. Target
src/json_handling/json_loader.cc # =============================================================================
) add_executable(${PROJECT_NAME} ${SOURCES})
# --- application_options ---
target_sources(${PROJECT_NAME} PRIVATE
src/application_options/parse_arguments.cc
)
# --- biergarten_data_generator ---
target_sources(${PROJECT_NAME} PRIVATE
src/biergarten_data_generator/log_results.cc
src/biergarten_data_generator/biergarten_data_generator.cc
src/biergarten_data_generator/generate_breweries.cc
src/biergarten_data_generator/run.cc
src/biergarten_data_generator/query_cities_with_countries.cc
)
# --- web_client ---
target_sources(${PROJECT_NAME} PRIVATE
src/web_client/http_web_client.cc
)
# --- data_generation: prompt_formatting ---
target_sources(${PROJECT_NAME} PRIVATE
src/data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.cc
)
# --- data_generation: mock ---
target_sources(${PROJECT_NAME} PRIVATE
src/data_generation/mock/generate_brewery.cc
src/data_generation/mock/generate_user.cc
src/data_generation/mock/deterministic_hash.cc
)
# --- data_generation: llama (skipped for mock-only builds) ---
if(NOT BIERGARTEN_MOCK_ONLY)
target_sources(${PROJECT_NAME} PRIVATE
src/data_generation/llama/load.cc
src/data_generation/llama/helpers.cc
src/data_generation/llama/generate_brewery.cc
src/data_generation/llama/infer.cc
src/data_generation/llama/llama_generator.cc
src/data_generation/llama/generate_user.cc
)
endif()
# --- services: wikipedia ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/enrichment/wikipedia/wikipedia_service.cc
src/services/enrichment/wikipedia/fetch_extract.cc
src/services/enrichment/wikipedia/get_summary.cc
)
# --- services: sqlite ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/sqlite/process_record.cc
src/services/sqlite/sqlite_export_service.cc
src/services/sqlite/finalize.cc
src/services/sqlite/initialize.cc
src/services/sqlite/helpers/sqlite_connection_helpers.cc
src/services/sqlite/helpers/sqlite_statement_helpers.cc
)
# --- services (top-level) ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/prompt_directory.cc
)
# 6. Include Directories, Link Libraries & Compile Definitions
target_include_directories(${PROJECT_NAME} PRIVATE target_include_directories(${PROJECT_NAME} PRIVATE
includes includes
${llama-cpp_SOURCE_DIR}/include
${llama-cpp_SOURCE_DIR}/common
) )
target_link_libraries(${PROJECT_NAME} PRIVATE target_link_libraries(${PROJECT_NAME} PRIVATE
$<$<NOT:$<BOOL:${BIERGARTEN_MOCK_ONLY}>>:llama> llama
boost::di boost::di
Boost::json Boost::json
Boost::program_options Boost::program_options
spdlog::spdlog spdlog::spdlog
sqlite3 sqlite3
httplib::httplib CURL::libcurl
OpenSSL::SSL
OpenSSL::Crypto
) )
target_compile_definitions(${PROJECT_NAME} PRIVATE # =============================================================================
# Defined when -DBIERGARTEN_MOCK_ONLY=ON — skips llama.cpp entirely. # 6. Runtime Assets
# Use #ifdef BIERGARTEN_MOCK_ONLY in source to guard llama-specific code. # =============================================================================
$<$<BOOL:${BIERGARTEN_MOCK_ONLY}>:BIERGARTEN_MOCK_ONLY>
# Defined for Debug configuration builds.
# Use #ifdef DEBUG in source to enable debug-only behaviour (e.g. verbose logging).
$<$<CONFIG:Debug>:DEBUG>
)
# 7. Runtime Assets
configure_file( configure_file(
${CMAKE_SOURCE_DIR}/locations.json ${CMAKE_SOURCE_DIR}/locations.json
${CMAKE_BINARY_DIR}/locations.json ${CMAKE_BINARY_DIR}/locations.json
COPYONLY COPYONLY
) )
add_custom_command(TARGET ${PROJECT_NAME} POST_BUILD add_custom_command(TARGET ${PROJECT_NAME} POST_BUILD
COMMAND ${CMAKE_COMMAND} -E copy_directory COMMAND ${CMAKE_COMMAND} -E copy_directory
${CMAKE_SOURCE_DIR}/prompts ${CMAKE_SOURCE_DIR}/prompts
${CMAKE_BINARY_DIR}/prompts ${CMAKE_BINARY_DIR}/prompts
) )

View File

@@ -11,9 +11,11 @@
#include <vector> #include <vector>
#include "data_generation/data_generator.h" #include "data_generation/data_generator.h"
#include "data_model/generated_models.h" #include "data_model/enriched_city.h"
#include "services/database/export_service.h" #include "data_model/generated_brewery.h"
#include "services/enrichment/enrichment_service.h" #include "data_model/location.h"
#include "services/enrichment_service.h"
#include "services/export_service.h"
/** /**
* @brief Main data generator class for the Biergarten pipeline. * @brief Main data generator class for the Biergarten pipeline.
@@ -32,8 +34,7 @@ class BiergartenDataGenerator {
*/ */
BiergartenDataGenerator(std::unique_ptr<IEnrichmentService> context_service, BiergartenDataGenerator(std::unique_ptr<IEnrichmentService> context_service,
std::unique_ptr<DataGenerator> generator, std::unique_ptr<DataGenerator> generator,
std::unique_ptr<IExportService> exporter, std::unique_ptr<IExportService> exporter);
const ApplicationOptions& application_options);
/** /**
* @brief Run the data generation pipeline. * @brief Run the data generation pipeline.
@@ -57,14 +58,12 @@ class BiergartenDataGenerator {
/// @brief Storage backend for generated brewery records. /// @brief Storage backend for generated brewery records.
std::unique_ptr<IExportService> exporter_; std::unique_ptr<IExportService> exporter_;
const ApplicationOptions application_options_;
/** /**
* @brief Load locations from JSON and sample cities. * @brief Load locations from JSON and sample cities.
* *
* @return Vector of sampled locations capped at 50 entries. * @return Vector of sampled locations capped at 50 entries.
*/ */
std::vector<Location> QueryCitiesWithCountries(); static std::vector<Location> QueryCitiesWithCountries();
/** /**
* @brief Generate breweries for enriched cities. * @brief Generate breweries for enriched cities.

View File

@@ -8,7 +8,9 @@
#include <string> #include <string>
#include "data_model/generated_models.h" #include "data_model/brewery_result.h"
#include "data_model/location.h"
#include "data_model/user_result.h"
/** /**
* @brief Interface for data generator implementations. * @brief Interface for data generator implementations.

View File

@@ -14,10 +14,10 @@
#include <string> #include <string>
#include <string_view> #include <string_view>
#include "../services/prompting/prompt_directory.h"
#include "data_generation/data_generator.h" #include "data_generation/data_generator.h"
#include "data_generation/prompt_formatting/prompt_formatter.h" #include "data_generation/prompt_formatting/prompt_formatter.h"
#include "data_model/models.h" #include "data_model/application_options.h"
#include "services/prompt_directory.h"
struct llama_model; struct llama_model;
struct llama_context; struct llama_context;
@@ -129,7 +129,6 @@ class LlamaGenerator final : public DataGenerator {
uint32_t sampling_top_k_ = kDefaultSamplingTopK; uint32_t sampling_top_k_ = kDefaultSamplingTopK;
std::mt19937 rng_; std::mt19937 rng_;
uint32_t n_ctx_ = kDefaultContextSize; uint32_t n_ctx_ = kDefaultContextSize;
int n_gpu_layers_ = 0;
std::unique_ptr<IPromptFormatter> prompt_formatter_; std::unique_ptr<IPromptFormatter> prompt_formatter_;
std::unique_ptr<IPromptDirectory> prompt_directory_; std::unique_ptr<IPromptDirectory> prompt_directory_;
}; };

View File

@@ -12,7 +12,7 @@
#include <string> #include <string>
#include <string_view> #include <string_view>
#include "data_model/generated_models.h" #include "data_model/brewery_result.h"
struct llama_vocab; struct llama_vocab;
using llama_token = int32_t; using llama_token = int32_t;

View File

@@ -1,5 +1,4 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_GEMMA4_JINJA_PROMPT_FORMATTER_H_ #pragma once
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_GEMMA4_JINJA_PROMPT_FORMATTER_H_
#include <string> #include <string>
#include <string_view> #include <string_view>
@@ -14,5 +13,3 @@ class Gemma4JinjaPromptFormatter final : public IPromptFormatter {
[[nodiscard]] std::string Format(std::string_view system_prompt, [[nodiscard]] std::string Format(std::string_view system_prompt,
std::string_view user_prompt) const override; std::string_view user_prompt) const override;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_GEMMA4_JINJA_PROMPT_FORMATTER_H_

View File

@@ -1,5 +1,4 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_PROMPT_FORMATTER_H_ #pragma once
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_PROMPT_FORMATTER_H_
#include <string> #include <string>
#include <string_view> #include <string_view>
@@ -16,5 +15,3 @@ class IPromptFormatter {
[[nodiscard]] virtual std::string Format( [[nodiscard]] virtual std::string Format(
std::string_view system_prompt, std::string_view user_prompt) const = 0; std::string_view system_prompt, std::string_view user_prompt) const = 0;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_PROMPT_FORMATTER_H_

View File

@@ -0,0 +1,72 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_APPLICATION_OPTIONS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_APPLICATION_OPTIONS_H_
/**
* @file data_model/application_options.h
* @brief Program options for the Biergarten pipeline application.
*/
#include <cstdint>
#include <filesystem>
#include <optional>
#include <string>
/**
* @brief LLM sampling parameters.
*/
struct SamplingOptions {
/// @brief LLM sampling temperature (0.0 to 1.0, higher = more random).
float temperature = 1.0F;
/// @brief LLM nucleus sampling top-p parameter.
float top_p = 0.95F;
/// @brief LLM top-k sampling parameter.
uint32_t top_k = 64;
/// @brief Context window size (tokens).
uint32_t n_ctx = 8192;
/// @brief Random seed (-1 for random, otherwise non-negative).
int seed = -1;
};
/**
* @brief Configuration for the LLM generator component.
*/
struct GeneratorOptions {
/// @brief Path to the LLM model file (gguf format).
std::filesystem::path model_path;
/// @brief Use mocked generator instead of actual LLM inference.
bool use_mocked = false;
/// @brief Specific sampling parameters for this generator.
/// If nullopt, the application should use global defaults.
std::optional<SamplingOptions> sampling;
};
/**
* @brief Configuration for the pipeline execution and output.
*/
struct PipelineOptions {
/// @brief Directory for generated artifacts.
std::filesystem::path output_path;
/// @brief Directory that contains named prompt files (e.g.
/// BREWERY_GENERATION.md).
std::filesystem::path prompt_dir;
/// @brief Path for application logs.
std::filesystem::path log_path;
};
/**
* @brief Root configuration object for the Biergarten pipeline.
*/
struct ApplicationOptions {
GeneratorOptions generator;
PipelineOptions pipeline;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_APPLICATION_OPTIONS_H_

View File

@@ -0,0 +1,22 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_LOCATION_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_LOCATION_H_
/**
* @file data_model/brewery_location.h
* @brief Non-owning brewery location input.
*/
#include <string_view>
/**
* @brief Non-owning brewery location input.
*/
struct BreweryLocation {
/// @brief City name.
std::string_view city_name;
/// @brief Country name.
std::string_view country_name;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_LOCATION_H_

View File

@@ -0,0 +1,28 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_RESULT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_RESULT_H_
/**
* @file data_model/brewery_result.h
* @brief Generated brewery payload.
*/
#include <string>
/**
* @brief Generated brewery payload.
*/
struct BreweryResult {
/// @brief Brewery display name in English.
std::string name_en;
/// @brief Brewery description text in English.
std::string description_en;
/// @brief Brewery display name in the local language.
std::string name_local;
/// @brief Brewery description text in the local language.
std::string description_local;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_RESULT_H_

View File

@@ -0,0 +1,21 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_ENRICHED_CITY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_ENRICHED_CITY_H_
/**
* @file data_model/enriched_city.h
* @brief Enriched city data with Wikipedia context.
*/
#include <string>
#include "data_model/location.h"
/**
* @brief Enriched city data with Wikipedia context.
*/
struct EnrichedCity {
Location location;
std::string region_context{};
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_ENRICHED_CITY_H_

View File

@@ -0,0 +1,20 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_BREWERY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_BREWERY_H_
/**
* @file data_model/generated_brewery.h
* @brief Helper struct to store generated brewery data.
*/
#include "data_model/brewery_result.h"
#include "data_model/location.h"
/**
* @brief Helper struct to store generated brewery data.
*/
struct GeneratedBrewery {
Location location;
BreweryResult brewery;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_BREWERY_H_

View File

@@ -1,66 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_MODELS_H_
/**
* @file data_model/generated_models.h
* @brief Generated output models from the pipeline: brewery/user results, enriched data,
* and complete generation results.
*/
#include <string>
#include "data_model/models.h"
// ============================================================================
// Generation Output Models
// ============================================================================
/**
* @brief Generated brewery payload.
*/
struct BreweryResult {
/// @brief Brewery display name in English.
std::string name_en;
/// @brief Brewery description text in English.
std::string description_en;
/// @brief Brewery display name in the local language.
std::string name_local;
/// @brief Brewery description text in the local language.
std::string description_local;
};
/**
* @brief Generated user profile payload.
*/
struct UserResult {
/// @brief Username handle.
std::string username{};
/// @brief Short user biography.
std::string bio{};
};
// ============================================================================
// Pipeline Data Models
// ============================================================================
/**
* @brief Enriched city data with Wikipedia context.
*/
struct EnrichedCity {
Location location;
std::string region_context{};
};
/**
* @brief Helper struct to store generated brewery data.
*/
struct GeneratedBrewery {
Location location;
BreweryResult brewery;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_MODELS_H_

View File

@@ -0,0 +1,13 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATION_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATION_MODELS_H_
/**
* @file data_model/generation_models.h
* @brief Convenience include for shared generation payload models.
*/
#include "data_model/brewery_location.h"
#include "data_model/brewery_result.h"
#include "data_model/user_result.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATION_MODELS_H_

View File

@@ -0,0 +1,41 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_LOCATION_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_LOCATION_H_
/**
* @file data_model/location.h
* @brief Location data model used throughout generation pipeline.
*/
#include <string>
#include <vector>
/**
* @brief Canonical location record for city-level generation.
*/
struct Location {
/// @brief City name.
std::string city{};
/// @brief State or province name.
std::string state_province{};
/// @brief ISO 3166-2 subdivision code.
std::string iso3166_2{};
/// @brief Country name.
std::string country{};
/// @brief ISO 3166-1 country code.
std::string iso3166_1{};
/// @brief Local language codes in priority order.
std::vector<std::string> local_languages{};
/// @brief Latitude in decimal degrees.
double latitude{};
/// @brief Longitude in decimal degrees.
double longitude{};
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_LOCATION_H_

View File

@@ -1,141 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_MODELS_H_
/**
* @file data_model/models.h
* @brief Core data models: locations, application configuration, and generation
* inputs.
*/
#include <boost/program_options.hpp>
#include <cstdint>
#include <filesystem>
#include <optional>
#include <string>
#include <string_view>
#include <vector>
namespace prog_opts = boost::program_options;
// ============================================================================
// Location Models
// ============================================================================
/**
* @brief Canonical location record for city-level generation.
*/
struct Location {
/// @brief City name.
std::string city{};
/// @brief State or province name.
std::string state_province{};
/// @brief ISO 3166-2 subdivision code.
std::string iso3166_2{};
/// @brief Country name.
std::string country{};
/// @brief ISO 3166-1 country code.
std::string iso3166_1{};
/// @brief Local language codes in priority order.
std::vector<std::string> local_languages{};
/// @brief Latitude in decimal degrees.
double latitude{};
/// @brief Longitude in decimal degrees.
double longitude{};
};
/**
* @brief Non-owning brewery location input.
*/
struct BreweryLocation {
/// @brief City name.
std::string_view city_name;
/// @brief Country name.
std::string_view country_name;
};
// ============================================================================
// Configuration Models
// ============================================================================
/**
* @brief LLM sampling parameters.
*/
struct SamplingOptions {
/// @brief LLM sampling temperature (0.0 to 1.0, higher = more random).
float temperature = 1.0F;
/// @brief LLM nucleus sampling top-p parameter.
float top_p = 0.95F;
/// @brief LLM top-k sampling parameter.
uint32_t top_k = 64;
/// @brief Context window size (tokens).
uint32_t n_ctx = 8192;
/// @brief Random seed (-1 for random, otherwise non-negative).
int seed = -1;
/// @brief Number of layers to offload to GPU.
int n_gpu_layers = 0;
};
/**
* @brief Configuration for the LLM generator component.
*/
struct GeneratorOptions {
/// @brief Path to the LLM model file (gguf format).
std::filesystem::path model_path;
/// @brief Use mocked generator instead of actual LLM inference.
bool use_mocked = false;
/// @brief Specific sampling parameters for this generator.
/// If nullopt, the application should use global defaults.
std::optional<SamplingOptions> sampling;
};
/**
* @brief Configuration for the pipeline execution and output.
*/
struct PipelineOptions {
/// @brief Directory for generated artifacts.
std::filesystem::path output_path;
/// @brief Directory that contains named prompt files (e.g.
/// BREWERY_GENERATION.md).
std::filesystem::path prompt_dir;
/// @brief Path for application logs.
std::filesystem::path log_path;
/// @brief Number of locations to sample from the dataset
/// More locations -> more users/more breweries
uint32_t location_count;
};
/**
* @brief Root configuration object for the Biergarten pipeline.
*/
struct ApplicationOptions {
GeneratorOptions generator;
PipelineOptions pipeline;
};
// ============================================================================
// Function Declarations
// ============================================================================
std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv);
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_MODELS_H_

View File

@@ -0,0 +1,12 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_PIPELINE_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_PIPELINE_MODELS_H_
/**
* @file data_model/pipeline_models.h
* @brief Convenience include for pipeline-specific data models.
*/
#include "data_model/enriched_city.h"
#include "data_model/generated_brewery.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_PIPELINE_MODELS_H_

View File

@@ -0,0 +1,22 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_USER_RESULT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_USER_RESULT_H_
/**
* @file data_model/user_result.h
* @brief Generated user profile payload.
*/
#include <string>
/**
* @brief Generated user profile payload.
*/
struct UserResult {
/// @brief Username handle.
std::string username{};
/// @brief Short user biography.
std::string bio{};
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_USER_RESULT_H_

View File

@@ -9,7 +9,7 @@
#include <filesystem> #include <filesystem>
#include <vector> #include <vector>
#include "data_model/models.h" #include "data_model/location.h"
/// @brief Loads curated world locations from a JSON file into memory. /// @brief Loads curated world locations from a JSON file into memory.
class JsonLoader { class JsonLoader {

View File

@@ -1,10 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_HELPERS_H_
/* Umbrella header for backward compatibility. */
#include "sqlite_connection_helpers.h"
#include "sqlite_handle_types.h"
#include "sqlite_statement_helpers.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_HELPERS_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_DATE_TIME_PROVIDER_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATE_TIME_PROVIDER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_DATE_TIME_PROVIDER_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATE_TIME_PROVIDER_H_
/** /**
* @file services/date_time_provider.h * @file services/date_time_provider.h
@@ -63,4 +63,4 @@ class SystemDateTimeProvider final : public IDateTimeProvider {
} }
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_DATE_TIME_PROVIDER_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATE_TIME_PROVIDER_H_

View File

@@ -1,35 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_TIMER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_TIMER_H_
#include <chrono>
/**
* @file services/timer.h
* @brief Simple timer utility for measuring elapsed time.
*/
class Timer {
std::chrono::steady_clock::time_point start_time =
std::chrono::steady_clock::now();
public:
Timer(const Timer&) = delete;
Timer& operator=(const Timer&) = delete;
Timer(Timer&&) = delete;
Timer& operator=(Timer&&) = delete;
Timer() = default;
~Timer() = default;
[[nodiscard]] int64_t Elapsed() const {
return std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::steady_clock::now() - start_time)
.count();
}
[[nodiscard]] int64_t Reset() {
auto previous_elapsed = Elapsed();
start_time = std::chrono::steady_clock::now();
return previous_elapsed;
}
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_TIMER_H_

View File

@@ -1,17 +0,0 @@
//
// Created by aaronpo on 13/05/2026.
//
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_MOCK_ENRICHMENT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_MOCK_ENRICHMENT_H_
#include <string>
#include "enrichment_service.h"
class MockEnrichmentService final : public IEnrichmentService {
public:
std::string GetLocationContext(const Location& /*loc*/) override {
return {};
}
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_MOCK_ENRICHMENT_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_ENRICHMENT_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_ENRICHMENT_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_SERVICE_H_
/** /**
* @file services/enrichment_service.h * @file services/enrichment_service.h
@@ -8,7 +8,7 @@
#include <string> #include <string>
#include "data_model/models.h" #include "data_model/location.h"
/** /**
* @brief Interface for services that can enrich a location with context. * @brief Interface for services that can enrich a location with context.
@@ -27,4 +27,4 @@ class IEnrichmentService {
virtual std::string GetLocationContext(const Location& loc) = 0; virtual std::string GetLocationContext(const Location& loc) = 0;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_ENRICHMENT_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_SERVICE_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_EXPORT_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_EXPORT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_EXPORT_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_EXPORT_SERVICE_H_
/** /**
* @file services/export_service.h * @file services/export_service.h
@@ -8,7 +8,7 @@
#include <cstdint> #include <cstdint>
#include "data_model/generated_models.h" #include "data_model/generated_brewery.h"
/** /**
* @brief Interface for services that persist generated brewery records. * @brief Interface for services that persist generated brewery records.
@@ -39,4 +39,4 @@ class IExportService {
virtual void Finalize() = 0; virtual void Finalize() = 0;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_EXPORT_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_EXPORT_SERVICE_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPTING_PROMPT_DIRECTORY_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPT_DIRECTORY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPTING_PROMPT_DIRECTORY_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPT_DIRECTORY_H_
/** /**
* @file services/prompt_directory.h * @file services/prompt_directory.h
@@ -73,4 +73,4 @@ class PromptDirectory final : public IPromptDirectory {
std::unordered_map<std::string, std::string> cache_; std::unordered_map<std::string, std::string> cache_;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPTING_PROMPT_DIRECTORY_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPT_DIRECTORY_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_CONNECTION_HELPERS_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_CONNECTION_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_CONNECTION_HELPERS_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_CONNECTION_HELPERS_H_
/** /**
* @file services/sqlite_connection_helpers.h * @file services/sqlite_connection_helpers.h
@@ -12,7 +12,7 @@
#include <string> #include <string>
#include <string_view> #include <string_view>
#include "sqlite_handle_types.h" #include "services/sqlite_handle_types.h"
namespace sqlite_export_service_internal { namespace sqlite_export_service_internal {
@@ -27,4 +27,4 @@ void RollbackTransactionNoThrow(const SqliteDatabaseHandle& db_handle) noexcept;
} // namespace sqlite_export_service_internal } // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_CONNECTION_HELPERS_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_CONNECTION_HELPERS_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_H_
/** /**
* @file services/sqlite_export_service.h * @file services/sqlite_export_service.h
@@ -11,10 +11,10 @@
#include <string> #include <string>
#include <unordered_map> #include <unordered_map>
#include "data_model/models.h" #include "data_model/application_options.h"
#include "../datetime/date_time_provider.h" #include "services/date_time_provider.h"
#include "export_service.h" #include "services/export_service.h"
#include "sqlite_export_service_helpers.h" #include "services/sqlite_export_service_helpers.h"
/** /**
* @brief Persists generated brewery records into a fresh SQLite database. * @brief Persists generated brewery records into a fresh SQLite database.
@@ -57,4 +57,4 @@ class SqliteExportService final : public IExportService {
std::unordered_map<std::string, sqlite3_int64> location_cache_; std::unordered_map<std::string, sqlite3_int64> location_cache_;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_H_

View File

@@ -0,0 +1,10 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_HELPERS_H_
/* Umbrella header for backward compatibility. */
#include "services/sqlite_connection_helpers.h"
#include "services/sqlite_handle_types.h"
#include "services/sqlite_statement_helpers.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_HELPERS_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_HANDLE_TYPES_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_HANDLE_TYPES_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_HANDLE_TYPES_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_HANDLE_TYPES_H_
/** /**
* Shared handle and parameter type declarations used by SQLite helper units. * Shared handle and parameter type declarations used by SQLite helper units.
@@ -33,4 +33,4 @@ struct BindParam {
} // namespace sqlite_export_service_internal } // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_HANDLE_TYPES_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_HANDLE_TYPES_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_STATEMENT_HELPERS_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_STATEMENT_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_STATEMENT_HELPERS_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_STATEMENT_HELPERS_H_
/** /**
* @file services/sqlite_statement_helpers.h * @file services/sqlite_statement_helpers.h
@@ -13,7 +13,7 @@
#include <string_view> #include <string_view>
#include <vector> #include <vector>
#include "sqlite_handle_types.h" #include "services/sqlite_handle_types.h"
namespace sqlite_export_service_internal { namespace sqlite_export_service_internal {
@@ -113,4 +113,4 @@ std::string SerializeVector(const std::vector<std::string>& str_vec);
} // namespace sqlite_export_service_internal } // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_STATEMENT_HELPERS_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_STATEMENT_HELPERS_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_WIKIPEDIA_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_WIKIPEDIA_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_WIKIPEDIA_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_WIKIPEDIA_SERVICE_H_
/** /**
* @file services/wikipedia_service.h * @file services/wikipedia_service.h
@@ -11,14 +11,14 @@
#include <string_view> #include <string_view>
#include <unordered_map> #include <unordered_map>
#include "enrichment_service.h" #include "services/enrichment_service.h"
#include "web_client/web_client.h" #include "web_client/web_client.h"
/// @brief Provides Wikipedia summary lookups backed by cached raw extracts. /// @brief Provides Wikipedia summary lookups backed by cached raw extracts.
class WikipediaEnrichmentService final : public IEnrichmentService { class WikipediaService final : public IEnrichmentService {
public: public:
/// @brief Creates a new Wikipedia service with the provided web client. /// @brief Creates a new Wikipedia service with the provided web client.
explicit WikipediaEnrichmentService(std::unique_ptr<WebClient> client); explicit WikipediaService(std::unique_ptr<WebClient> client);
/// @brief Returns the Wikipedia-derived context for a location. /// @brief Returns the Wikipedia-derived context for a location.
[[nodiscard]] std::string GetLocationContext(const Location& loc) override; [[nodiscard]] std::string GetLocationContext(const Location& loc) override;
@@ -30,4 +30,4 @@ class WikipediaEnrichmentService final : public IEnrichmentService {
std::unordered_map<std::string, std::string> extract_cache_; std::unordered_map<std::string, std::string> extract_cache_;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_WIKIPEDIA_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_WIKIPEDIA_SERVICE_H_

View File

@@ -0,0 +1,54 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_CURL_WEB_CLIENT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_CURL_WEB_CLIENT_H_
/**
* @file web_client/curl_web_client.h
* @brief libcurl-based WebClient implementation.
*/
#include "web_client/web_client.h"
/**
* @brief RAII wrapper for curl_global_init and curl_global_cleanup.
*
* Create one instance in application startup before using libcurl and keep it
* alive for application lifetime.
*/
class CurlGlobalState {
public:
/// @brief Initializes global libcurl state.
CurlGlobalState();
/// @brief Cleans up global libcurl state.
~CurlGlobalState();
/// @brief Non-copyable type.
CurlGlobalState(const CurlGlobalState&) = delete;
/// @brief Non-copyable type.
CurlGlobalState& operator=(const CurlGlobalState&) = delete;
};
/**
* @brief WebClient implementation backed by libcurl.
*/
class CURLWebClient : public WebClient {
public:
/**
* @brief Executes an HTTP GET request.
*
* @param url Request URL.
* @return Response body.
*/
std::string Get(const std::string& url) override;
/**
* @brief URL-encodes a string value.
*
* @param value Raw value.
* @return URL-encoded string.
*/
std::string UrlEncode(const std::string& value) override;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_CURL_WEB_CLIENT_H_

View File

@@ -1,49 +0,0 @@
/**
* @file web_client/http_web_client.h
* @brief cpp-httplib implementation of the WebClient interface.
*/
#ifndef BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_HTTP_WEB_CLIENT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_HTTP_WEB_CLIENT_H_
#include "web_client/web_client.h"
#include <string>
/**
* @brief WebClient implementation backed by cpp-httplib.
*
* Supports HTTP and HTTPS (requires OpenSSL; see HTTPLIB_REQUIRE_OPENSSL
* in CMakeLists.txt).
*
* URL parsing splits a full URL into origin (scheme://host[:port]) and
* path + query so that httplib::Client can be constructed correctly.
* A new client instance is created per request because the client is
* bound to a single origin at construction time.
*/
class HttpWebClient final : public WebClient {
public:
HttpWebClient() = default;
~HttpWebClient() override = default;
/**
* @brief Executes a blocking HTTP/HTTPS GET request against a full URL.
*
* @param url Fully-qualified URL, e.g. "https://en.wikipedia.org/api/rest_v1/page/summary/Berlin"
* @return Response body on HTTP 2xx; throws std::runtime_error otherwise.
*/
std::string Get(const std::string& url) override;
/**
* @brief Percent-encodes a single URI component (query parameter value or
* path segment). Delegates to httplib::encode_uri_component().
*
* @param value Raw string to encode.
* @return Percent-encoded string safe for use in a URL.
*/
std::string EncodeURL(const std::string& value) override;
};
#endif

View File

@@ -30,7 +30,7 @@ class WebClient {
* @param value Raw string value. * @param value Raw string value.
* @return Encoded value safe for URL usage. * @return Encoded value safe for URL usage.
*/ */
virtual std::string EncodeURL(const std::string& value) = 0; virtual std::string UrlEncode(const std::string& value) = 0;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_WEB_CLIENT_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_WEB_CLIENT_H_

View File

@@ -1,9 +0,0 @@
# Ignore model files!
*.gguf
*.bin
models/
weights/
# Ignore local build folders
build/
.git/

View File

@@ -1,72 +0,0 @@
# --- Stage 1: Build Environment (The "Heavy" Stage) ---
FROM nvidia/cuda:12.6.3-devel-ubuntu24.04 AS builder
ENV DEBIAN_FRONTEND=noninteractive \
CMAKE_GENERATOR=Ninja
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential ca-certificates curl git libboost-json-dev \
libboost-program-options-dev libssl-dev ninja-build pkg-config zlib1g-dev \
&& rm -rf /var/lib/apt/lists/*
# Install modern CMake
RUN curl -L https://github.com/Kitware/CMake/releases/download/v3.31.0/cmake-3.31.0-linux-x86_64.sh -o cmake.sh && \
sh cmake.sh --skip-license --prefix=/usr/local && rm cmake.sh
# Get headers for C++ build
RUN curl -L https://github.com/ggml-org/llama.cpp/archive/refs/tags/b9012.tar.gz -o /tmp/llama-src.tar.gz && \
tar -xzf /tmp/llama-src.tar.gz -C /tmp && \
cp -r /tmp/llama.cpp-b9012/include/* /usr/local/include/ && \
cp -r /tmp/llama.cpp-b9012/ggml/include/* /usr/local/include/
# Pull llama.cpp binaries to use during build if needed
COPY --from=ghcr.io/ggml-org/llama.cpp:full-cuda /app/lib*.so* /usr/local/lib/
WORKDIR /app
COPY . .
# Build the C++ pipeline
RUN cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && \
cmake --build build -j$(nproc)
# --- Stage 2: Runtime Environment (The "Slim" Stage) ---
FROM nvidia/cuda:12.6.3-runtime-ubuntu24.04 AS runtime
# Install only necessary runtime shared libraries
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
ca-certificates \
libboost-json1.83.0 \
libboost-program-options1.83.0 \
libgomp1 \
libssl3 \
zlib1g \
&& rm -rf /var/lib/apt/lists/*
ENV APP_ROOT=/app \
LD_LIBRARY_PATH="/usr/local/lib:${LD_LIBRARY_PATH}"
WORKDIR /app/build
# Copy only the compiled binaries from the builder
COPY --from=builder /app/build/biergarten-pipeline ./
# Copy required config files
COPY locations.json /app/build/
COPY beer-styles.json /app/build/
# Copy prompt templates
COPY prompts /app/prompts
# Copy only the necessary shared libraries from builder/llama-bin
COPY --from=ghcr.io/ggml-org/llama.cpp:full-cuda /app/lib*.so* /usr/local/lib/
# Co-locate plugins
RUN cp /usr/local/lib/libggml-cuda.so . 2>/dev/null || true && \
cp /usr/local/lib/libggml-cpu*.so . 2>/dev/null || true
# Setup Start Script
COPY ./runpod/start.sh /usr/local/bin/biergarten-start
RUN chmod +x /usr/local/bin/biergarten-start
ENTRYPOINT ["/usr/local/bin/biergarten-start"]

View File

@@ -1,8 +0,0 @@
```bash
touch runpod/start.sh
docker build \
--progress=plain \
-t biergarten-pipeline:latest \
-f runpod/Dockerfile \
. 2>&1 | tee build.log
```

View File

@@ -1,22 +0,0 @@
name: biergarten-pipeline-live
imageName: biergarten-pipeline:latest
category: NVIDIA
containerDiskInGb: 50
volumeInGb: 50
volumeMountPath: /workspace
dockerEntrypoint:
- /usr/local/bin/biergarten-start
dockerStartCmd: []
isPublic: false
isServerless: false
env:
BIERGARTEN_MODE: live
BIERGARTEN_MODEL_PATH: /workspace/models/google_gemma-4-E4B-it-Q6_K.gguf
BIERGARTEN_PROMPT_DIR: /workspace/app/build/prompts
BIERGARTEN_OUTPUT_DIR: /workspace/output
BIERGARTEN_LOG_PATH: /workspace/logs/pipeline.log
BIERGARTEN_TEMPERATURE: "1.0"
BIERGARTEN_TOP_P: "0.95"
BIERGARTEN_TOP_K: "64"
BIERGARTEN_N_CTX: "8192"
BIERGARTEN_SEED: "-1"

View File

@@ -1,58 +0,0 @@
#!/bin/bash
set -e
MODEL_PATH="${BIERGARTEN_MODEL_PATH:-/workspace/models/google_gemma-4-E4B-it-Q6_K.gguf}"
OUTPUT_DIR="${BIERGARTEN_OUTPUT_DIR:-/workspace/output}"
LOG_PATH="${BIERGARTEN_LOG_PATH:-/workspace/logs/pipeline.log}"
EXECUTABLE="/app/build/biergarten-pipeline"
PROMPT_DIR="/app/prompts"
echo "--- Starting Biergarten Pipeline Environment Check ---"
# Ensure directories exist
mkdir -p "$OUTPUT_DIR"
mkdir -p "$(dirname "$LOG_PATH")"
mkdir -p "$(dirname "$MODEL_PATH")"
# Download model if missing
if [ ! -f "$MODEL_PATH" ]; then
echo "Model not found. Downloading (this may take a while)..."
curl -L -C - \
-o "$MODEL_PATH" \
"https://huggingface.co/bartowski/google_gemma-4-E4B-it-GGUF/resolve/main/google_gemma-4-E4B-it-Q6_K.gguf?download=true"
echo "Download complete."
fi
# Verify model exists
if [ ! -f "$MODEL_PATH" ]; then
echo "ERROR: Model still not found after download attempt."
exit 1
fi
# Default GPU layers
GL_LAYERS="${BIERGARTEN_GL_LAYERS:-40}"
# Build args
ARGS=(
"--model" "$MODEL_PATH"
"--prompt-dir" "$PROMPT_DIR"
"--output" "$OUTPUT_DIR"
"--log-path" "$LOG_PATH"
"--n-gpu-layers" "$GL_LAYERS"
)
# Optional params
[[ -n "$BIERGARTEN_TEMPERATURE" ]] && ARGS+=("--temperature" "$BIERGARTEN_TEMPERATURE")
[[ -n "$BIERGARTEN_TOP_P" ]] && ARGS+=("--top-p" "$BIERGARTEN_TOP_P")
[[ -n "$BIERGARTEN_TOP_K" ]] && ARGS+=("--top-k" "$BIERGARTEN_TOP_K")
[[ -n "$BIERGARTEN_N_CTX" ]] && ARGS+=("--n-ctx" "$BIERGARTEN_N_CTX")
[[ -n "$BIERGARTEN_SEED" ]] && ARGS+=("--seed" "$BIERGARTEN_SEED")
# Extra args
[[ -n "$BIERGARTEN_EXTRA_ARGS" ]] && ARGS+=($BIERGARTEN_EXTRA_ARGS)
echo "--- Executing: $EXECUTABLE ${ARGS[*]} ---"
exec "$EXECUTABLE" "${ARGS[@]}"

View File

@@ -1,158 +0,0 @@
#include <spdlog/spdlog.h>
#include <optional>
#include <sstream>
#include <string>
#include "data_model/models.h"
std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv) {
prog_opts::options_description desc("Pipeline Options");
auto opt = desc.add_options();
opt("help,h", "Produce help message");
// Defaults sourced from SamplingOptions{} so the CLI and LlamaGenerator
// share a single source of truth — changing the struct updates both.
auto add_sampling_options = [&]() -> void {
const SamplingOptions sampling_defaults{};
opt("temperature",
prog_opts::value<float>()->default_value(sampling_defaults.temperature),
"Sampling temperature (higher = more random)");
opt("top-p",
prog_opts::value<float>()->default_value(sampling_defaults.top_p),
"Nucleus sampling top-p in (0,1] (higher = more random)");
opt("top-k",
prog_opts::value<uint32_t>()->default_value(sampling_defaults.top_k),
"Top-k sampling parameter (higher = more candidate tokens)");
opt("n-ctx",
prog_opts::value<uint32_t>()->default_value(sampling_defaults.n_ctx),
"Context window size in tokens");
opt("seed", prog_opts::value<int>()->default_value(sampling_defaults.seed),
"Sampler seed: -1 for random, otherwise non-negative integer");
opt("n-gpu-layers", prog_opts::value<int>()->default_value(0),
"Number of layers to offload to GPU");
};
// --mocked and --model are mutually exclusive; validation is enforced below
// rather than at registration to produce a clear diagnostic message.
auto add_generator_options = [&]() -> void {
opt("mocked", prog_opts::bool_switch(),
"Use mocked generator for brewery/user data");
opt("model,m", prog_opts::value<std::string>()->default_value(""),
"Path to LLM model (gguf)");
};
auto add_pipeline_options = [&]() -> void {
opt("output,o", prog_opts::value<std::string>()->default_value("output"),
"Directory for generated artifacts");
opt("log-path",
prog_opts::value<std::string>()->default_value("pipeline.log"),
"Path for application logs");
opt("prompt-dir", prog_opts::value<std::string>()->default_value(""),
"Directory containing named prompt files (e.g. BREWERY_GENERATION.md)."
" Required when not using --mocked.");
opt("location-count", prog_opts::value<uint32_t>()->default_value(10));
};
add_sampling_options();
add_generator_options();
add_pipeline_options();
// No flags provided — treat as a help request rather than an error.
if (argc == 1) {
spdlog::info("Biergarten Pipeline");
std::stringstream usage_stream;
usage_stream << "\nUsage: biergarten-pipeline [options]\n\n" << desc;
spdlog::info(usage_stream.str());
return std::nullopt;
}
try {
prog_opts::variables_map var_map;
prog_opts::store(prog_opts::parse_command_line(argc, argv, desc), var_map);
prog_opts::notify(var_map);
if (var_map.contains("help")) {
std::stringstream help_stream;
help_stream << "\n" << desc;
spdlog::info(help_stream.str());
return std::nullopt;
}
ApplicationOptions options;
options.pipeline.output_path = var_map["output"].as<std::string>();
options.pipeline.log_path = var_map["log-path"].as<std::string>();
options.pipeline.prompt_dir = var_map["prompt-dir"].as<std::string>();
options.pipeline.location_count =
var_map["location-count"].as<uint32_t>();
const bool use_mocked = var_map["mocked"].as<bool>();
const std::string model_path = var_map["model"].as<std::string>();
const int n_gpu_layers = var_map["n-gpu-layers"].as<int>();
// Enforce mutual exclusivity before any further configuration is applied.
if (use_mocked && !model_path.empty()) {
spdlog::error(
"Invalid arguments: --mocked and --model are mutually exclusive");
return std::nullopt;
}
if (!use_mocked && model_path.empty()) {
spdlog::error(
"Invalid arguments: either --mocked or --model must be specified");
return std::nullopt;
}
// Prompt directory is only meaningful for live inference — the mock
// generator has no use for it and should not require it to be present.
if (!use_mocked && options.pipeline.prompt_dir.empty()) {
spdlog::error(
"Invalid arguments: --prompt-dir is required when not using "
"--mocked");
return std::nullopt;
}
options.generator.use_mocked = use_mocked;
options.generator.model_path = model_path;
// options.generator.n_gpu_layers = n_gpu_layers;
// Only populate sampling config when the user explicitly overrides at
// least one value. Leaving it as std::nullopt lets LlamaGenerator fall
// back to its own SamplingOptions{} defaults, keeping the two paths
// consistent without redundant copies.
const bool user_provided_sampling =
!var_map["temperature"].defaulted() || !var_map["top-p"].defaulted() ||
!var_map["top-k"].defaulted() || !var_map["n-ctx"].defaulted() ||
!var_map["seed"].defaulted() || !var_map["n_gpu_layers"].defaulted();
if (user_provided_sampling) {
// Warn but do not fail — the run is still valid, the flags are just
// silently irrelevant when no model is loaded.
if (use_mocked) {
spdlog::warn("Sampling parameters are ignored when using --mocked");
} else {
SamplingOptions sampling;
sampling.temperature = var_map["temperature"].as<float>();
sampling.top_p = var_map["top-p"].as<float>();
sampling.top_k = var_map["top-k"].as<uint32_t>();
sampling.n_ctx = var_map["n-ctx"].as<uint32_t>();
sampling.seed = var_map["seed"].as<int>();
sampling.n_gpu_layers = var_map["n-gpu-layers"].as<int>();
options.generator.sampling = sampling;
}
}
return options;
} catch (const std::exception& exception) {
spdlog::error("Failed to parse command-line arguments: {}",
exception.what());
return std::nullopt;
} catch (...) {
spdlog::error("Failed to parse command-line arguments: unknown error");
return std::nullopt;
}
}

View File

@@ -10,9 +10,7 @@
BiergartenDataGenerator::BiergartenDataGenerator( BiergartenDataGenerator::BiergartenDataGenerator(
std::unique_ptr<IEnrichmentService> context_service, std::unique_ptr<IEnrichmentService> context_service,
std::unique_ptr<DataGenerator> generator, std::unique_ptr<DataGenerator> generator,
std::unique_ptr<IExportService> exporter, std::unique_ptr<IExportService> exporter)
const ApplicationOptions &app_options)
: context_service_(std::move(context_service)), : context_service_(std::move(context_service)),
generator_(std::move(generator)), generator_(std::move(generator)),
exporter_(std::move(exporter)), exporter_(std::move(exporter)) {}
application_options_(app_options) {}

View File

@@ -13,6 +13,8 @@
#include "biergarten_data_generator.h" #include "biergarten_data_generator.h"
#include "json_handling/json_loader.h" #include "json_handling/json_loader.h"
static constexpr size_t kBreweryAmount = 50;
std::vector<Location> BiergartenDataGenerator::QueryCitiesWithCountries() { std::vector<Location> BiergartenDataGenerator::QueryCitiesWithCountries() {
spdlog::info("\n=== GEOGRAPHIC DATA OVERVIEW ==="); spdlog::info("\n=== GEOGRAPHIC DATA OVERVIEW ===");
@@ -21,9 +23,7 @@ std::vector<Location> BiergartenDataGenerator::QueryCitiesWithCountries() {
auto all_locations = JsonLoader::LoadLocations(locations_path); auto all_locations = JsonLoader::LoadLocations(locations_path);
spdlog::info(" Locations available: {}", all_locations.size()); spdlog::info(" Locations available: {}", all_locations.size());
const size_t sample_count = std::min( const size_t sample_count = std::min(kBreweryAmount, all_locations.size());
static_cast<size_t>(application_options_.pipeline.location_count),
all_locations.size());
const auto sample_count_signed = const auto sample_count_signed =
static_cast<std::iter_difference_t<decltype(all_locations.cbegin())>>( static_cast<std::iter_difference_t<decltype(all_locations.cbegin())>>(

View File

@@ -21,8 +21,8 @@ bool BiergartenDataGenerator::Run() {
for (auto& city : cities) { for (auto& city : cities) {
try { try {
std::string region_context = context_service_->GetLocationContext(city); std::string region_context = context_service_->GetLocationContext(city);
// spdlog::debug("[Pipeline] Context for '{}' ({}) gathered:\n{}", spdlog::debug("[Pipeline] Context for '{}' ({}) gathered:\n{}",
// city.city, city.iso3166_2, region_context); city.city, city.country, region_context);
enriched.push_back( enriched.push_back(
EnrichedCity{.location = std::move(city), EnrichedCity{.location = std::move(city),

View File

@@ -11,7 +11,7 @@
#include <stdexcept> #include <stdexcept>
#include <string> #include <string>
#include "data_model/models.h" #include "data_model/application_options.h"
#include "llama.h" #include "llama.h"
static constexpr uint32_t kMaxContextSize = 32768U; static constexpr uint32_t kMaxContextSize = 32768U;
@@ -89,7 +89,6 @@ LlamaGenerator::LlamaGenerator(
} }
n_ctx_ = sampling.n_ctx; n_ctx_ = sampling.n_ctx;
n_gpu_layers_ = sampling.n_gpu_layers;
this->Load(model_path); this->Load(model_path);
} }

View File

@@ -12,7 +12,6 @@
#include <utility> #include <utility>
#include "data_generation/llama_generator.h" #include "data_generation/llama_generator.h"
#include "ggml-backend.h"
#include "llama.h" #include "llama.h"
// Maximum batch size for decode operations. Capping the batch prevents // Maximum batch size for decode operations. Capping the batch prevents
@@ -23,12 +22,7 @@ void LlamaGenerator::Load(const std::string& model_path) {
context_.reset(); context_.reset();
model_.reset(); model_.reset();
// Specifically load dynamic ggml backends (like CUDA) that are provided const llama_model_params model_params = llama_model_default_params();
// externally before attempting to load a model.
ggml_backend_load_all();
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = n_gpu_layers_;
LlamaGenerator::ModelHandle loaded_model( LlamaGenerator::ModelHandle loaded_model(
llama_model_load_from_file(model_path.c_str(), model_params)); llama_model_load_from_file(model_path.c_str(), model_params));
if (!loaded_model) { if (!loaded_model) {

View File

@@ -8,43 +8,178 @@
#include <boost/di.hpp> #include <boost/di.hpp>
#include <boost/program_options.hpp> #include <boost/program_options.hpp>
#include <chrono>
#include <cstdint>
#include <exception> #include <exception>
#include <memory> #include <memory>
#include <optional> #include <optional>
#include <sstream>
#include <string> #include <string>
#include "biergarten_data_generator.h" #include "biergarten_data_generator.h"
#include "data_generation/llama_generator.h" #include "data_generation/llama_generator.h"
#include "data_generation/mock_generator.h" #include "data_generation/mock_generator.h"
#include "data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.h" #include "data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.h"
#include "data_model/models.h" #include "data_model/application_options.h"
#include "llama_backend_state.h" #include "llama_backend_state.h"
#include "services/database/export_service.h" #include "services/enrichment_service.h"
#include "services/database/sqlite_export_service.h" #include "services/export_service.h"
#include "services/datetime/timer.h" #include "services/prompt_directory.h"
#include "services/enrichment/enrichment_service.h" #include "services/sqlite_export_service.h"
#include "services/enrichment/mock_enrichment.h" #include "services/wikipedia_service.h"
#include "services/enrichment/wikipedia_service.h" #include "web_client/curl_web_client.h"
#include "services/prompting/prompt_directory.h"
#include "web_client/http_web_client.h"
namespace prog_opts = boost::program_options;
namespace di = boost::di; namespace di = boost::di;
/**
* @brief Parse command-line arguments into ApplicationOptions.
*
* @param argc Command-line argument count.
* @param argv Command-line arguments.
* @return Parsed ApplicationOptions if parsing succeeded, std::nullopt
* otherwise.
*/
std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv) {
prog_opts::options_description desc("Pipeline Options");
auto opt = desc.add_options();
opt("help,h", "Produce help message");
// Generator Options
opt("mocked", prog_opts::bool_switch(),
"Use mocked generator for brewery/user data");
opt("model,m", prog_opts::value<std::string>()->default_value(""),
"Path to LLM model (gguf)");
// Sampling Options - defaults driven from SamplingOptions struct
const SamplingOptions kSamplingDefaults{};
opt("temperature",
prog_opts::value<float>()->default_value(kSamplingDefaults.temperature),
"Sampling temperature (higher = more random)");
opt("top-p",
prog_opts::value<float>()->default_value(kSamplingDefaults.top_p),
"Nucleus sampling top-p in (0,1] (higher = more random)");
opt("top-k",
prog_opts::value<uint32_t>()->default_value(kSamplingDefaults.top_k),
"Top-k sampling parameter (higher = more candidate tokens)");
opt("n-ctx",
prog_opts::value<uint32_t>()->default_value(kSamplingDefaults.n_ctx),
"Context window size in tokens");
opt("seed", prog_opts::value<int>()->default_value(kSamplingDefaults.seed),
"Sampler seed: -1 for random, otherwise non-negative integer");
// Pipeline Options
opt("output,o", prog_opts::value<std::string>()->default_value("output"),
"Directory for generated artifacts");
opt("log-path",
prog_opts::value<std::string>()->default_value("pipeline.log"),
"Path for application logs");
opt("prompt-dir", prog_opts::value<std::string>()->default_value(""),
"Directory containing named prompt files (e.g. BREWERY_GENERATION.md)."
" Required when not using --mocked.");
if (argc == 1) {
spdlog::info("Biergarten Pipeline");
std::stringstream usage_stream;
usage_stream << "\nUsage: biergarten-pipeline [options]\n\n" << desc;
spdlog::info(usage_stream.str());
return std::nullopt;
}
try {
prog_opts::variables_map var_map;
prog_opts::store(prog_opts::parse_command_line(argc, argv, desc), var_map);
prog_opts::notify(var_map);
if (var_map.contains("help")) {
std::stringstream help_stream;
help_stream << "\n" << desc;
spdlog::info(help_stream.str());
return std::nullopt;
}
ApplicationOptions options;
options.pipeline.output_path = var_map["output"].as<std::string>();
options.pipeline.log_path = var_map["log-path"].as<std::string>();
options.pipeline.prompt_dir = var_map["prompt-dir"].as<std::string>();
const bool use_mocked = var_map["mocked"].as<bool>();
const std::string model_path = var_map["model"].as<std::string>();
if (use_mocked && !model_path.empty()) {
spdlog::error(
"Invalid arguments: --mocked and --model are mutually exclusive");
return std::nullopt;
}
if (!use_mocked && model_path.empty()) {
spdlog::error(
"Invalid arguments: Either --mocked or --model must be specified");
return std::nullopt;
}
if (!use_mocked && options.pipeline.prompt_dir.empty()) {
spdlog::error(
"Invalid arguments: --prompt-dir is required when not using "
"--mocked");
return std::nullopt;
}
options.generator.use_mocked = use_mocked;
options.generator.model_path = model_path;
const bool user_provided_sampling =
!var_map["temperature"].defaulted() || !var_map["top-p"].defaulted() ||
!var_map["top-k"].defaulted() || !var_map["n-ctx"].defaulted() ||
!var_map["seed"].defaulted();
if (use_mocked) {
if (user_provided_sampling) {
spdlog::warn("Sampling parameters are ignored when using --mocked");
}
} else if (user_provided_sampling) {
SamplingOptions sampling;
sampling.temperature = var_map["temperature"].as<float>();
sampling.top_p = var_map["top-p"].as<float>();
sampling.top_k = var_map["top-k"].as<uint32_t>();
sampling.n_ctx = var_map["n-ctx"].as<uint32_t>();
sampling.seed = var_map["seed"].as<int>();
options.generator.sampling = sampling;
}
return options;
} catch (const std::exception& exception) {
spdlog::error("Failed to parse command-line arguments: {}",
exception.what());
return std::nullopt;
} catch (...) {
spdlog::error("Failed to parse command-line arguments: unknown error");
return std::nullopt;
}
}
struct Timer {
std::chrono::steady_clock::time_point start_time =
std::chrono::steady_clock::now();
[[nodiscard]] int64_t Elapsed() const {
return std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::steady_clock::now() - start_time)
.count();
}
};
int main(const int argc, char** argv) { int main(const int argc, char** argv) {
try { try {
Timer timer; Timer timer;
const CurlGlobalState curl_state;
const LlamaBackendState llama_backend_state;
spdlog::set_pattern("[%Y-%m-%d %H:%M:%S.%e] [%^%l%$] %v"); spdlog::set_pattern("[%Y-%m-%d %H:%M:%S.%e] [%^%l%$] %v");
#ifndef BIERGARTEN_MOCK_ONLY const auto parsed_options = ParseArguments(argc, argv);
const LlamaBackendState llama_backend_state;
#endif
#ifdef DEBUG
spdlog::set_level(spdlog::level::debug);
#endif
const std::optional<ApplicationOptions> parsed_options =
ParseArguments(argc, argv);
if (!parsed_options.has_value()) { if (!parsed_options.has_value()) {
return 0; return 0;
} }
@@ -66,20 +201,12 @@ int main(const int argc, char** argv) {
} }
const auto injector = di::make_injector( const auto injector = di::make_injector(
di::bind<WebClient>().to<CURLWebClient>(),
di::bind<ApplicationOptions>().to(options), di::bind<ApplicationOptions>().to(options),
di::bind<std::string>().to(model_path), di::bind<IEnrichmentService>().to<WikipediaService>(),
di::bind<WebClient>().to<HttpWebClient>(),
di::bind<IExportService>().to<SqliteExportService>(), di::bind<IExportService>().to<SqliteExportService>(),
di::bind<IPromptFormatter>().to<Gemma4JinjaPromptFormatter>(), di::bind<IPromptFormatter>().to<Gemma4JinjaPromptFormatter>(),
di::bind<IEnrichmentService>().to( di::bind<std::string>().to(model_path),
[options](const auto& inj) -> std::unique_ptr<IEnrichmentService> {
if (options.generator.use_mocked) {
return std::make_unique<MockEnrichmentService>();
}
return std::make_unique<WikipediaEnrichmentService>(
inj.template create<std::unique_ptr<WebClient>>());
}),
di::bind<DataGenerator>().to( di::bind<DataGenerator>().to(
[options, model_path, sampling, &prompt_directory]( [options, model_path, sampling, &prompt_directory](
const auto& inj) -> std::unique_ptr<DataGenerator> { const auto& inj) -> std::unique_ptr<DataGenerator> {
@@ -98,11 +225,9 @@ int main(const int argc, char** argv) {
options, model_path, options, model_path,
inj.template create<std::unique_ptr<IPromptFormatter>>(), inj.template create<std::unique_ptr<IPromptFormatter>>(),
std::move(prompt_directory)); std::move(prompt_directory));
}) }));
); auto generator =
const auto generator =
injector.create<std::unique_ptr<BiergartenDataGenerator>>(); injector.create<std::unique_ptr<BiergartenDataGenerator>>();
if (!generator->Run()) { if (!generator->Run()) {

View File

@@ -1,112 +0,0 @@
/**
* @file wikipedia/fetch_extract.cc
*/
#include <spdlog/spdlog.h>
#include <boost/json.hpp>
#include <chrono>
#include <format>
#include <string>
#include <string_view>
#include <thread>
#include "services/enrichment/wikipedia_service.h"
using namespace boost;
std::string WikipediaEnrichmentService::FetchExtract(std::string_view query) {
const std::string cache_key(query);
// 1. Cache Lookup
if (const auto cache_it = this->extract_cache_.find(cache_key);
cache_it != this->extract_cache_.end()) {
spdlog::debug("Wikipedia: Cache hit for {}!", cache_key);
return cache_it->second;
}
const std::string encoded = this->client_->EncodeURL(cache_key);
const std::string url = std::format(
"https://en.wikipedia.org/w/"
"api.php?action=query&titles={}&prop=extracts&explaintext=1&format=json",
encoded);
const std::string body = this->client_->Get(url);
{
using namespace std::literals::chrono_literals;
std::this_thread::sleep_for(1s);
}
// 2. Parse JSON
system::error_code ec;
json::value doc = json::parse(body, ec);
if (ec) {
spdlog::warn("WikipediaService: JSON parse error for '{}': {}", query,
ec.message());
return {};
}
// 3. Safe Extraction
const json::object* obj = doc.if_object();
if (obj == nullptr) {
spdlog::warn("WikipediaService: Expected root object for '{}'", query);
return {};
}
const json::value* query_ptr = obj->if_contains("query");
const json::value* pages_ptr =
((query_ptr != nullptr) && query_ptr->is_object())
? query_ptr->get_object().if_contains("pages")
: nullptr;
if ((pages_ptr == nullptr) || !pages_ptr->is_object()) {
spdlog::warn("WikipediaService: Missing query.pages for '{}'", query);
return {};
}
const json::object& pages = pages_ptr->get_object();
if (pages.empty()) {
spdlog::warn("WikipediaService: No pages returned for '{}'", query);
this->extract_cache_.emplace(cache_key, "");
return {};
}
// Wikipedia returns the page under a dynamic ID key; we just want the first
// one
const json::value& page_val = pages.begin()->value();
if (!page_val.is_object()) {
spdlog::warn("WikipediaService: Unexpected page format for '{}'", query);
return {};
}
const json::object& page = page_val.get_object();
// Handle 404/Missing status
if (page.contains("missing")) {
spdlog::warn("WikipediaService: Page '{}' does not exist", query);
this->extract_cache_.emplace(cache_key, "");
return {};
}
const json::value* extract_ptr = page.if_contains("extract");
if ((extract_ptr == nullptr) || !extract_ptr->is_string()) {
spdlog::warn("WikipediaService: No extract string found for '{}'", query);
this->extract_cache_.emplace(cache_key, "");
return {};
}
// 4. Success
std::string extract(extract_ptr->as_string());
spdlog::info("WikipediaService: Fetched {} chars for '{}'", extract.size(),
query);
this->extract_cache_.insert_or_assign(cache_key, extract);
return extract;
}

View File

@@ -1,58 +0,0 @@
/**
* @file wikipedia/get_summary.cc
* @brief WikipediaService::GetLocationContext() implementation.
*/
#include <spdlog/spdlog.h>
#include <chrono>
#include <format>
#include <string>
#include <thread>
#include "services/enrichment/wikipedia_service.h"
std::string WikipediaEnrichmentService::GetLocationContext(const Location& loc) {
using namespace std::literals::chrono_literals;
if (!this->client_) {
spdlog::warn("Client is nullptr.");
return {};
}
std::string result;
// std::string region_query(loc.city);
// if (!loc.country.empty()) {
// region_query += loc.state_province,
// region_query += ", ";
// region_query += loc.country;
// }
constexpr std::string_view brewing_query = "brewing";
const std::string location_query =
std::format("{}, {}", loc.city, loc.iso3166_2);
const std::string beer_query = std::format("beer in {}", loc.country);
auto append_extract = [&result](const std::string& extract) -> void {
if (extract.empty()) {
return;
}
if (!result.empty()) {
result += "\n\n";
}
result += extract;
};
try {
append_extract(FetchExtract(brewing_query));
append_extract(FetchExtract(beer_query));
spdlog::info("Done fetching for {}. Sleeping for 10 seconds.",
location_query);
std::this_thread::sleep_for(10s);
} catch (const std::runtime_error& e) {
spdlog::debug("WikipediaService lookup failed for '{}': {}", location_query,
e.what());
}
return result;
}

View File

@@ -4,7 +4,7 @@
* construction and loads named prompt files on demand with in-process caching. * construction and loads named prompt files on demand with in-process caching.
*/ */
#include "services/prompting/prompt_directory.h" #include "services/prompt_directory.h"
#include <spdlog/spdlog.h> #include <spdlog/spdlog.h>

View File

@@ -0,0 +1,23 @@
/**
* @file services/sqlite/build_database_path.cc
* @brief SqliteExportService::BuildDatabasePath() implementation.
*/
#include <filesystem>
#include <string>
#include "services/sqlite_export_service.h"
std::filesystem::path SqliteExportService::BuildDatabasePath() const {
std::filesystem::path base_filename("biergarten_seed_" + run_timestamp_utc_ +
".sqlite");
std::filesystem::path candidate = output_path_ / base_filename;
for (int suffix = 1; std::filesystem::exists(candidate); ++suffix) {
candidate = output_path_ /
std::filesystem::path("biergarten_seed_" + run_timestamp_utc_ +
"-" + std::to_string(suffix) + ".sqlite");
}
return candidate;
}

View File

@@ -5,8 +5,8 @@
#include <stdexcept> #include <stdexcept>
#include "services/database/sqlite_export_service.h" #include "services/sqlite_export_service.h"
#include "services/database/sqlite_export_service_helpers.h" #include "services/sqlite_export_service_helpers.h"
void SqliteExportService::Finalize() { void SqliteExportService::Finalize() {
if (db_handle_ == nullptr) { if (db_handle_ == nullptr) {
@@ -28,4 +28,4 @@ void SqliteExportService::Finalize() {
RollbackAndCloseNoThrow(); RollbackAndCloseNoThrow();
throw; throw;
} }
} }

View File

@@ -1,4 +1,4 @@
#include "services/database/sqlite_connection_helpers.h" #include "services/sqlite_connection_helpers.h"
#include <stdexcept> #include <stdexcept>

View File

@@ -1,4 +1,4 @@
#include "services/database/sqlite_statement_helpers.h" #include "services/sqlite_statement_helpers.h"
#include <boost/json.hpp> #include <boost/json.hpp>
#include <cstring> #include <cstring>
@@ -6,7 +6,7 @@
#include <memory> #include <memory>
#include <stdexcept> #include <stdexcept>
#include "services/database/sqlite_connection_helpers.h" #include "services/sqlite_connection_helpers.h"
namespace sqlite_export_service_internal { namespace sqlite_export_service_internal {

View File

@@ -8,22 +8,8 @@
#include <stdexcept> #include <stdexcept>
#include <string> #include <string>
#include "services/database/sqlite_export_service.h" #include "services/sqlite_export_service.h"
#include "services/database/sqlite_export_service_helpers.h" #include "services/sqlite_export_service_helpers.h"
std::filesystem::path SqliteExportService::BuildDatabasePath() const {
std::filesystem::path base_filename("biergarten_seed_" + run_timestamp_utc_ +
".sqlite");
std::filesystem::path candidate = output_path_ / base_filename;
for (int suffix = 1; std::filesystem::exists(candidate); ++suffix) {
candidate = output_path_ /
std::filesystem::path("biergarten_seed_" + run_timestamp_utc_ +
"-" + std::to_string(suffix) + ".sqlite");
}
return candidate;
}
void SqliteExportService::InitializeSchema() const { void SqliteExportService::InitializeSchema() const {
sqlite_export_service_internal::ExecSql( sqlite_export_service_internal::ExecSql(

View File

@@ -8,8 +8,8 @@
#include <stdexcept> #include <stdexcept>
#include <string> #include <string>
#include "services/database/sqlite_export_service.h" #include "services/sqlite_export_service.h"
#include "services/database/sqlite_export_service_helpers.h" #include "services/sqlite_export_service_helpers.h"
constexpr int kLocationPrecision = 17; constexpr int kLocationPrecision = 17;

View File

@@ -3,7 +3,7 @@
* @brief SqliteExportService constructor and destructor implementation. * @brief SqliteExportService constructor and destructor implementation.
*/ */
#include "services/database/sqlite_export_service.h" #include "services/sqlite_export_service.h"
#include <memory> #include <memory>

View File

@@ -0,0 +1,61 @@
/**
* @file wikipedia/fetch_extract.cc
* @brief WikipediaService::FetchExtract() implementation.
*/
#include <spdlog/spdlog.h>
#include <boost/json.hpp>
#include <string>
#include <string_view>
#include "services/wikipedia_service.h"
std::string WikipediaService::FetchExtract(std::string_view query) {
const std::string cache_key(query);
const auto cache_it = this->extract_cache_.find(cache_key);
if (cache_it != this->extract_cache_.end()) {
return cache_it->second;
}
const std::string encoded = this->client_->UrlEncode(cache_key);
const std::string url =
"https://en.wikipedia.org/w/api.php?action=query&titles=" + encoded +
"&prop=extracts&explaintext=1&format=json";
const std::string body = this->client_->Get(url);
boost::system::error_code parse_error;
boost::json::value doc = boost::json::parse(body, parse_error);
if (!parse_error && doc.is_object()) {
try {
auto& pages = doc.at("query").at("pages").get_object();
if (!pages.empty()) {
auto& page = pages.begin()->value().get_object();
if (page.contains("extract") && page.at("extract").is_string()) {
const std::string_view extract_view = page.at("extract").as_string();
std::string extract(extract_view);
spdlog::debug("WikipediaService fetched {} chars for '{}'",
extract.size(), query);
this->extract_cache_.emplace(cache_key, extract);
return extract;
}
}
this->extract_cache_.emplace(cache_key, std::string{});
} catch (const std::exception& e) {
spdlog::warn(
"WikipediaService: failed to parse response structure for '{}': "
"{}",
query, e.what());
return {};
}
} else if (parse_error) {
spdlog::warn("WikipediaService: JSON parse error for '{}': {}", query,
parse_error.message());
}
return {};
}

View File

@@ -0,0 +1,47 @@
/**
* @file wikipedia/get_summary.cc
* @brief WikipediaService::GetLocationContext() implementation.
*/
#include <spdlog/spdlog.h>
#include <string>
#include "services/wikipedia_service.h"
std::string WikipediaService::GetLocationContext(const Location& loc) {
if (!client_) {
return {};
}
std::string result;
std::string region_query(loc.city);
if (!loc.country.empty()) {
region_query += ", ";
region_query += loc.country;
}
const std::string beer_query = "beer in " + loc.country;
const std::string city_beer_query = "beer in " + loc.city;
auto append_extract = [&result](const std::string& extract) -> void {
if (extract.empty()) {
return;
}
if (!result.empty()) {
result += "\n\n";
}
result += extract;
};
try {
append_extract(FetchExtract(region_query));
append_extract(FetchExtract(beer_query));
append_extract(FetchExtract(city_beer_query));
} catch (const std::runtime_error& e) {
spdlog::debug("WikipediaService lookup failed for '{}': {}", region_query,
e.what());
}
return result;
}

View File

@@ -3,10 +3,9 @@
* @brief WikipediaService constructor implementation. * @brief WikipediaService constructor implementation.
*/ */
#include "services/enrichment/wikipedia_service.h" #include "services/wikipedia_service.h"
#include <utility> #include <utility>
WikipediaEnrichmentService::WikipediaEnrichmentService( WikipediaService::WikipediaService(std::unique_ptr<WebClient> client)
std::unique_ptr<WebClient> client)
: client_(std::move(client)) {} : client_(std::move(client)) {}

View File

@@ -0,0 +1,19 @@
/**
* @file web_client/curl_global_state.cc
* @brief CurlGlobalState constructor and destructor implementation.
*/
#include <curl/curl.h>
#include <stdexcept>
#include "web_client/curl_web_client.h"
CurlGlobalState::CurlGlobalState() {
if (curl_global_init(CURL_GLOBAL_DEFAULT) != CURLE_OK) {
throw std::runtime_error(
"[CURLWebClient] Failed to initialize libcurl globally");
}
}
CurlGlobalState::~CurlGlobalState() { curl_global_cleanup(); }

View File

@@ -0,0 +1,87 @@
/**
* @file web_client/curl_web_client_get.cc
* @brief CURLWebClient::Get() implementation.
*/
#include <curl/curl.h>
#include <cstdint>
#include <limits>
#include <memory>
#include <stdexcept>
#include <string>
#include "web_client/curl_web_client.h"
using CurlHandle = std::unique_ptr<CURL, decltype(&curl_easy_cleanup)>;
static constexpr long kConnectionTimeout = 10;
static constexpr long kRequestTimeout = 30;
static constexpr long kMaxRedirects = 5;
static constexpr int32_t kOkHttpStatus = 200;
static CurlHandle CreateHandle() {
CURL* handle = curl_easy_init();
if (handle == nullptr) {
throw std::runtime_error(
"[CURLWebClient] Failed to initialize libcurl handle");
}
return {handle, &curl_easy_cleanup};
}
static void SetCommonGetOptions(CURL* curl, const std::string& url) {
curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
curl_easy_setopt(curl, CURLOPT_USERAGENT, "biergarten-pipeline/0.1.0");
curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
curl_easy_setopt(curl, CURLOPT_MAXREDIRS, kMaxRedirects);
curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT, kConnectionTimeout);
curl_easy_setopt(curl, CURLOPT_TIMEOUT, kRequestTimeout);
curl_easy_setopt(curl, CURLOPT_ACCEPT_ENCODING, "gzip");
}
// curl write callback that appends response data into a std::string
static size_t WriteCallbackString(void* contents, const size_t size,
const size_t nmemb, void* userp) {
const size_t real_size = size * nmemb;
auto* str = static_cast<std::string*>(userp);
str->append(static_cast<char*>(contents), real_size);
return real_size;
}
std::string CURLWebClient::Get(const std::string& url) {
const CurlHandle curl = CreateHandle();
std::string response_string;
SetCommonGetOptions(curl.get(), url);
curl_easy_setopt(curl.get(), CURLOPT_WRITEFUNCTION, WriteCallbackString);
curl_easy_setopt(curl.get(), CURLOPT_WRITEDATA, &response_string);
CURLcode curl_result = curl_easy_perform(curl.get());
if (curl_result != CURLE_OK) {
const auto error = std::string("[CURLWebClient] GET failed: ") +
curl_easy_strerror(curl_result);
throw std::runtime_error(error);
}
long curl_http_code = 0;
curl_easy_getinfo(curl.get(), CURLINFO_RESPONSE_CODE, &curl_http_code);
if (curl_http_code < std::numeric_limits<int32_t>::min() ||
curl_http_code > std::numeric_limits<int32_t>::max()) {
throw std::runtime_error("[CURLWebClient] Invalid HTTP status code: " +
std::to_string(curl_http_code));
}
const int32_t http_code = static_cast<int32_t>(curl_http_code);
if (http_code != kOkHttpStatus) {
const std::string error = "[CURLWebClient] HTTP error " +
std::to_string(http_code) + " for URL " + url;
throw std::runtime_error(error);
}
return response_string;
}

View File

@@ -0,0 +1,24 @@
/**
* @file web_client/curl_web_client_url_encode.cc
* @brief CURLWebClient::UrlEncode() implementation.
*/
#include <curl/curl.h>
#include <stdexcept>
#include <string>
#include "web_client/curl_web_client.h"
std::string CURLWebClient::UrlEncode(const std::string& value) {
// A NULL handle is fine for UTF-8 encoding according to libcurl docs.
char* output = curl_easy_escape(nullptr, value.c_str(), 0);
if (!output) {
throw std::runtime_error("[CURLWebClient] curl_easy_escape failed");
}
std::string result(output);
curl_free(output);
return result;
}

View File

@@ -1,68 +0,0 @@
/**
* @file web_client/http_web_client.cc
* @brief cpp-httplib implementation of WebClient.
*/
#include "web_client/http_web_client.h"
#include <httplib.h>
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include "spdlog/spdlog.h"
namespace {
constexpr time_t kConnectionTimeoutSeconds = 5;
constexpr time_t kReadTimeoutSeconds = 10;
constexpr int kSuccessMin = 200;
constexpr int kSuccessMax = 300;
const std::regex kUrlRegex(
R"(^(https?://[^/?#]+)(/[^?#]*(?:\?[^#]*)?(?:#.*)?)?)");
std::pair<std::string, std::string> SplitUrl(const std::string& url) {
std::smatch match;
if (!std::regex_match(url, match, kUrlRegex)) {
throw std::invalid_argument("[HttpWebClient] Malformed URL: " + url);
}
return {match[1].str(), match[2].matched ? match[2].str() : "/"};
}
} // namespace
std::string HttpWebClient::Get(const std::string& url) {
const auto [origin, path] = SplitUrl(url);
httplib::Client client(origin);
client.set_follow_location(true);
client.set_connection_timeout(kConnectionTimeoutSeconds);
client.set_read_timeout(kReadTimeoutSeconds);
client.set_default_headers({
{"Accept", "application/json"},
{"User-Agent", "biergarten-pipeline/1.0"}
});
const httplib::Result result = client.Get(path);
if (!result) {
throw std::runtime_error(
"[HttpWebClient] Request failed for URL: " + url +
"" + httplib::to_string(result.error()));
}
if (result->status < kSuccessMin || result->status >= kSuccessMax) {
spdlog::error("[HttpWebClient] Request failed for URL: " + url);
throw std::runtime_error(
"[HttpWebClient] HTTP " + std::to_string(result->status) +
" for URL: " + url);
}
return result->body;
}
std::string HttpWebClient::EncodeURL(const std::string& value) {
return httplib::encode_uri_component(value);
}