11 Commits

Author SHA1 Message Date
Aaron Po
5abb3f2e24 Add mock enrichment process 2026-05-14 13:49:59 -04:00
Aaron Po
a057b9197f Add location count to application options and as a cli arg 2026-05-13 22:04:48 -04:00
Aaron Po
773e7c774b Add timeout for enrichment, refactor json deserialization 2026-05-13 12:44:30 -04:00
b7c0b1c8d4 Fix mistake in .gitattributes
archive/* is incorrect as it will ignore sub-dirs
2026-05-12 01:05:07 -04:00
b8ebe03921 Pipeline: Add Runpod docker configuration (#222)
* Begin work on Runpod docker config

* Reduce docker image size

* Create .dockerignore
2026-05-12 00:44:09 -04:00
26635ace84 Organize and consolidate header files (#220) 2026-05-03 21:44:37 -04:00
031be8ad5d Pipeline: Remove CURL as a dependency, add new HTTP module (#219)
Rationale: 

HTTP is a supporting concern in the pipeline, used only for Wikipedia enrichment calls. libcurl's C API required significant boilerplate to wrap safely. cpp-httplib is a header-only library that covers the same functionality with far less overhead and no manual resource management.
2026-05-03 13:35:58 -04:00
f316fabcb0 Update CMakeLists.txt (#218) 2026-05-02 19:27:44 -04:00
b1dc8e0b5d refactor(pipeline): restructure config, add PromptDirectory, consolidate SQLite layer (#217)
* Refactor ApplicationOptions to separate config concerns

* add prompt dir app option

* readability updates: remove magic numbers, update comments

* codebase formatting

* Update docs

* Extract argument parsing, timer out of
2026-05-02 18:27:14 -04:00
641a479b6a Refactor SQLite Export Service and ProcessRecord Method Signatures (#216)
* Helper cleanup

update bind to use dto for params
consolidate translation units


* Update planned class diagram
2026-04-30 19:03:45 -04:00
d80e15b55e Add docs symlink to top level docs/pipeline (#214) 2026-04-30 18:35:46 -04:00
87 changed files with 2132 additions and 1401 deletions

2
.gitattributes vendored

@@ -1 +1 @@
archive/* linguist-vendored archive/** linguist-vendored


@@ -18,6 +18,7 @@ descriptions via a local GGUF model or a deterministic mock.
- [Build](#build) - [Build](#build)
- [Model](#model) - [Model](#model)
- [Run](#run) - [Run](#run)
- [Docker / RunPod](#docker--runpod)
- [Architecture](#architecture) - [Architecture](#architecture)
- [Pipeline Stages](#pipeline-stages) - [Pipeline Stages](#pipeline-stages)
- [Key Components](#key-components) - [Key Components](#key-components)
@@ -51,7 +52,7 @@ step.
### Build ### Build
Requirements: C++20 compiler, CMake 3.24+, libcurl, Boost (JSON and Requirements: C++20 compiler, CMake 3.31+, OpenSSL, Boost (JSON and
ProgramOptions). SQLite is fetched from the upstream amalgamation, so no system ProgramOptions). SQLite is fetched from the upstream amalgamation, so no system
SQLite package is required. SQLite package is required.
@@ -60,6 +61,16 @@ cmake -S . -B build
cmake --build build cmake --build build
``` ```
CMake automatically detects whether a compatible llama.cpp installation is
present on the system (`libllama`, `libggml`, `libggml-base`, and `llama.h`
visible on the default search paths). If found, it links against those
libraries and skips the FetchContent build. If not found, it fetches and builds
llama.cpp from source at tag `b9012`. No additional flags are required in
either case.
Metal is enabled automatically on Apple Silicon. CUDA or HIP/ROCm is detected
automatically on Linux when the relevant toolkit is present.
### Model ### Model
> Skip this step if you only need `--mocked`. > Skip this step if you only need `--mocked`.
@@ -74,33 +85,124 @@ curl -L \
### Run ### Run
Run from `build/` so the copied `locations.json` and `prompts/` are available. Run from `build/` so the copied `locations.json` and `prompts/` are available.
Each run also writes a fresh dated SQLite file such as Each run writes a fresh dated SQLite file such as
`biergarten_seed_2026-04-19T15-30-45.123456Z.sqlite` into the working directory. `biergarten_seed_2026-04-19T15-30-45.123456Z.sqlite` into the working directory.
```bash ```bash
./biergarten-pipeline --mocked ./biergarten-pipeline --mocked
./biergarten-pipeline --model models/google_gemma-4-E4B-it-Q6_K.gguf --temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1
./biergarten-pipeline \
--model ../models/google_gemma-4-E4B-it-Q6_K.gguf \
--prompt-dir prompts \
--temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1
``` ```
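The dated file name described above can be illustrated with a small sketch. This is not the project's actual code: `BuildSeedFileName` is a hypothetical helper, and the pattern is inferred only from the example name `biergarten_seed_2026-04-19T15-30-45.123456Z.sqlite`. It assumes a POSIX platform (for `gmtime_r`) and a `system_clock` whose epoch is the Unix epoch.

```cpp
#include <chrono>
#include <cstdio>
#include <ctime>
#include <string>

// Hypothetical helper (not from the repository): formats a UTC timestamp
// with microsecond precision into the seed-database file-name pattern.
std::string BuildSeedFileName(std::chrono::system_clock::time_point tp) {
  using namespace std::chrono;

  // Split the timestamp into whole seconds and the leftover microseconds.
  // floor keeps the fractional part non-negative even before the epoch.
  const auto since_epoch = duration_cast<microseconds>(tp.time_since_epoch());
  const auto whole_seconds = floor<seconds>(since_epoch);
  const auto micros =
      static_cast<long long>((since_epoch - whole_seconds).count());

  // Break the seconds down into UTC calendar fields.
  const auto tt = static_cast<std::time_t>(whole_seconds.count());
  std::tm utc{};
  gmtime_r(&tt, &utc);  // POSIX; the pipeline only targets Linux and macOS

  // Colons are replaced by dashes so the name is safe on most file systems.
  char buf[64];
  std::snprintf(buf, sizeof(buf),
                "biergarten_seed_%04d-%02d-%02dT%02d-%02d-%02d.%06lldZ.sqlite",
                utc.tm_year + 1900, utc.tm_mon + 1, utc.tm_mday,
                utc.tm_hour, utc.tm_min, utc.tm_sec, micros);
  return std::string{buf};
}
```

Calling `BuildSeedFileName(std::chrono::system_clock::now())` once per run would give each run its own file, which is one plausible way to get the "fresh dated file" behaviour without overwriting earlier output.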
#### CLI Flags #### CLI Flags
| Flag | Purpose | | Flag | Purpose |
| --------------- | ------------------------------------------------------- | | --------------- | ---------------------------------------------------------------------------------------------------- |
| `--mocked` | Deterministic mock generator, no model required. | | `--mocked` | Deterministic mock generator, no model required. |
| `--model, -m` | Path to a GGUF file. Required unless `--mocked` is set. | | `--model, -m` | Path to a GGUF file. Required unless `--mocked` is set. |
| `--temperature` | Sampling temperature. Default: `1.0`. | | `--prompt-dir` | Directory containing prompt files (e.g. `BREWERY_GENERATION.md`). Required unless `--mocked` is set. |
| `--top-p` | Nucleus sampling. Default: `0.95`. | | `--output, -o` | Directory for generated SQLite artifacts. Default: `output`. |
| `--top-k` | Top-k sampling. Default: `64`. | | `--log-path` | Path for application logs. Default: `pipeline.log`. |
| `--n-ctx` | Context window size. Default: `8192`. | | `--temperature` | Sampling temperature. Default: `1.0`. |
| `--seed` | Random seed. Default: `-1` (random at runtime). | | `--top-p` | Nucleus sampling. Default: `0.95`. |
| `--help, -h` | Print usage and exit. | | `--top-k` | Top-k sampling. Default: `64`. |
| `--n-ctx` | Context window size. Default: `8192`. |
| `--seed` | Random seed. Default: `-1` (random at runtime). |
| `--help, -h` | Print usage and exit. |
`--mocked` and `--model` are mutually exclusive. Omitting both exits with an `--mocked` and `--model` are mutually exclusive. Omitting both exits with an
error before the pipeline starts. Sampling flags are ignored when `--mocked` is error before the pipeline starts. Sampling flags are ignored when `--mocked` is
set. set.
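The mode rule above (exactly one of `--mocked` or `--model`) can be sketched as a small validation step that runs before the pipeline starts. The names `ModeFlags` and `ValidateMode` and the error strings are illustrative assumptions, not the project's actual argument-parsing code.

```cpp
#include <optional>
#include <string>

// Hypothetical mirror of the two mode flags; not the repository's real types.
struct ModeFlags {
  bool mocked = false;                    // --mocked
  std::optional<std::string> model_path;  // --model, -m
};

// Returns an error message when the flag combination is invalid,
// or std::nullopt when the pipeline may start.
std::optional<std::string> ValidateMode(const ModeFlags& flags) {
  const bool has_model = flags.model_path.has_value();
  if (flags.mocked && has_model) {
    return std::string{"--mocked and --model are mutually exclusive"};
  }
  if (!flags.mocked && !has_model) {
    return std::string{"either --mocked or --model is required"};
  }
  return std::nullopt;
}
```

Returning the message instead of throwing lets the caller decide how to report it, for example by logging it and exiting with a non-zero status before any model or database work begins.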
The post-build step copies `prompts/` into `build/prompts/`. Rebuild after The post-build step copies `prompts/` into `build/prompts/`. Rebuild after
editing `prompts/system.md`. editing any prompt file.
---
## Docker / RunPod
The `tooling/pipeline/runpod/` directory contains a GPU-ready container
configuration for running the pipeline on RunPod or any Docker host with an
NVIDIA GPU.
### How it works
The container uses a two-stage build. The first stage pulls prebuilt
`libllama`, `libggml`, and backend plugin libraries (including `libggml-cuda.so`
and the CPU variant plugins) from `ghcr.io/ggml-org/llama.cpp:full-cuda`. The
second stage copies those libraries into `/usr/local/lib` and runs `ldconfig` so
the dynamic linker and `dlopen` calls from `ggml_backend_load_all()` can resolve
the CUDA backend plugin at runtime. llama.cpp headers are cloned at the matching
tag and installed into `/usr/local/include`. CMake auto-detects both and skips
the FetchContent source build entirely, keeping image build times short.
`GGML_BACKEND_PATH` is set to `/usr/local/lib` so llama.cpp knows where to scan
for backend plugins.
### Build the image
Run from the `tooling/pipeline/` directory (the CMake project root), not from
inside `runpod/`, so the `COPY . .` step picks up the full project context.
```bash
docker build -t biergarten-pipeline:latest -f runpod/Dockerfile .
```
To monitor the full build output and confirm CMake selects the system llama.cpp:
```bash
docker build \
--progress=plain \
--no-cache \
-t biergarten-pipeline:latest \
-f runpod/Dockerfile \
. 2>&1 | tee build.log
```
Look for `[biergarten] Found system llama.cpp — skipping FetchContent` in the
output to confirm the fast path was taken.
### Run in mocked mode
No model or GPU required. Useful for validating the pipeline logic and SQLite
export path.
```bash
docker run --rm \
-e BIERGARTEN_MODE=mocked \
-v "$PWD/output:/workspace/output" \
-v "$PWD/logs:/workspace/logs" \
biergarten-pipeline:latest
```
### Run in live mode
Mount your GGUF model before starting. The container validates the model path
before launching the binary.
```bash
docker run --rm \
--runtime=nvidia \
-e BIERGARTEN_MODE=live \
-e GGML_BACKEND_PATH="/usr/local/lib/libggml-cuda.so" \
-v "$PWD/models:/workspace/models" \
-v "$PWD/output:/workspace/output" \
-v "$PWD/logs:/workspace/logs" \
biergarten-pipeline:latest
```
The model must be present at `./models/google_gemma-4-E4B-it-Q6_K.gguf` on the
host. See [Model](#model) above for the download command.
### RunPod deployment
Use a GPU pod template. Mount persistent storage for `/workspace/models`,
`/workspace/output`, and `/workspace/logs`. Set `BIERGARTEN_MODE=live` in the
template environment. See `tooling/pipeline/runpod/pod-template.yaml` for a
starter template.
--- ---
@@ -197,16 +299,18 @@ code, latitude, and longitude for each entry.
## Tech Stack ## Tech Stack
- C++20 - C++20
- CMake 3.24+ - CMake 3.31+
- Boost.JSON, Boost.ProgramOptions, Boost.DI - Boost.JSON, Boost.ProgramOptions, Boost.DI
- spdlog - spdlog
- libcurl - cpp-httplib (with OpenSSL)
- SQLite amalgamation fetched and compiled via CMake FetchContent - SQLite amalgamation fetched and compiled via CMake FetchContent
- llama.cpp - llama.cpp (auto-detected from system install or fetched via FetchContent)
- Docker with NVIDIA CUDA 12.6 base image for GPU container builds
- RunPod for cloud GPU inference
The build fetches Boost.DI, spdlog, llama.cpp, and SQLite via CMake. Metal is The build fetches Boost.DI, spdlog, and SQLite via CMake. llama.cpp is fetched
enabled on Apple Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit only when a system installation is not detected. Metal is enabled on Apple
is present. Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
> **Code Style:** Modern C++20 throughout — RAII for ownership, > **Code Style:** Modern C++20 throughout — RAII for ownership,
> `std::unique_ptr` for injected dependencies, `std::optional` for parse > `std::unique_ptr` for injected dependencies, `std::optional` for parse
@@ -218,7 +322,7 @@ is present.
## Tested Hardware ## Tested Hardware
### ARM macOS - M1 Pro ### ARM macOS M1 Pro
| | | | | |
| --------- | --------------------------------- | | --------- | --------------------------------- |
@@ -229,7 +333,7 @@ is present.
| Model | Gemma 4 E4B | | Model | Gemma 4 E4B |
| Inference | llama.cpp with Metal | | Inference | llama.cpp with Metal |
### x86_64 Linux - NVIDIA RTX 2000 ### x86_64 Linux NVIDIA RTX 2000
| | | | | |
| --------- | ------------------------------ | | --------- | ------------------------------ |
@@ -240,6 +344,15 @@ is present.
| Model | Gemma 4 E4B | | Model | Gemma 4 E4B |
| Inference | llama.cpp with CUDA 12.x | | Inference | llama.cpp with CUDA 12.x |
### x86_64 Linux — Docker / RunPod (NVIDIA CUDA)
| | |
| --------- | ------------------------------------------- |
| Host | RunPod GPU pod |
| Base | nvidia/cuda:12.6.3-devel-ubuntu24.04 |
| Model | Gemma 4 E4B Q6_K |
| Inference | llama.cpp prebuilt CUDA backends via dlopen |
--- ---
## Fixture Strategy ## Fixture Strategy
@@ -260,8 +373,9 @@ is present.
| `includes/` | Public headers and shared models. | | `includes/` | Public headers and shared models. |
| `src/` | Implementation files. | | `src/` | Implementation files. |
| `locations.json` | Curated city input copied into the build tree. | | `locations.json` | Curated city input copied into the build tree. |
| `prompts/` | System prompt used by the model-backed path. | | `prompts/` | System prompts used by the model-backed path. |
| `diagrams/` | Architecture and pipeline diagrams. | | `diagrams/` | Architecture and pipeline diagrams. |
| `tooling/pipeline/runpod/` | Dockerfile, launcher, and RunPod pod template. |
| `ETHICS-AND-KNOWN-ISSUES.md` | Ethics, bias, hallucination analysis, mitigations. | | `ETHICS-AND-KNOWN-ISSUES.md` | Ethics, bias, hallucination analysis, mitigations. |
--- ---
@@ -276,6 +390,7 @@ is present.
- `src/data_generation/llama/` — local inference, prompt loading, output - `src/data_generation/llama/` — local inference, prompt loading, output
validation. validation.
- `src/data_generation/mock/` — deterministic fallback. - `src/data_generation/mock/` — deterministic fallback.
- `tooling/pipeline/runpod/` — container build and runtime launcher.
--- ---


@@ -29,7 +29,7 @@ if (Are arguments valid?) then (no)
else (yes) else (yes)
endif endif
:Init CurlGlobalState & LlamaBackendState; :Init OpenSSL global state & LlamaBackendState;
:di::make_injector(...); :di::make_injector(...);
:injector.create<std::unique_ptr<BiergartenDataGenerator>>(); :injector.create<std::unique_ptr<BiergartenDataGenerator>>();
:BiergartenDataGenerator::Run(); :BiergartenDataGenerator::Run();


@@ -52,7 +52,7 @@ interface WebClient <<interface>> {
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
class CURLWebClient { class HttpWebClient {
+ Get(url : const std::string&) : std::string + Get(url : const std::string&) : std::string
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
@@ -130,7 +130,7 @@ BiergartenDataGenerator *-- IExportService : owns
IEnrichmentService <|.. WikipediaService : implements IEnrichmentService <|.. WikipediaService : implements
WikipediaService *-- WebClient : owns WikipediaService *-- WebClient : owns
WebClient <|.. CURLWebClient : implements WebClient <|.. HttpWebClient : implements
DataGenerator <|.. MockGenerator : implements DataGenerator <|.. MockGenerator : implements
DataGenerator <|.. LlamaGenerator : implements DataGenerator <|.. LlamaGenerator : implements


@@ -13,7 +13,7 @@ if (Invalid args?) then (yes)
stop stop
else (no) else (no)
endif endif
:Init CurlGlobalState & LlamaBackendState; :Init OpenSSL global state & LlamaBackendState;
:Build DI injector; :Build DI injector;
:Initialize SqliteExportService; :Initialize SqliteExportService;


@@ -141,37 +141,38 @@ package "Domain: Models" {
LocationContext *-- Completeness LocationContext *-- Completeness
} }
@startuml
package "Domain: Application Configuration"{ package "Domain: Application Configuration" {
class SamplingOptions { class SamplingOptions {
+ temperature : float = 1.0F + temperature: float = 1.0F
+ top_p : float = 0.95F + top_p: float = 0.95F
+ top_k : uint32_t = 64 + top_k: uint32_t = 64
+ n_ctx : uint32_t = 8192 + n_ctx: uint32_t = 8192
+ seed : int = -1 + seed: int = -1
} }
class GeneratorOptions { class GeneratorOptions {
+ model_path : std::filesystem::path + model_path: std::filesystem::path
+ use_mocked : bool = false + use_mocked: bool = false
+ sampling : SamplingOptions + sampling: std::optional<SamplingOptions>
} }
class PipelineOptions { class PipelineOptions {
+ output_path : std::filesystem::path + output_path: std::filesystem::path
+ log_path : std::filesystem::path + log_path: std::filesystem::path
} }
class ApplicationOptions { class ApplicationOptions {
+ generator : GeneratorOptions + generator: GeneratorOptions
+ pipeline : PipelineOptions + pipeline: PipelineOptions
} }
' --- Domain Model Relationships --- ' --- Domain Model Relationships ---
ApplicationOptions *-- GeneratorOptions ApplicationOptions *-- GeneratorOptions
ApplicationOptions *-- PipelineOptions ApplicationOptions *-- PipelineOptions
GeneratorOptions *-- SamplingOptions GeneratorOptions o-- SamplingOptions
} }
@enduml
package "Domain: Policy" { package "Domain: Policy" {
@@ -355,7 +356,7 @@ package "Infrastructure: Enrichment" {
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
class CURLWebClient { class HttpWebClient {
+ Get(url : const std::string&) : std::string + Get(url : const std::string&) : std::string
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
@@ -435,12 +436,12 @@ package "Infrastructure: Data Export" {
- location_cache_ : std::unordered_map<std::string, uint64_t> - location_cache_ : std::unordered_map<std::string, uint64_t>
- brewery_cache_ : std::unordered_map<std::string, uint64_t> - brewery_cache_ : std::unordered_map<std::string, uint64_t>
+ Initialize() : void + Initialize() : void
+ ProcessBrewery(brewery : const GeneratedBrewery&) : uint64_t + ProcessRecord(brewery : const GeneratedBrewery&) : uint64_t
+ ProcessBeer(beer : const GeneratedBeer&) : uint64_t + ProcessRecord(beer : const GeneratedBeer&) : uint64_t
+ ProcessUser(user : const GeneratedUser&) : uint64_t + ProcessRecord(user : const GeneratedUser&) : uint64_t
+ ProcessCheckin(checkin : const GeneratedCheckin&) : uint64_t + ProcessRecord(checkin : const GeneratedCheckin&) : uint64_t
+ ProcessRating(rating : const GeneratedRating&) : void + ProcessRecord(rating : const GeneratedRating&) : void
+ ProcessFollow(follow : const GeneratedFollow&) : void + ProcessRecord(follow : const GeneratedFollow&) : void
+ Finalize() : void + Finalize() : void
- InitializeSchema() : void - InitializeSchema() : void
- PrepareStatements() : void - PrepareStatements() : void
@@ -519,7 +520,7 @@ CheckinDistributionStrategy <|.. RandomCheckinStrategy
FollowGenerationStrategy <|.. RandomFollowStrategy FollowGenerationStrategy <|.. RandomFollowStrategy
FollowGenerationStrategy <|.. ActivityWeightedFollowStrategy FollowGenerationStrategy <|.. ActivityWeightedFollowStrategy
EnrichmentService <|.. WikipediaService EnrichmentService <|.. WikipediaService
WebClient <|.. CURLWebClient WebClient <|.. HttpWebClient
DataGenerator <|.. MockGenerator DataGenerator <|.. MockGenerator
DataGenerator <|.. LlamaGenerator DataGenerator <|.. LlamaGenerator
PromptFormatter <|.. Gemma4JinjaPromptFormatter PromptFormatter <|.. Gemma4JinjaPromptFormatter


@@ -0,0 +1,9 @@
build/
cmake-build-debug/
.git/
.idea/
**/*.sqlite
**/*.log
**/*.sqlite3
**/*.db


@@ -6,3 +6,4 @@ data
models models
*.gguf *.gguf
BiergartenPipeline.png BiergartenPipeline.png
output


@@ -1,181 +1,255 @@
cmake_minimum_required(VERSION 3.24) cmake_minimum_required(VERSION 3.31)
project(biergarten-pipeline) project(biergarten-pipeline)
set(CMAKE_POLICY_VERSION_MINIMUM 3.5 CACHE STRING "" FORCE) # Set policy to allow FetchContent_Populate for header-only libraries
# that have outdated CMakeLists.txt files
cmake_policy(SET CMP0169 OLD)
# ============================================================================= # 1. Build Options
# 1. Platform & GPU Detection
# ============================================================================= option(BIERGARTEN_MOCK_ONLY "Build with mock data generators only — skips llama.cpp" OFF)
if(WIN32) if(BIERGARTEN_MOCK_ONLY)
message(FATAL_ERROR "[biergarten] Windows is currently not supported. Please use Linux (Fedora 43) or macOS (M1 Pro).") message(STATUS "[biergarten] MOCK_ONLY build — llama.cpp will not be compiled.")
endif()
# 2. Platform & GPU Detection
if(NOT UNIX)
message(FATAL_ERROR "[biergarten] Windows is not supported. Please use Linux (Fedora 43) or macOS (M1 Pro).")
endif() endif()
if(APPLE) if(APPLE)
if(CMAKE_SYSTEM_PROCESSOR MATCHES "arm64") if(CMAKE_SYSTEM_PROCESSOR MATCHES "arm64")
message(STATUS "[biergarten] Apple Silicon detected — enabling Metal acceleration.") message(STATUS "[biergarten] Apple Silicon detected — enabling Metal acceleration.")
set(GGML_METAL ON CACHE BOOL "Enable Metal for Apple Silicon" FORCE) set(GGML_METAL ON CACHE BOOL "Enable Metal for Apple Silicon" FORCE)
else() else()
message(STATUS "[biergarten] Intel Mac detected — using CPU / Accelerate framework.") message(STATUS "[biergarten] Intel Mac detected — using CPU / Accelerate framework.")
set(GGML_METAL OFF CACHE BOOL "Disable Metal for Intel Macs" FORCE) set(GGML_METAL OFF CACHE BOOL "Disable Metal for Intel Macs" FORCE)
endif() endif()
elseif(UNIX AND NOT APPLE) else()
find_package(CUDAToolkit QUIET) find_package(CUDAToolkit QUIET)
find_package(HIP QUIET) find_package(hip CONFIG QUIET)
if(CUDAToolkit_FOUND) if(CUDAToolkit_FOUND)
message(STATUS "[biergarten] NVIDIA GPU detected — enabling CUDA acceleration.") message(STATUS "[biergarten] NVIDIA GPU detected — enabling CUDA acceleration.")
set(GGML_CUDA ON CACHE BOOL "Enable CUDA for NVIDIA GPUs" FORCE) set(GGML_CUDA ON CACHE BOOL "Enable CUDA for NVIDIA GPUs" FORCE)
set(CMAKE_CUDA_ARCHITECTURES native) set(CMAKE_CUDA_ARCHITECTURES native)
elseif(HIP_FOUND OR EXISTS "/opt/rocm") elseif(hip_FOUND OR DEFINED ENV{ROCM_PATH} OR EXISTS "/opt/rocm")
message(STATUS "[biergarten] AMD GPU detected — enabling HIP/ROCm acceleration.") message(STATUS "[biergarten] AMD GPU detected — enabling HIP/ROCm acceleration.")
set(GGML_HIPBLAS ON CACHE BOOL "Enable HIP for AMD GPUs" FORCE) set(GGML_HIPBLAS ON CACHE BOOL "Enable HIP for AMD GPUs" FORCE)
else() else()
message(STATUS "[biergarten] No NVIDIA or AMD GPU found — falling back to CPU.") message(STATUS "[biergarten] No NVIDIA or AMD GPU found — falling back to CPU.")
endif() endif()
endif() endif()
# ============================================================================= # 3. Project-wide Settings
# 2. Project-wide Settings (Standard & Optimization)
# =============================================================================
set(CMAKE_CXX_STANDARD 20) set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON) set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON) set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
# Release Build Optimization: Aggressive (-O3), Arch-specific, and LTO
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3 -march=native -flto") set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3 -march=native -flto")
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -Og -g") set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -Og -g")
# ============================================================================= # 4. Dependencies
# 3. Dependencies
# =============================================================================
include(FetchContent) include(FetchContent)
find_package(CURL QUIET) # Boost (system install — via dnf/brew)
if(NOT CURL_FOUND)
message(FATAL_ERROR "[biergarten] libcurl not found. Install it (e.g. 'sudo dnf install libcurl-devel').")
endif()
# Require system Boost for JSON and Program Options to speed up build times
find_package(Boost REQUIRED COMPONENTS json program_options) find_package(Boost REQUIRED COMPONENTS json program_options)
# Boost.DI (unofficial Boost extension, must declare separately from main Boost dependency)
# Header-only library, so we only fetch without invoking its CMakeLists.txt
FetchContent_Declare( FetchContent_Declare(
sqlite_amalgamation boost-di
URL https://www.sqlite.org/2026/sqlite-amalgamation-3530000.zip GIT_REPOSITORY https://github.com/boost-ext/di.git
URL_HASH SHA3_256=c2325c53b3b41761469f91cfb078e96882ac5d85bac10c11b0bd8f253b031e5b GIT_TAG v1.3.0
GIT_SHALLOW TRUE
) )
FetchContent_GetProperties(sqlite_amalgamation) FetchContent_GetProperties(boost-di)
if(NOT sqlite_amalgamation_POPULATED) if(NOT boost-di_POPULATED)
FetchContent_Populate(sqlite_amalgamation) FetchContent_Populate(boost-di)
endif() endif()
add_library(boost_di INTERFACE)
add_library(boost::di ALIAS boost_di)
target_include_directories(boost_di INTERFACE
$<BUILD_INTERFACE:${boost-di_SOURCE_DIR}/include>
)
# SQLite amalgamation
FetchContent_Declare(
sqlite_amalgamation
URL https://www.sqlite.org/2026/sqlite-amalgamation-3530000.zip
URL_HASH SHA3_256=c2325c53b3b41761469f91cfb078e96882ac5d85bac10c11b0bd8f253b031e5b
EXCLUDE_FROM_ALL
)
FetchContent_MakeAvailable(sqlite_amalgamation)
if(NOT TARGET sqlite3) if(NOT TARGET sqlite3)
add_library(sqlite3 STATIC add_library(sqlite3 STATIC ${sqlite_amalgamation_SOURCE_DIR}/sqlite3.c)
${sqlite_amalgamation_SOURCE_DIR}/sqlite3.c target_include_directories(sqlite3 PUBLIC ${sqlite_amalgamation_SOURCE_DIR})
) target_compile_definitions(sqlite3 PUBLIC SQLITE_THREADSAFE=1)
target_include_directories(sqlite3 PUBLIC
${sqlite_amalgamation_SOURCE_DIR}
)
target_compile_definitions(sqlite3 PUBLIC
SQLITE_THREADSAFE=1
)
endif() endif()
FetchContent_Declare( # llama.cpp — skipped for mock-only builds
llama-cpp if(NOT BIERGARTEN_MOCK_ONLY)
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp.git find_library(LLAMA_LIB NAMES llama)
GIT_TAG b8742 find_library(GGML_LIB NAMES ggml)
) find_library(GGML_BASE_LIB NAMES ggml-base)
FetchContent_MakeAvailable(llama-cpp) find_path(LLAMA_INC_DIR NAMES llama.h PATH_SUFFIXES include)
FetchContent_Declare( if(LLAMA_LIB AND GGML_LIB AND GGML_BASE_LIB AND LLAMA_INC_DIR)
boost-di message(STATUS "[biergarten] Found system llama.cpp — skipping FetchContent")
GIT_REPOSITORY https://github.com/boost-ext/di.git
GIT_TAG v1.3.0 add_library(llama SHARED IMPORTED)
) set_target_properties(llama PROPERTIES
FetchContent_MakeAvailable(boost-di) IMPORTED_LOCATION "${LLAMA_LIB}"
if(TARGET Boost.DI AND NOT TARGET boost::di) INTERFACE_INCLUDE_DIRECTORIES "${LLAMA_INC_DIR}"
add_library(boost::di ALIAS Boost.DI) INTERFACE_LINK_LIBRARIES "${GGML_LIB};${GGML_BASE_LIB}"
)
else()
message(STATUS "[biergarten] System llama.cpp not found — fetching via FetchContent")
FetchContent_Declare(
llama-cpp
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp.git
GIT_TAG b9012
)
FetchContent_MakeAvailable(llama-cpp)
endif()
endif() endif()
# spdlog
FetchContent_Declare( FetchContent_Declare(
spdlog spdlog
GIT_REPOSITORY https://github.com/gabime/spdlog.git GIT_REPOSITORY https://github.com/gabime/spdlog.git
GIT_TAG v1.15.3 GIT_TAG v1.15.3
) )
FetchContent_MakeAvailable(spdlog) FetchContent_MakeAvailable(spdlog)
# ============================================================================= # cpp-httplib — header-only HTTP/HTTPS client replacing libcurl.
# 4. Sources # OpenSSL is required for HTTPS (Wikipedia API). find_package locates
# ============================================================================= # libssl/libcrypto; HTTPLIB_REQUIRE_OPENSSL causes a hard build failure
set(SOURCES # if OpenSSL is absent rather than silently producing an HTTP-only binary.
src/main.cc find_package(OpenSSL REQUIRED)
src/biergarten_data_generator/biergarten_data_generator.cc FetchContent_Declare(
src/biergarten_data_generator/run.cc cpp-httplib
src/biergarten_data_generator/query_cities_with_countries.cc GIT_REPOSITORY https://github.com/yhirose/cpp-httplib.git
src/biergarten_data_generator/generate_breweries.cc GIT_TAG v0.43.2
src/biergarten_data_generator/log_results.cc GIT_SHALLOW TRUE
src/services/wikipedia/wikipedia_service.cc SYSTEM
src/services/wikipedia/get_summary.cc )
src/services/wikipedia/fetch_extract.cc set(HTTPLIB_REQUIRE_OPENSSL ON CACHE BOOL "Require OpenSSL for cpp-httplib" FORCE)
src/services/sqlite/sqlite_export_service.cc FetchContent_MakeAvailable(cpp-httplib)
src/services/sqlite/build_database_path.cc
src/services/sqlite/build_location_key.cc # 5. Executable & Sources
src/services/sqlite/initialize_schema.cc add_executable(${PROJECT_NAME}
src/services/sqlite/prepare_statements.cc includes/services/enrichment/mock_enrichment.h)
src/services/sqlite/initialize.cc
src/services/sqlite/process_record.cc # --- Entry point ---
src/services/sqlite/finalize_statements.cc target_sources(${PROJECT_NAME} PRIVATE
src/services/sqlite/rollback_and_close_no_throw.cc src/main.cc
src/services/sqlite/finalize.cc
src/web_client/curl_global_state.cc
src/web_client/curl_web_client_get.cc
src/web_client/curl_web_client_url_encode.cc
src/data_generation/llama/llama_generator.cc
src/data_generation/llama/generate_brewery.cc
src/data_generation/llama/generate_user.cc
src/data_generation/llama/helpers.cc
src/data_generation/llama/infer.cc
src/data_generation/llama/load.cc
src/data_generation/llama/load_brewery_prompt.cc
src/data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.cc
src/data_generation/mock/deterministic_hash.cc
src/data_generation/mock/generate_brewery.cc
src/data_generation/mock/generate_user.cc
src/json_handling/json_loader.cc
) )
# ============================================================================= # --- json_handling ---
# 5. Target target_sources(${PROJECT_NAME} PRIVATE
# ============================================================================= src/json_handling/json_loader.cc
add_executable(${PROJECT_NAME} ${SOURCES}) )
# --- application_options ---
target_sources(${PROJECT_NAME} PRIVATE
src/application_options/parse_arguments.cc
)
# --- biergarten_data_generator ---
target_sources(${PROJECT_NAME} PRIVATE
src/biergarten_data_generator/log_results.cc
src/biergarten_data_generator/biergarten_data_generator.cc
src/biergarten_data_generator/generate_breweries.cc
src/biergarten_data_generator/run.cc
src/biergarten_data_generator/query_cities_with_countries.cc
)
# --- web_client ---
target_sources(${PROJECT_NAME} PRIVATE
src/web_client/http_web_client.cc
)
# --- data_generation: prompt_formatting ---
target_sources(${PROJECT_NAME} PRIVATE
src/data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.cc
)
# --- data_generation: mock ---
target_sources(${PROJECT_NAME} PRIVATE
src/data_generation/mock/generate_brewery.cc
src/data_generation/mock/generate_user.cc
src/data_generation/mock/deterministic_hash.cc
)
# --- data_generation: llama (skipped for mock-only builds) ---
if(NOT BIERGARTEN_MOCK_ONLY)
target_sources(${PROJECT_NAME} PRIVATE
src/data_generation/llama/load.cc
src/data_generation/llama/helpers.cc
src/data_generation/llama/generate_brewery.cc
src/data_generation/llama/infer.cc
src/data_generation/llama/llama_generator.cc
src/data_generation/llama/generate_user.cc
)
endif()
# --- services: wikipedia ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/enrichment/wikipedia/wikipedia_service.cc
src/services/enrichment/wikipedia/fetch_extract.cc
src/services/enrichment/wikipedia/get_summary.cc
)
# --- services: sqlite ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/sqlite/process_record.cc
src/services/sqlite/sqlite_export_service.cc
src/services/sqlite/finalize.cc
src/services/sqlite/initialize.cc
src/services/sqlite/helpers/sqlite_connection_helpers.cc
src/services/sqlite/helpers/sqlite_statement_helpers.cc
)
# --- services (top-level) ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/prompt_directory.cc
)
# 6. Include Directories, Link Libraries & Compile Definitions
target_include_directories(${PROJECT_NAME} PRIVATE target_include_directories(${PROJECT_NAME} PRIVATE
includes includes
${llama-cpp_SOURCE_DIR}/include
${llama-cpp_SOURCE_DIR}/common
) )
target_link_libraries(${PROJECT_NAME} PRIVATE target_link_libraries(${PROJECT_NAME} PRIVATE
llama $<$<NOT:$<BOOL:${BIERGARTEN_MOCK_ONLY}>>:llama>
boost::di boost::di
Boost::json Boost::json
Boost::program_options Boost::program_options
spdlog::spdlog spdlog::spdlog
sqlite3 sqlite3
CURL::libcurl httplib::httplib
OpenSSL::SSL
OpenSSL::Crypto
) )
# ============================================================================= target_compile_definitions(${PROJECT_NAME} PRIVATE
# 6. Runtime Assets # Defined when -DBIERGARTEN_MOCK_ONLY=ON — skips llama.cpp entirely.
# ============================================================================= # Use #ifdef BIERGARTEN_MOCK_ONLY in source to guard llama-specific code.
$<$<BOOL:${BIERGARTEN_MOCK_ONLY}>:BIERGARTEN_MOCK_ONLY>
# Defined for Debug configuration builds.
# Use #ifdef DEBUG in source to enable debug-only behaviour (e.g. verbose logging).
$<$<CONFIG:Debug>:DEBUG>
)
# 7. Runtime Assets
configure_file( configure_file(
${CMAKE_SOURCE_DIR}/locations.json ${CMAKE_SOURCE_DIR}/locations.json
${CMAKE_BINARY_DIR}/locations.json ${CMAKE_BINARY_DIR}/locations.json
COPYONLY COPYONLY
) )
add_custom_command(TARGET ${PROJECT_NAME} POST_BUILD add_custom_command(TARGET ${PROJECT_NAME} POST_BUILD
COMMAND ${CMAKE_COMMAND} -E copy_directory COMMAND ${CMAKE_COMMAND} -E copy_directory
${CMAKE_SOURCE_DIR}/prompts ${CMAKE_SOURCE_DIR}/prompts
${CMAKE_BINARY_DIR}/prompts ${CMAKE_BINARY_DIR}/prompts
) )
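A minimal sketch of how the two compile definitions above might be consumed from C++ source. `ActiveBackendName` and `IsDebugBuild` are hypothetical helpers that do not appear in this change; only the macro names `BIERGARTEN_MOCK_ONLY` and `DEBUG` are taken from the CMake comments.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not in the repository): reports which generator
// backend this translation unit was compiled for.
std::string ActiveBackendName() {
#ifdef BIERGARTEN_MOCK_ONLY
  // Configured with -DBIERGARTEN_MOCK_ONLY=ON: llama-specific code is skipped.
  return "mock";
#else
  // Default configuration: llama-specific headers and calls may be used here.
  return "llama";
#endif
}

// Hypothetical helper: true only when the DEBUG definition is present,
// which the CMake snippet adds for the Debug configuration.
constexpr bool IsDebugBuild() {
#ifdef DEBUG
  return true;
#else
  return false;
#endif
}
```

Keeping the guards inside small functions like these, rather than scattering `#ifdef` blocks through call sites, is one possible way to limit how much code depends on the build flags.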

tooling/pipeline/docs Symbolic link

@@ -0,0 +1 @@
../../docs/pipeline/


@@ -11,11 +11,9 @@
#include <vector> #include <vector>
#include "data_generation/data_generator.h" #include "data_generation/data_generator.h"
#include "data_model/enriched_city.h" #include "data_model/generated_models.h"
#include "data_model/generated_brewery.h" #include "services/database/export_service.h"
#include "data_model/location.h" #include "services/enrichment/enrichment_service.h"
#include "services/enrichment_service.h"
#include "services/export_service.h"
/** /**
* @brief Main data generator class for the Biergarten pipeline. * @brief Main data generator class for the Biergarten pipeline.
@@ -34,7 +32,8 @@ class BiergartenDataGenerator {
*/ */
BiergartenDataGenerator(std::unique_ptr<IEnrichmentService> context_service, BiergartenDataGenerator(std::unique_ptr<IEnrichmentService> context_service,
std::unique_ptr<DataGenerator> generator, std::unique_ptr<DataGenerator> generator,
std::unique_ptr<IExportService> exporter); std::unique_ptr<IExportService> exporter,
const ApplicationOptions& application_options);
/** /**
* @brief Run the data generation pipeline. * @brief Run the data generation pipeline.
@@ -58,12 +57,14 @@ class BiergartenDataGenerator {
/// @brief Storage backend for generated brewery records. /// @brief Storage backend for generated brewery records.
std::unique_ptr<IExportService> exporter_; std::unique_ptr<IExportService> exporter_;
const ApplicationOptions application_options_;
/** /**
* @brief Load locations from JSON and sample cities. * @brief Load locations from JSON and sample cities.
* *
* @return Vector of sampled locations capped at 50 entries. * @return Vector of sampled locations capped at 50 entries.
*/ */
static std::vector<Location> QueryCitiesWithCountries(); std::vector<Location> QueryCitiesWithCountries();
/** /**
* @brief Generate breweries for enriched cities. * @brief Generate breweries for enriched cities.


@@ -8,9 +8,7 @@
#include <string> #include <string>
#include "data_model/brewery_result.h" #include "data_model/generated_models.h"
#include "data_model/location.h"
#include "data_model/user_result.h"
/** /**
* @brief Interface for data generator implementations. * @brief Interface for data generator implementations.


@@ -14,9 +14,10 @@
#include <string> #include <string>
#include <string_view> #include <string_view>
#include "../services/prompting/prompt_directory.h"
#include "data_generation/data_generator.h" #include "data_generation/data_generator.h"
#include "data_generation/prompt_formatting/prompt_formatter.h" #include "data_generation/prompt_formatting/prompt_formatter.h"
#include "data_model/application_options.h" #include "data_model/models.h"
struct llama_model; struct llama_model;
struct llama_context; struct llama_context;
@@ -33,10 +34,12 @@ class LlamaGenerator final : public DataGenerator {
* @param options Parsed application options. * @param options Parsed application options.
* @param model_path Filesystem path to GGUF model assets. * @param model_path Filesystem path to GGUF model assets.
* @param prompt_formatter Formatter that produces model-specific prompts. * @param prompt_formatter Formatter that produces model-specific prompts.
* @param prompt_directory Directory service for loading named prompt files.
*/ */
LlamaGenerator(const ApplicationOptions& options, LlamaGenerator(const ApplicationOptions& options,
const std::string& model_path, const std::string& model_path,
std::unique_ptr<IPromptFormatter> prompt_formatter); std::unique_ptr<IPromptFormatter> prompt_formatter,
std::unique_ptr<IPromptDirectory> prompt_directory);
~LlamaGenerator() override; ~LlamaGenerator() override;
@@ -119,15 +122,6 @@ class LlamaGenerator final : public DataGenerator {
int max_tokens = kDefaultMaxTokens, int max_tokens = kDefaultMaxTokens,
std::string_view grammar = {}); std::string_view grammar = {});
/**
* @brief Loads the brewery system prompt from disk.
*
* @param prompt_file_path Prompt file path to try first.
* @return Loaded prompt text.
*/
std::string LoadBrewerySystemPrompt(
const std::filesystem::path& prompt_file_path);
ModelHandle model_; ModelHandle model_;
ContextHandle context_; ContextHandle context_;
float sampling_temperature_ = 1.0F; float sampling_temperature_ = 1.0F;
@@ -135,8 +129,9 @@ class LlamaGenerator final : public DataGenerator {
uint32_t sampling_top_k_ = kDefaultSamplingTopK; uint32_t sampling_top_k_ = kDefaultSamplingTopK;
std::mt19937 rng_; std::mt19937 rng_;
uint32_t n_ctx_ = kDefaultContextSize; uint32_t n_ctx_ = kDefaultContextSize;
std::string brewery_system_prompt_; int n_gpu_layers_ = 0;
std::unique_ptr<IPromptFormatter> prompt_formatter_; std::unique_ptr<IPromptFormatter> prompt_formatter_;
std::unique_ptr<IPromptDirectory> prompt_directory_;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_LLAMA_GENERATOR_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_LLAMA_GENERATOR_H_


@@ -12,7 +12,7 @@
#include <string> #include <string>
#include <string_view> #include <string_view>
#include "data_model/brewery_result.h" #include "data_model/generated_models.h"
struct llama_vocab; struct llama_vocab;
using llama_token = int32_t; using llama_token = int32_t;


@@ -44,6 +44,13 @@ class MockGenerator final : public DataGenerator {
*/ */
static size_t DeterministicHash(const Location& location); static size_t DeterministicHash(const Location& location);
// Hash stride constants for deterministic distribution across fixed-size
// arrays. These coprime strides spread hash values uniformly without
// clustering, ensuring diverse output across different hash inputs.
static constexpr size_t kNounHashStride = 7;
static constexpr size_t kDescriptionHashStride = 13;
static constexpr size_t kBioHashStride = 11;
static constexpr std::array<std::string_view, 18> kBreweryAdjectives = { static constexpr std::array<std::string_view, 18> kBreweryAdjectives = {
"Craft", "Heritage", "Local", "Artisan", "Pioneer", "Golden", "Craft", "Heritage", "Local", "Artisan", "Pioneer", "Golden",
"Modern", "Classic", "Summit", "Northern", "Riverstone", "Barrel", "Modern", "Classic", "Summit", "Northern", "Riverstone", "Barrel",

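The diff does not show how the stride constants are applied, so the following is only a sketch of one common pattern: multiply the hash by a stride that is coprime to the array size, then reduce modulo the size. `StridedIndex`, `PickNoun`, and `kExampleNouns` are hypothetical names introduced for illustration; only `kNounHashStride = 7` comes from the header.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

// Stride value copied from the diff; everything else below is hypothetical.
constexpr std::size_t kNounHashStride = 7;

// Illustrative stand-in array; the real word lists are not fully shown.
constexpr std::array<std::string_view, 5> kExampleNouns = {
    "Brewing", "Works", "Cellar", "Taproom", "Alehouse"};

// Maps a hash to an index in [0, N). When gcd(stride, N) == 1, distinct
// residues of `hash` modulo N map to distinct indices, so every slot of the
// array stays reachable.
template <std::size_t N>
constexpr std::size_t StridedIndex(std::size_t hash, std::size_t stride) {
  return ((hash % N) * stride) % N;
}

// Picks a noun deterministically from the stand-in array.
constexpr std::string_view PickNoun(std::size_t hash) {
  return kExampleNouns[StridedIndex<kExampleNouns.size()>(hash,
                                                          kNounHashStride)];
}
```

Using a different stride per field (noun, description, bio) means the same location hash would select unrelated positions in each array, which is presumably the intent behind having three separate constants.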

@@ -1,4 +1,5 @@
#pragma once #ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_GEMMA4_JINJA_PROMPT_FORMATTER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_GEMMA4_JINJA_PROMPT_FORMATTER_H_
#include <string> #include <string>
#include <string_view> #include <string_view>
@@ -13,3 +14,5 @@ class Gemma4JinjaPromptFormatter final : public IPromptFormatter {
[[nodiscard]] std::string Format(std::string_view system_prompt, [[nodiscard]] std::string Format(std::string_view system_prompt,
std::string_view user_prompt) const override; std::string_view user_prompt) const override;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_GEMMA4_JINJA_PROMPT_FORMATTER_H_


@@ -1,4 +1,5 @@
#pragma once #ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_PROMPT_FORMATTER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_PROMPT_FORMATTER_H_
#include <string> #include <string>
#include <string_view> #include <string_view>
@@ -15,3 +16,5 @@ class IPromptFormatter {
[[nodiscard]] virtual std::string Format( [[nodiscard]] virtual std::string Format(
std::string_view system_prompt, std::string_view user_prompt) const = 0; std::string_view system_prompt, std::string_view user_prompt) const = 0;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_PROMPT_FORMATTER_H_


@@ -1,42 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_APPLICATION_OPTIONS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_APPLICATION_OPTIONS_H_
/**
* @file data_model/application_options.h
* @brief Program options for the Biergarten pipeline application.
*/
#include <cstdint>
#include <string>
/**
* @brief Program options for the Biergarten pipeline application.
*/
struct ApplicationOptions {
/// @brief Path to the LLM model file (gguf format); mutually exclusive with
/// use_mocked.
std::string model_path;
/// @brief Use mocked generator instead of LLM; mutually exclusive with
/// model_path.
bool use_mocked = false;
/// @brief LLM sampling temperature (0.0 to 1.0, higher = more random).
float temperature = 1.0F;
/// @brief LLM nucleus sampling top-p parameter (0.0 to 1.0, higher = more
/// random).
float top_p = 0.95F;
/// @brief LLM top-k sampling parameter.
uint32_t top_k = 64;
/// @brief Context window size (tokens) for LLM inference. Higher values
/// support longer prompts but use more memory.
uint32_t n_ctx = 8192;
/// @brief Random seed for sampling (-1 for random, otherwise non-negative).
int seed = -1;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_APPLICATION_OPTIONS_H_


@@ -1,22 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_LOCATION_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_LOCATION_H_
/**
* @file data_model/brewery_location.h
* @brief Non-owning brewery location input.
*/
#include <string_view>
/**
* @brief Non-owning brewery location input.
*/
struct BreweryLocation {
/// @brief City name.
std::string_view city_name;
/// @brief Country name.
std::string_view country_name;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_LOCATION_H_


@@ -1,28 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_RESULT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_RESULT_H_
/**
* @file data_model/brewery_result.h
* @brief Generated brewery payload.
*/
#include <string>
/**
* @brief Generated brewery payload.
*/
struct BreweryResult {
/// @brief Brewery display name in English.
std::string name_en;
/// @brief Brewery description text in English.
std::string description_en;
/// @brief Brewery display name in the local language.
std::string name_local;
/// @brief Brewery description text in the local language.
std::string description_local;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_RESULT_H_


@@ -1,21 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_ENRICHED_CITY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_ENRICHED_CITY_H_
/**
* @file data_model/enriched_city.h
* @brief Enriched city data with Wikipedia context.
*/
#include <string>
#include "data_model/location.h"
/**
* @brief Enriched city data with Wikipedia context.
*/
struct EnrichedCity {
Location location;
std::string region_context{};
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_ENRICHED_CITY_H_


@@ -1,20 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_BREWERY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_BREWERY_H_
/**
* @file data_model/generated_brewery.h
* @brief Helper struct to store generated brewery data.
*/
#include "data_model/brewery_result.h"
#include "data_model/location.h"
/**
* @brief Helper struct to store generated brewery data.
*/
struct GeneratedBrewery {
Location location;
BreweryResult brewery;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_BREWERY_H_


@@ -0,0 +1,66 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_MODELS_H_
/**
* @file data_model/generated_models.h
* @brief Generated output models from the pipeline: brewery/user results, enriched data,
* and complete generation results.
*/
#include <string>
#include "data_model/models.h"
// ============================================================================
// Generation Output Models
// ============================================================================
/**
* @brief Generated brewery payload.
*/
struct BreweryResult {
/// @brief Brewery display name in English.
std::string name_en;
/// @brief Brewery description text in English.
std::string description_en;
/// @brief Brewery display name in the local language.
std::string name_local;
/// @brief Brewery description text in the local language.
std::string description_local;
};
/**
* @brief Generated user profile payload.
*/
struct UserResult {
/// @brief Username handle.
std::string username{};
/// @brief Short user biography.
std::string bio{};
};
// ============================================================================
// Pipeline Data Models
// ============================================================================
/**
* @brief Enriched city data with Wikipedia context.
*/
struct EnrichedCity {
Location location;
std::string region_context{};
};
/**
* @brief Helper struct to store generated brewery data.
*/
struct GeneratedBrewery {
Location location;
BreweryResult brewery;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_MODELS_H_


@@ -1,13 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATION_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATION_MODELS_H_
/**
* @file data_model/generation_models.h
* @brief Convenience include for shared generation payload models.
*/
#include "data_model/brewery_location.h"
#include "data_model/brewery_result.h"
#include "data_model/user_result.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATION_MODELS_H_


@@ -1,41 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_LOCATION_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_LOCATION_H_
/**
* @file data_model/location.h
* @brief Location data model used throughout generation pipeline.
*/
#include <string>
#include <vector>
/**
* @brief Canonical location record for city-level generation.
*/
struct Location {
/// @brief City name.
std::string city{};
/// @brief State or province name.
std::string state_province{};
/// @brief ISO 3166-2 subdivision code.
std::string iso3166_2{};
/// @brief Country name.
std::string country{};
/// @brief ISO 3166-1 country code.
std::string iso3166_1{};
/// @brief Local language codes in priority order.
std::vector<std::string> local_languages{};
/// @brief Latitude in decimal degrees.
double latitude{};
/// @brief Longitude in decimal degrees.
double longitude{};
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_LOCATION_H_


@@ -0,0 +1,141 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_MODELS_H_
/**
* @file data_model/models.h
* @brief Core data models: locations, application configuration, and generation
* inputs.
*/
#include <boost/program_options.hpp>
#include <cstdint>
#include <filesystem>
#include <optional>
#include <string>
#include <string_view>
#include <vector>
namespace prog_opts = boost::program_options;
// ============================================================================
// Location Models
// ============================================================================
/**
* @brief Canonical location record for city-level generation.
*/
struct Location {
/// @brief City name.
std::string city{};
/// @brief State or province name.
std::string state_province{};
/// @brief ISO 3166-2 subdivision code.
std::string iso3166_2{};
/// @brief Country name.
std::string country{};
/// @brief ISO 3166-1 country code.
std::string iso3166_1{};
/// @brief Local language codes in priority order.
std::vector<std::string> local_languages{};
/// @brief Latitude in decimal degrees.
double latitude{};
/// @brief Longitude in decimal degrees.
double longitude{};
};
/**
* @brief Non-owning brewery location input.
*/
struct BreweryLocation {
/// @brief City name.
std::string_view city_name;
/// @brief Country name.
std::string_view country_name;
};
// ============================================================================
// Configuration Models
// ============================================================================
/**
* @brief LLM sampling parameters.
*/
struct SamplingOptions {
/// @brief LLM sampling temperature (0.0 to 1.0, higher = more random).
float temperature = 1.0F;
/// @brief LLM nucleus sampling top-p parameter.
float top_p = 0.95F;
/// @brief LLM top-k sampling parameter.
uint32_t top_k = 64;
/// @brief Context window size (tokens).
uint32_t n_ctx = 8192;
/// @brief Random seed (-1 for random, otherwise non-negative).
int seed = -1;
/// @brief Number of layers to offload to GPU.
int n_gpu_layers = 0;
};
/**
* @brief Configuration for the LLM generator component.
*/
struct GeneratorOptions {
/// @brief Path to the LLM model file (gguf format).
std::filesystem::path model_path;
/// @brief Use mocked generator instead of actual LLM inference.
bool use_mocked = false;
/// @brief Specific sampling parameters for this generator.
/// If nullopt, the application should use global defaults.
std::optional<SamplingOptions> sampling;
};
/**
* @brief Configuration for the pipeline execution and output.
*/
struct PipelineOptions {
/// @brief Directory for generated artifacts.
std::filesystem::path output_path;
/// @brief Directory that contains named prompt files (e.g.
/// BREWERY_GENERATION.md).
std::filesystem::path prompt_dir;
/// @brief Path for application logs.
std::filesystem::path log_path;
/// @brief Number of locations to sample from the dataset.
/// More locations yield more users and more breweries.
uint32_t location_count;
};
/**
* @brief Root configuration object for the Biergarten pipeline.
*/
struct ApplicationOptions {
GeneratorOptions generator;
PipelineOptions pipeline;
};
// ============================================================================
// Function Declarations
// ============================================================================
std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv);
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_MODELS_H_
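The comment on `GeneratorOptions::sampling` says that a `nullopt` value means the application should fall back to global defaults. A minimal sketch of that fallback, assuming the defaults are simply the in-class initializers of `SamplingOptions`. `ResolveSampling` is a hypothetical helper, and the structs are trimmed copies (for example, `model_path` is omitted) so the snippet stands alone.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Trimmed copy of SamplingOptions from data_model/models.h.
struct SamplingOptions {
  float temperature = 1.0F;
  float top_p = 0.95F;
  uint32_t top_k = 64;
  uint32_t n_ctx = 8192;
  int seed = -1;
  int n_gpu_layers = 0;
};

// Trimmed copy of GeneratorOptions; model_path is left out of this sketch.
struct GeneratorOptions {
  bool use_mocked = false;
  std::optional<SamplingOptions> sampling;
};

// Hypothetical helper: returns the generator-specific sampling parameters if
// present, otherwise a default-constructed SamplingOptions.
SamplingOptions ResolveSampling(const GeneratorOptions& generator) {
  return generator.sampling.value_or(SamplingOptions{});
}
```

Resolving the optional once, near startup, would let downstream code work with a plain `SamplingOptions` value instead of re-checking `has_value()` at each use.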


@@ -1,12 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_PIPELINE_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_PIPELINE_MODELS_H_
/**
* @file data_model/pipeline_models.h
* @brief Convenience include for pipeline-specific data models.
*/
#include "data_model/enriched_city.h"
#include "data_model/generated_brewery.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_PIPELINE_MODELS_H_


@@ -1,22 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_USER_RESULT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_USER_RESULT_H_
/**
* @file data_model/user_result.h
* @brief Generated user profile payload.
*/
#include <string>
/**
* @brief Generated user profile payload.
*/
struct UserResult {
/// @brief Username handle.
std::string username{};
/// @brief Short user biography.
std::string bio{};
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_USER_RESULT_H_


@@ -9,7 +9,7 @@
#include <filesystem> #include <filesystem>
#include <vector> #include <vector>
#include "data_model/location.h" #include "data_model/models.h"
/// @brief Loads curated world locations from a JSON file into memory. /// @brief Loads curated world locations from a JSON file into memory.
class JsonLoader { class JsonLoader {


@@ -1,12 +1,14 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_EXPORT_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_EXPORT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_EXPORT_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_EXPORT_SERVICE_H_
/** /**
* @file services/export_service.h * @file services/export_service.h
* @brief Abstraction for persisting generated brewery data. * @brief Abstraction for persisting generated brewery data.
*/ */
#include "data_model/generated_brewery.h" #include <cstdint>
#include "data_model/generated_models.h"
/** /**
* @brief Interface for services that persist generated brewery records. * @brief Interface for services that persist generated brewery records.
@@ -31,10 +33,10 @@ class IExportService {
* *
* @param brewery Generated brewery payload to store. * @param brewery Generated brewery payload to store.
*/ */
virtual void ProcessRecord(const GeneratedBrewery& brewery) = 0; virtual uint64_t ProcessRecord(const GeneratedBrewery& brewery) = 0;
/// @brief Finalizes the export destination. /// @brief Finalizes the export destination.
virtual void Finalize() = 0; virtual void Finalize() = 0;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_EXPORT_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_EXPORT_SERVICE_H_


@@ -0,0 +1,30 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_CONNECTION_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_CONNECTION_HELPERS_H_
/**
* @file services/sqlite_connection_helpers.h
* @brief Declarations for connection-level SQLite helper functions.
*/
#include <sqlite3.h>
#include <filesystem>
#include <string>
#include <string_view>
#include "sqlite_handle_types.h"
namespace sqlite_export_service_internal {
void ThrowSqliteError(sqlite3* db_handle, std::string_view action);
SqliteDatabaseHandle OpenDatabase(const std::filesystem::path& path);
void ExecSql(const SqliteDatabaseHandle& db_handle, std::string_view sql,
const char* action);
void RollbackTransactionNoThrow(const SqliteDatabaseHandle& db_handle) noexcept;
} // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_CONNECTION_HELPERS_H_


@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_H_
/** /**
* @file services/sqlite_export_service.h * @file services/sqlite_export_service.h
@@ -11,16 +11,17 @@
#include <string> #include <string>
#include <unordered_map> #include <unordered_map>
#include "services/date_time_provider.h" #include "data_model/models.h"
#include "services/export_service.h" #include "../datetime/date_time_provider.h"
#include "services/sqlite_export_service_helpers.h" #include "export_service.h"
#include "sqlite_export_service_helpers.h"
/** /**
* @brief Persists generated brewery records into a fresh SQLite database. * @brief Persists generated brewery records into a fresh SQLite database.
*/ */
class SqliteExportService final : public IExportService { class SqliteExportService final : public IExportService {
public: public:
SqliteExportService(); explicit SqliteExportService(const ApplicationOptions& options);
~SqliteExportService() override; ~SqliteExportService() override;
SqliteExportService(const SqliteExportService&) = delete; SqliteExportService(const SqliteExportService&) = delete;
@@ -29,7 +30,7 @@ class SqliteExportService final : public IExportService {
SqliteExportService& operator=(SqliteExportService&&) = delete; SqliteExportService& operator=(SqliteExportService&&) = delete;
void Initialize() override; void Initialize() override;
void ProcessRecord(const GeneratedBrewery& brewery) override; uint64_t ProcessRecord(const GeneratedBrewery& brewery) override;
void Finalize() override; void Finalize() override;
private: private:
@@ -38,15 +39,15 @@ class SqliteExportService final : public IExportService {
using SqliteStatementHandle = using SqliteStatementHandle =
sqlite_export_service_internal::SqliteStatementHandle; sqlite_export_service_internal::SqliteStatementHandle;
void InitializeSchema(); void InitializeSchema() const;
void PrepareStatements(); void PrepareStatements();
void RollbackAndCloseNoThrow() noexcept; void RollbackAndCloseNoThrow() noexcept;
void FinalizeStatements() noexcept;
[[nodiscard]] std::filesystem::path BuildDatabasePath() const; [[nodiscard]] std::filesystem::path BuildDatabasePath() const;
[[nodiscard]] static std::string BuildLocationKey(const Location& location); [[nodiscard]] static std::string BuildLocationKey(const Location& location);
std::unique_ptr<IDateTimeProvider> date_time_provider_; std::unique_ptr<IDateTimeProvider> date_time_provider_;
std::filesystem::path output_path_;
std::string run_timestamp_utc_; std::string run_timestamp_utc_;
std::filesystem::path database_path_; std::filesystem::path database_path_;
SqliteDatabaseHandle db_handle_; SqliteDatabaseHandle db_handle_;
@@ -56,4 +57,4 @@ class SqliteExportService final : public IExportService {
std::unordered_map<std::string, sqlite3_int64> location_cache_; std::unordered_map<std::string, sqlite3_int64> location_cache_;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_H_


@@ -0,0 +1,10 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_HELPERS_H_
/* Umbrella header for backward compatibility. */
#include "sqlite_connection_helpers.h"
#include "sqlite_handle_types.h"
#include "sqlite_statement_helpers.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_HELPERS_H_


@@ -0,0 +1,36 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_HANDLE_TYPES_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_HANDLE_TYPES_H_
/**
* Shared handle and parameter type declarations used by SQLite helper units.
*/
#include <sqlite3.h>
#include <memory>
#include <string_view>
namespace sqlite_export_service_internal {
struct SqliteDatabaseDeleter {
void operator()(sqlite3* handle) const noexcept;
};
struct SqliteStatementDeleter {
void operator()(sqlite3_stmt* statement) const noexcept;
};
using SqliteDatabaseHandle = std::unique_ptr<sqlite3, SqliteDatabaseDeleter>;
using SqliteStatementHandle =
std::unique_ptr<sqlite3_stmt, SqliteStatementDeleter>;
template <typename T>
struct BindParam {
int index;
T value;
std::string_view action;
};
} // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_HANDLE_TYPES_H_
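The handle aliases above pair `std::unique_ptr` with a stateless deleter struct so that a C-style handle is released automatically. The sketch below shows the same pattern with a stand-in resource, so it builds without SQLite. `FakeHandle`, `FakeOpen`, `FakeClose`, and `g_close_calls` are hypothetical; in the real header the deleter bodies are defined elsewhere and are not shown in this diff.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for an opaque C handle such as sqlite3*.
struct FakeHandle {
  int id;
};

int g_close_calls = 0;  // Counts how often the deleter released a handle.

FakeHandle* FakeOpen(int id) { return new FakeHandle{id}; }

void FakeClose(FakeHandle* handle) {
  ++g_close_calls;
  delete handle;
}

// Same shape as SqliteDatabaseDeleter: a stateless functor that unique_ptr
// invokes when the owner goes out of scope or is reset.
struct FakeHandleDeleter {
  void operator()(FakeHandle* handle) const noexcept { FakeClose(handle); }
};

using FakeHandleOwner = std::unique_ptr<FakeHandle, FakeHandleDeleter>;

// Opens a handle, reads from it, and relies on RAII to close it on return.
int OpenUseAndClose() {
  FakeHandleOwner owner(FakeOpen(7));
  return owner->id;
}
```

Because the deleter is an empty struct rather than a function pointer, the owning pointer typically stays the size of a raw pointer.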


@@ -0,0 +1,116 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_STATEMENT_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_STATEMENT_HELPERS_H_
/**
* @file services/sqlite_statement_helpers.h
* @brief Declarations for statement-level SQLite helper functions and
* constants.
*/
#include <sqlite3.h>
#include <string>
#include <string_view>
#include <vector>
#include "sqlite_handle_types.h"
namespace sqlite_export_service_internal {
inline constexpr std::string_view kCreateLocationsTableSql = R"sql(
CREATE TABLE IF NOT EXISTS locations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
city TEXT NOT NULL,
state_province TEXT NOT NULL,
iso3166_2 TEXT NOT NULL,
country TEXT NOT NULL,
iso3166_1 TEXT NOT NULL,
local_languages_json TEXT NOT NULL,
latitude REAL NOT NULL,
longitude REAL NOT NULL,
UNIQUE(city, state_province, iso3166_2, country, latitude, longitude)
);
)sql";
inline constexpr std::string_view kCreateBreweriesTableSql = R"sql(
CREATE TABLE IF NOT EXISTS breweries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
location_id INTEGER NOT NULL,
name_en TEXT NOT NULL,
description_en TEXT NOT NULL,
name_local TEXT NOT NULL,
description_local TEXT NOT NULL,
FOREIGN KEY(location_id) REFERENCES locations(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_breweries_location_id ON breweries(location_id);
)sql";
inline constexpr std::string_view kInsertLocationSql = R"sql(
INSERT INTO locations (
city,
state_province,
iso3166_2,
country,
iso3166_1,
local_languages_json,
latitude,
longitude
) VALUES (?, ?, ?, ?, ?, ?, ?, ?);
)sql";
inline constexpr std::string_view kInsertBrewerySql = R"sql(
INSERT INTO breweries (
location_id,
name_en,
description_en,
name_local,
description_local
) VALUES (?, ?, ?, ?, ?);
)sql";
inline constexpr int kLocationCityBindIndex = 1;
inline constexpr int kLocationStateProvinceBindIndex = 2;
inline constexpr int kLocationIso31662BindIndex = 3;
inline constexpr int kLocationCountryBindIndex = 4;
inline constexpr int kLocationIso31661BindIndex = 5;
inline constexpr int kLocationLanguagesBindIndex = 6;
inline constexpr int kLocationLatitudeBindIndex = 7;
inline constexpr int kLocationLongitudeBindIndex = 8;
inline constexpr int kBreweryLocationIdBindIndex = 1;
inline constexpr int kBreweryEnglishNameBindIndex = 2;
inline constexpr int kBreweryEnglishDescriptionBindIndex = 3;
inline constexpr int kBreweryLocalNameBindIndex = 4;
inline constexpr int kBreweryLocalDescriptionBindIndex = 5;
SqliteStatementHandle PrepareStatement(const SqliteDatabaseHandle& db_handle,
std::string_view sql,
const char* action);
void ResetStatement(SqliteStatementHandle& statement);
void Bind(const SqliteStatementHandle& statement,
const BindParam<std::string_view>& param);
void Bind(const SqliteStatementHandle& statement,
const BindParam<double>& param);
void Bind(const SqliteStatementHandle& statement,
const BindParam<sqlite3_int64>& param);
void StepStatement(const SqliteDatabaseHandle& db_handle,
const SqliteStatementHandle& statement,
std::string_view action);
sqlite3_int64 LastInsertRowId(const SqliteDatabaseHandle& db_handle);
std::string SerializeVector(const std::vector<std::string>& str_vec);
} // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_STATEMENT_HELPERS_H_


@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATE_TIME_PROVIDER_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_DATE_TIME_PROVIDER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATE_TIME_PROVIDER_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_DATE_TIME_PROVIDER_H_
/** /**
* @file services/date_time_provider.h * @file services/date_time_provider.h
@@ -63,4 +63,4 @@ class SystemDateTimeProvider final : public IDateTimeProvider {
} }
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATE_TIME_PROVIDER_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_DATE_TIME_PROVIDER_H_


@@ -0,0 +1,35 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_TIMER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_TIMER_H_
#include <chrono>
/**
* @file services/timer.h
* @brief Simple timer utility for measuring elapsed time.
*/
class Timer {
std::chrono::steady_clock::time_point start_time =
std::chrono::steady_clock::now();
public:
Timer(const Timer&) = delete;
Timer& operator=(const Timer&) = delete;
Timer(Timer&&) = delete;
Timer& operator=(Timer&&) = delete;
Timer() = default;
~Timer() = default;
[[nodiscard]] int64_t Elapsed() const {
return std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::steady_clock::now() - start_time)
.count();
}
[[nodiscard]] int64_t Reset() {
auto previous_elapsed = Elapsed();
start_time = std::chrono::steady_clock::now();
return previous_elapsed;
}
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_TIMER_H_
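A short usage sketch for the `Timer` utility. The class body is reproduced from the header so the snippet stands alone, with `<cstdint>` included explicitly for `int64_t`. `TimeMs` is a hypothetical helper that is not part of this change.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Reproduced from services/datetime/timer.h for a self-contained example.
class Timer {
  std::chrono::steady_clock::time_point start_time =
      std::chrono::steady_clock::now();

 public:
  Timer() = default;
  Timer(const Timer&) = delete;
  Timer& operator=(const Timer&) = delete;

  // Milliseconds since construction or the last Reset().
  [[nodiscard]] int64_t Elapsed() const {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - start_time)
        .count();
  }

  // Restarts the timer and returns the milliseconds measured so far.
  [[nodiscard]] int64_t Reset() {
    auto previous_elapsed = Elapsed();
    start_time = std::chrono::steady_clock::now();
    return previous_elapsed;
  }
};

// Hypothetical helper: runs `work` and returns how long it took in ms.
template <typename Fn>
int64_t TimeMs(Fn&& work) {
  Timer timer;
  work();
  return timer.Elapsed();
}
```

Since `Reset()` returns the previous reading, a single `Timer` could be reused to time consecutive pipeline stages, for example enrichment followed by generation.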


@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_ENRICHMENT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_ENRICHMENT_SERVICE_H_
/** /**
* @file services/enrichment_service.h * @file services/enrichment_service.h
@@ -8,7 +8,7 @@
#include <string> #include <string>
#include "data_model/location.h" #include "data_model/models.h"
/** /**
* @brief Interface for services that can enrich a location with context. * @brief Interface for services that can enrich a location with context.
@@ -27,4 +27,4 @@ class IEnrichmentService {
virtual std::string GetLocationContext(const Location& loc) = 0; virtual std::string GetLocationContext(const Location& loc) = 0;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_ENRICHMENT_SERVICE_H_

View File

@@ -0,0 +1,17 @@
//
// Created by aaronpo on 13/05/2026.
//
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_MOCK_ENRICHMENT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_MOCK_ENRICHMENT_H_
#include <string>
#include "enrichment_service.h"
class MockEnrichmentService final : public IEnrichmentService {
public:
std::string GetLocationContext(const Location& /*loc*/) override {
return {};
}
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_MOCK_ENRICHMENT_H_
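A sketch of how the mock can stand in for a real enrichment backend behind the `IEnrichmentService` interface. The `Location` struct here is a trimmed, hypothetical stand-in for the model in `data_model/models.h`, and `ContextOrPlaceholder` is an illustrative consumer rather than code from this change.

```cpp
#include <string>

// Hypothetical stand-in for the real Location model.
struct Location {
  std::string city;
  std::string country;
};

// Same shape as the interface in this diff.
class IEnrichmentService {
 public:
  virtual ~IEnrichmentService() = default;
  virtual std::string GetLocationContext(const Location& loc) = 0;
};

// Same behaviour as the mock in this diff: always returns an empty context.
class MockEnrichmentService final : public IEnrichmentService {
 public:
  std::string GetLocationContext(const Location& /*loc*/) override {
    return {};
  }
};

// Illustrative consumer: depends only on the interface, so the mock and a
// network-backed service are interchangeable.
inline std::string ContextOrPlaceholder(IEnrichmentService& service,
                                        const Location& loc) {
  const std::string context = service.GetLocationContext(loc);
  return context.empty() ? std::string("(no context)") : context;
}
```

Callers that treat an empty string as "no enrichment available" need no special casing for mocked runs.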

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_WIKIPEDIA_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_WIKIPEDIA_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_WIKIPEDIA_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_WIKIPEDIA_SERVICE_H_
/** /**
* @file services/wikipedia_service.h * @file services/wikipedia_service.h
@@ -11,14 +11,14 @@
#include <string_view> #include <string_view>
#include <unordered_map> #include <unordered_map>
#include "services/enrichment_service.h" #include "enrichment_service.h"
#include "web_client/web_client.h" #include "web_client/web_client.h"
/// @brief Provides Wikipedia summary lookups backed by cached raw extracts. /// @brief Provides Wikipedia summary lookups backed by cached raw extracts.
class WikipediaService final : public IEnrichmentService { class WikipediaEnrichmentService final : public IEnrichmentService {
public: public:
/// @brief Creates a new Wikipedia service with the provided web client. /// @brief Creates a new Wikipedia service with the provided web client.
explicit WikipediaService(std::unique_ptr<WebClient> client); explicit WikipediaEnrichmentService(std::unique_ptr<WebClient> client);
/// @brief Returns the Wikipedia-derived context for a location. /// @brief Returns the Wikipedia-derived context for a location.
[[nodiscard]] std::string GetLocationContext(const Location& loc) override; [[nodiscard]] std::string GetLocationContext(const Location& loc) override;
@@ -30,4 +30,4 @@ class WikipediaService final : public IEnrichmentService {
std::unordered_map<std::string, std::string> extract_cache_; std::unordered_map<std::string, std::string> extract_cache_;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_WIKIPEDIA_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_WIKIPEDIA_SERVICE_H_

View File

@@ -0,0 +1,76 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPTING_PROMPT_DIRECTORY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPTING_PROMPT_DIRECTORY_H_
/**
* @file services/prompting/prompt_directory.h
* @brief Interface and filesystem-backed implementation for named prompt
* loading.
*
* Prompt files are resolved by key: a key of "BREWERY_GENERATION" maps to the
* file <prompt_dir>/BREWERY_GENERATION.md. The interface is kept intentionally
* narrow so test doubles can be injected without touching the filesystem.
*/
#include <filesystem>
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_map>
/**
* @brief Interface for loading named prompt files.
*/
class IPromptDirectory {
public:
IPromptDirectory() = default;
IPromptDirectory(const IPromptDirectory&) = delete;
IPromptDirectory& operator=(const IPromptDirectory&) = delete;
IPromptDirectory(IPromptDirectory&&) = delete;
IPromptDirectory& operator=(IPromptDirectory&&) = delete;
virtual ~IPromptDirectory() = default;
/**
* @brief Loads the prompt associated with @p key.
*
* @param key Logical prompt key, e.g. "BREWERY_GENERATION".
* @return Prompt text.
* @throws std::runtime_error if the prompt file cannot be found or read.
*/
[[nodiscard]] virtual std::string Load(std::string_view key) = 0;
};
/**
* @brief Filesystem-backed IPromptDirectory implementation.
*
* Each call to Load() checks an in-process cache first, then reads
* <prompt_dir>/<key>.md from disk. The directory must exist and be readable
* at construction time; individual file absence is reported lazily at Load().
*/
class PromptDirectory final : public IPromptDirectory {
public:
/**
* @brief Constructs a PromptDirectory rooted at @p prompt_dir.
*
* @param prompt_dir Absolute or relative path to the prompt directory.
* @throws std::runtime_error if @p prompt_dir does not exist or is not a
* directory.
*/
explicit PromptDirectory(const std::filesystem::path& prompt_dir);
/**
* @brief Loads the prompt for @p key, caching the result.
*
* Maps @p key → <prompt_dir>/<key>.md.
*
* @param key Logical prompt key.
* @return Prompt text.
* @throws std::runtime_error if the file does not exist or is empty.
*/
[[nodiscard]] std::string Load(std::string_view key) override;
private:
std::filesystem::path prompt_dir_;
std::unordered_map<std::string, std::string> cache_;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPTING_PROMPT_DIRECTORY_H_
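The header above only declares `Load()`. One way the documented contract (cache first, then `<prompt_dir>/<key>.md`, throwing on a missing or empty file) could be met is sketched below. The class name `SketchPromptDirectory` and the error messages are assumptions for illustration; the actual implementation file is not shown in this part of the diff.

```cpp
#include <filesystem>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>

// Sketch only: hypothetical name, not the pipeline's PromptDirectory.
class SketchPromptDirectory {
 public:
  explicit SketchPromptDirectory(std::filesystem::path prompt_dir)
      : prompt_dir_(std::move(prompt_dir)) {
    if (!std::filesystem::is_directory(prompt_dir_)) {
      throw std::runtime_error("Prompt directory not found: " +
                               prompt_dir_.string());
    }
  }

  std::string Load(std::string_view key) {
    const std::string key_string(key);
    if (const auto it = cache_.find(key_string); it != cache_.end()) {
      return it->second;  // Cache hit: no filesystem access.
    }

    // Map the key to <prompt_dir>/<key>.md and read the whole file.
    const auto path = prompt_dir_ / (key_string + ".md");
    std::ifstream file(path);
    if (!file) {
      throw std::runtime_error("Prompt file not found: " + path.string());
    }
    std::ostringstream contents;
    contents << file.rdbuf();
    std::string text = contents.str();
    if (text.empty()) {
      throw std::runtime_error("Prompt file is empty: " + path.string());
    }

    cache_.emplace(key_string, text);
    return text;
  }

 private:
  std::filesystem::path prompt_dir_;
  std::unordered_map<std::string, std::string> cache_;
};
```

Only successful loads are cached in this sketch, so a prompt file that appears after a failed `Load()` is picked up by the next call.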

View File

@@ -1,250 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_HELPERS_H_
/**
* @file services/sqlite_export_service_helpers.h
* @brief Internal SQLite export helpers shared across per-method translation
* units.
*/
#include <sqlite3.h>
#include <boost/json.hpp>
#include <cstddef>
#include <cstring>
#include <filesystem>
#include <limits>
#include <memory>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>
namespace sqlite_export_service_internal {
struct SqliteDatabaseDeleter {
void operator()(sqlite3* handle) const noexcept {
if (handle != nullptr) {
sqlite3_close(handle);
}
}
};
struct SqliteStatementDeleter {
void operator()(sqlite3_stmt* statement) const noexcept {
if (statement != nullptr) {
sqlite3_finalize(statement);
}
}
};
using SqliteDatabaseHandle = std::unique_ptr<sqlite3, SqliteDatabaseDeleter>;
using SqliteStatementHandle =
std::unique_ptr<sqlite3_stmt, SqliteStatementDeleter>;
inline constexpr std::string_view kCreateLocationsTableSql = R"sql(
CREATE TABLE IF NOT EXISTS locations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
city TEXT NOT NULL,
state_province TEXT NOT NULL,
iso3166_2 TEXT NOT NULL,
country TEXT NOT NULL,
iso3166_1 TEXT NOT NULL,
local_languages_json TEXT NOT NULL,
latitude REAL NOT NULL,
longitude REAL NOT NULL,
UNIQUE(city, state_province, iso3166_2, country, latitude, longitude)
);
)sql";
inline constexpr std::string_view kCreateBreweriesTableSql = R"sql(
CREATE TABLE IF NOT EXISTS breweries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
location_id INTEGER NOT NULL,
name_en TEXT NOT NULL,
description_en TEXT NOT NULL,
name_local TEXT NOT NULL,
description_local TEXT NOT NULL,
FOREIGN KEY(location_id) REFERENCES locations(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_breweries_location_id ON breweries(location_id);
)sql";
inline constexpr std::string_view kInsertLocationSql = R"sql(
INSERT INTO locations (
city,
state_province,
iso3166_2,
country,
iso3166_1,
local_languages_json,
latitude,
longitude
) VALUES (?, ?, ?, ?, ?, ?, ?, ?);
)sql";
inline constexpr std::string_view kInsertBrewerySql = R"sql(
INSERT INTO breweries (
location_id,
name_en,
description_en,
name_local,
description_local
) VALUES (?, ?, ?, ?, ?);
)sql";
inline constexpr int kLocationCityBindIndex = 1;
inline constexpr int kLocationStateProvinceBindIndex = 2;
inline constexpr int kLocationIso31662BindIndex = 3;
inline constexpr int kLocationCountryBindIndex = 4;
inline constexpr int kLocationIso31661BindIndex = 5;
inline constexpr int kLocationLanguagesBindIndex = 6;
inline constexpr int kLocationLatitudeBindIndex = 7;
inline constexpr int kLocationLongitudeBindIndex = 8;
inline constexpr int kBreweryLocationIdBindIndex = 1;
inline constexpr int kBreweryEnglishNameBindIndex = 2;
inline constexpr int kBreweryEnglishDescriptionBindIndex = 3;
inline constexpr int kBreweryLocalNameBindIndex = 4;
inline constexpr int kBreweryLocalDescriptionBindIndex = 5;
inline void ThrowSqliteError(sqlite3* db_handle, std::string_view action) {
const std::string message =
db_handle != nullptr ? sqlite3_errmsg(db_handle) : "unknown SQLite error";
throw std::runtime_error(std::string(action) + ": " + message);
}
inline SqliteDatabaseHandle OpenDatabase(const std::filesystem::path& path) {
sqlite3* raw_handle = nullptr;
const std::string path_string = path.string();
const int result = sqlite3_open(path_string.c_str(), &raw_handle);
SqliteDatabaseHandle handle(raw_handle);
if (result != SQLITE_OK) {
const std::string message = raw_handle != nullptr
? sqlite3_errmsg(raw_handle)
: "unknown SQLite error";
throw std::runtime_error("Failed to open SQLite export database: " +
message);
}
return handle;
}
inline void ExecSql(const SqliteDatabaseHandle& db_handle, std::string_view sql,
const char* action) {
char* error_message = nullptr;
const std::string sql_text(sql);
const int result = sqlite3_exec(db_handle.get(), sql_text.c_str(), nullptr,
nullptr, &error_message);
if (result != SQLITE_OK) {
const std::string message = error_message != nullptr
? error_message
: sqlite3_errmsg(db_handle.get());
sqlite3_free(error_message);
throw std::runtime_error(std::string(action) + ": " + message);
}
}
inline SqliteStatementHandle PrepareStatement(
const SqliteDatabaseHandle& db_handle, std::string_view sql,
const char* action) {
sqlite3_stmt* raw_statement = nullptr;
const std::string sql_text(sql);
const int result = sqlite3_prepare_v2(db_handle.get(), sql_text.c_str(), -1,
&raw_statement, nullptr);
SqliteStatementHandle statement(raw_statement);
if (result != SQLITE_OK) {
ThrowSqliteError(db_handle.get(), action);
}
return statement;
}
inline void ResetStatement(SqliteStatementHandle& statement) {
if (statement != nullptr) {
sqlite3_reset(statement.get());
sqlite3_clear_bindings(statement.get());
}
}
inline void DeleteCharArray(void* data) noexcept {
delete[] static_cast<char*>(data);
}
inline void BindText(const SqliteStatementHandle& statement, int index,
std::string_view value, const char* action) {
const auto byte_count = value.size();
if (byte_count > static_cast<std::size_t>(std::numeric_limits<int>::max())) {
ThrowSqliteError(sqlite3_db_handle(statement.get()), action);
}
auto buffer = std::make_unique<char[]>(byte_count + 1);
std::memcpy(buffer.get(), value.data(), byte_count);
buffer[byte_count] = '\0';
char* raw_buffer = buffer.release();
const int result =
sqlite3_bind_text(statement.get(), index, raw_buffer,
static_cast<int>(byte_count), DeleteCharArray);
if (result != SQLITE_OK) {
DeleteCharArray(raw_buffer);
ThrowSqliteError(sqlite3_db_handle(statement.get()), action);
}
}
inline void BindDouble(const SqliteStatementHandle& statement, int index,
double value, std::string_view action) {
const int result = sqlite3_bind_double(statement.get(), index, value);
if (result != SQLITE_OK) {
ThrowSqliteError(sqlite3_db_handle(statement.get()), action);
}
}
inline void BindInt64(const SqliteStatementHandle& statement, int index,
sqlite3_int64 value, std::string_view action) {
const int result = sqlite3_bind_int64(statement.get(), index, value);
if (result != SQLITE_OK) {
ThrowSqliteError(sqlite3_db_handle(statement.get()), action);
}
}
inline void StepStatement(const SqliteDatabaseHandle& db_handle,
const SqliteStatementHandle& statement,
std::string_view action) {
const int result = sqlite3_step(statement.get());
if (result != SQLITE_DONE) {
ThrowSqliteError(db_handle.get(), action);
}
}
inline sqlite3_int64 LastInsertRowId(const SqliteDatabaseHandle& db_handle) {
return sqlite3_last_insert_rowid(db_handle.get());
}
inline void RollbackTransactionNoThrow(
const SqliteDatabaseHandle& db_handle) noexcept {
if (!db_handle) {
return;
}
sqlite3_exec(db_handle.get(), "ROLLBACK;", nullptr, nullptr, nullptr);
}
inline std::string SerializeLocalLanguages(
const std::vector<std::string>& local_languages) {
boost::json::array array;
array.reserve(local_languages.size());
for (const auto& language : local_languages) {
array.emplace_back(language);
}
return boost::json::serialize(array);
}
} // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_HELPERS_H_

View File

@@ -1,54 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_CURL_WEB_CLIENT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_CURL_WEB_CLIENT_H_
/**
* @file web_client/curl_web_client.h
* @brief libcurl-based WebClient implementation.
*/
#include "web_client/web_client.h"
/**
* @brief RAII wrapper for curl_global_init and curl_global_cleanup.
*
* Create one instance in application startup before using libcurl and keep it
* alive for application lifetime.
*/
class CurlGlobalState {
public:
/// @brief Initializes global libcurl state.
CurlGlobalState();
/// @brief Cleans up global libcurl state.
~CurlGlobalState();
/// @brief Non-copyable type.
CurlGlobalState(const CurlGlobalState&) = delete;
/// @brief Non-copyable type.
CurlGlobalState& operator=(const CurlGlobalState&) = delete;
};
/**
* @brief WebClient implementation backed by libcurl.
*/
class CURLWebClient : public WebClient {
public:
/**
* @brief Executes an HTTP GET request.
*
* @param url Request URL.
* @return Response body.
*/
std::string Get(const std::string& url) override;
/**
* @brief URL-encodes a string value.
*
* @param value Raw value.
* @return URL-encoded string.
*/
std::string UrlEncode(const std::string& value) override;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_CURL_WEB_CLIENT_H_

View File

@@ -0,0 +1,49 @@
/**
* @file web_client/http_web_client.h
* @brief cpp-httplib implementation of the WebClient interface.
*/
#ifndef BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_HTTP_WEB_CLIENT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_HTTP_WEB_CLIENT_H_
#include "web_client/web_client.h"
#include <string>
/**
* @brief WebClient implementation backed by cpp-httplib.
*
* Supports HTTP and HTTPS (requires OpenSSL; see HTTPLIB_REQUIRE_OPENSSL
* in CMakeLists.txt).
*
* URL parsing splits a full URL into origin (scheme://host[:port]) and
* path + query so that httplib::Client can be constructed correctly.
* A new client instance is created per request because the client is
* bound to a single origin at construction time.
*/
class HttpWebClient final : public WebClient {
public:
HttpWebClient() = default;
~HttpWebClient() override = default;
/**
* @brief Executes a blocking HTTP/HTTPS GET request against a full URL.
*
* @param url Fully-qualified URL, e.g. "https://en.wikipedia.org/api/rest_v1/page/summary/Berlin"
* @return Response body on HTTP 2xx.
* @throws std::runtime_error on any other outcome.
*/
std::string Get(const std::string& url) override;
/**
* @brief Percent-encodes a single URI component (query parameter value or
* path segment). Delegates to httplib::encode_uri_component().
*
* @param value Raw string to encode.
* @return Percent-encoded string safe for use in a URL.
*/
std::string EncodeURL(const std::string& value) override;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_HTTP_WEB_CLIENT_H_
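The class comment describes splitting a full URL into an origin (`scheme://host[:port]`) and a path plus query. A dependency-free sketch of that split follows. `SplitUrl` and `SplitUrlResult` are hypothetical names, fragments (`#...`) are not handled, and the pipeline's own parser may differ.

```cpp
#include <stdexcept>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical result type: the two pieces an origin-bound client needs.
struct SplitUrlResult {
  std::string origin;  // scheme://host[:port]
  std::string target;  // path + query, always starting with '/'
};

inline SplitUrlResult SplitUrl(std::string_view url) {
  constexpr std::string_view kSchemeSeparator = "://";
  const auto scheme_end = url.find(kSchemeSeparator);
  if (scheme_end == std::string_view::npos || scheme_end == 0) {
    throw std::invalid_argument("URL is missing a scheme");
  }

  const auto host_start = scheme_end + kSchemeSeparator.size();
  // The origin ends at the first '/' or '?' after the host.
  const auto origin_end = url.find_first_of("/?", host_start);
  if (host_start == url.size() || origin_end == host_start) {
    throw std::invalid_argument("URL is missing a host");
  }

  if (origin_end == std::string_view::npos) {
    // No path at all, e.g. "http://localhost:8080": request the root.
    return {std::string(url), "/"};
  }

  std::string target(url.substr(origin_end));
  if (target.front() != '/') {
    target.insert(target.begin(), '/');  // e.g. "https://host?x=1"
  }
  return {std::string(url.substr(0, origin_end)), std::move(target)};
}
```

The origin would be passed to the client constructor and the target to its GET call, which is why a new client per request is needed when origins differ.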

View File

@@ -30,7 +30,7 @@ class WebClient {
* @param value Raw string value. * @param value Raw string value.
* @return Encoded value safe for URL usage. * @return Encoded value safe for URL usage.
*/ */
virtual std::string UrlEncode(const std::string& value) = 0; virtual std::string EncodeURL(const std::string& value) = 0;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_WEB_CLIENT_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_WEB_CLIENT_H_

View File

@@ -0,0 +1,9 @@
# Ignore model files!
*.gguf
*.bin
models/
weights/
# Ignore local build folders
build/
.git/

View File

@@ -0,0 +1,72 @@
# --- Stage 1: Build Environment (The "Heavy" Stage) ---
FROM nvidia/cuda:12.6.3-devel-ubuntu24.04 AS builder
ENV DEBIAN_FRONTEND=noninteractive \
CMAKE_GENERATOR=Ninja
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential ca-certificates curl git libboost-json-dev \
libboost-program-options-dev libssl-dev ninja-build pkg-config zlib1g-dev \
&& rm -rf /var/lib/apt/lists/*
# Install modern CMake
RUN curl -L https://github.com/Kitware/CMake/releases/download/v3.31.0/cmake-3.31.0-linux-x86_64.sh -o cmake.sh && \
sh cmake.sh --skip-license --prefix=/usr/local && rm cmake.sh
# Get headers for C++ build
RUN curl -L https://github.com/ggml-org/llama.cpp/archive/refs/tags/b9012.tar.gz -o /tmp/llama-src.tar.gz && \
tar -xzf /tmp/llama-src.tar.gz -C /tmp && \
cp -r /tmp/llama.cpp-b9012/include/* /usr/local/include/ && \
cp -r /tmp/llama.cpp-b9012/ggml/include/* /usr/local/include/
# Pull llama.cpp binaries to use during build if needed
COPY --from=ghcr.io/ggml-org/llama.cpp:full-cuda /app/lib*.so* /usr/local/lib/
WORKDIR /app
COPY . .
# Build the C++ pipeline
RUN cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && \
cmake --build build -j$(nproc)
# --- Stage 2: Runtime Environment (The "Slim" Stage) ---
FROM nvidia/cuda:12.6.3-runtime-ubuntu24.04 AS runtime
# Install only necessary runtime shared libraries
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
ca-certificates \
libboost-json1.83.0 \
libboost-program-options1.83.0 \
libgomp1 \
libssl3 \
zlib1g \
&& rm -rf /var/lib/apt/lists/*
ENV APP_ROOT=/app \
LD_LIBRARY_PATH="/usr/local/lib:${LD_LIBRARY_PATH}"
WORKDIR /app/build
# Copy only the compiled binaries from the builder
COPY --from=builder /app/build/biergarten-pipeline ./
# Copy required config files
COPY locations.json /app/build/
COPY beer-styles.json /app/build/
# Copy prompt templates
COPY prompts /app/prompts
# Copy only the necessary shared libraries from builder/llama-bin
COPY --from=ghcr.io/ggml-org/llama.cpp:full-cuda /app/lib*.so* /usr/local/lib/
# Co-locate plugins
RUN cp /usr/local/lib/libggml-cuda.so . 2>/dev/null || true && \
cp /usr/local/lib/libggml-cpu*.so . 2>/dev/null || true
# Setup Start Script
COPY ./runpod/start.sh /usr/local/bin/biergarten-start
RUN chmod +x /usr/local/bin/biergarten-start
ENTRYPOINT ["/usr/local/bin/biergarten-start"]

View File

@@ -0,0 +1,8 @@
```bash
touch runpod/start.sh
docker build \
--progress=plain \
-t biergarten-pipeline:latest \
-f runpod/Dockerfile \
. 2>&1 | tee build.log
```

View File

@@ -0,0 +1,22 @@
name: biergarten-pipeline-live
imageName: biergarten-pipeline:latest
category: NVIDIA
containerDiskInGb: 50
volumeInGb: 50
volumeMountPath: /workspace
dockerEntrypoint:
- /usr/local/bin/biergarten-start
dockerStartCmd: []
isPublic: false
isServerless: false
env:
BIERGARTEN_MODE: live
BIERGARTEN_MODEL_PATH: /workspace/models/google_gemma-4-E4B-it-Q6_K.gguf
BIERGARTEN_PROMPT_DIR: /workspace/app/build/prompts
BIERGARTEN_OUTPUT_DIR: /workspace/output
BIERGARTEN_LOG_PATH: /workspace/logs/pipeline.log
BIERGARTEN_TEMPERATURE: "1.0"
BIERGARTEN_TOP_P: "0.95"
BIERGARTEN_TOP_K: "64"
BIERGARTEN_N_CTX: "8192"
BIERGARTEN_SEED: "-1"

View File

@@ -0,0 +1,58 @@
#!/bin/bash
set -e
MODEL_PATH="${BIERGARTEN_MODEL_PATH:-/workspace/models/google_gemma-4-E4B-it-Q6_K.gguf}"
OUTPUT_DIR="${BIERGARTEN_OUTPUT_DIR:-/workspace/output}"
LOG_PATH="${BIERGARTEN_LOG_PATH:-/workspace/logs/pipeline.log}"
EXECUTABLE="/app/build/biergarten-pipeline"
PROMPT_DIR="/app/prompts"
echo "--- Starting Biergarten Pipeline Environment Check ---"
# Ensure directories exist
mkdir -p "$OUTPUT_DIR"
mkdir -p "$(dirname "$LOG_PATH")"
mkdir -p "$(dirname "$MODEL_PATH")"
# Download model if missing
if [ ! -f "$MODEL_PATH" ]; then
echo "Model not found. Downloading (this may take a while)..."
curl -L -C - \
-o "$MODEL_PATH" \
"https://huggingface.co/bartowski/google_gemma-4-E4B-it-GGUF/resolve/main/google_gemma-4-E4B-it-Q6_K.gguf?download=true"
echo "Download complete."
fi
# Verify model exists
if [ ! -f "$MODEL_PATH" ]; then
echo "ERROR: Model still not found after download attempt."
exit 1
fi
# Default GPU layers
GL_LAYERS="${BIERGARTEN_GL_LAYERS:-40}"
# Build args
ARGS=(
"--model" "$MODEL_PATH"
"--prompt-dir" "$PROMPT_DIR"
"--output" "$OUTPUT_DIR"
"--log-path" "$LOG_PATH"
"--n-gpu-layers" "$GL_LAYERS"
)
# Optional params
[[ -n "$BIERGARTEN_TEMPERATURE" ]] && ARGS+=("--temperature" "$BIERGARTEN_TEMPERATURE")
[[ -n "$BIERGARTEN_TOP_P" ]] && ARGS+=("--top-p" "$BIERGARTEN_TOP_P")
[[ -n "$BIERGARTEN_TOP_K" ]] && ARGS+=("--top-k" "$BIERGARTEN_TOP_K")
[[ -n "$BIERGARTEN_N_CTX" ]] && ARGS+=("--n-ctx" "$BIERGARTEN_N_CTX")
[[ -n "$BIERGARTEN_SEED" ]] && ARGS+=("--seed" "$BIERGARTEN_SEED")
# Extra args
[[ -n "$BIERGARTEN_EXTRA_ARGS" ]] && ARGS+=($BIERGARTEN_EXTRA_ARGS)
echo "--- Executing: $EXECUTABLE ${ARGS[*]} ---"
exec "$EXECUTABLE" "${ARGS[@]}"

View File

@@ -0,0 +1,158 @@
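#include <boost/program_options.hpp>
#include <spdlog/spdlog.h>
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>
#include "data_model/models.h"
namespace prog_opts = boost::program_options;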
std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv) {
prog_opts::options_description desc("Pipeline Options");
auto opt = desc.add_options();
opt("help,h", "Produce help message");
// Defaults sourced from SamplingOptions{} so the CLI and LlamaGenerator
// share a single source of truth — changing the struct updates both.
auto add_sampling_options = [&]() -> void {
const SamplingOptions sampling_defaults{};
opt("temperature",
prog_opts::value<float>()->default_value(sampling_defaults.temperature),
"Sampling temperature (higher = more random)");
opt("top-p",
prog_opts::value<float>()->default_value(sampling_defaults.top_p),
"Nucleus sampling top-p in (0,1] (higher = more random)");
opt("top-k",
prog_opts::value<uint32_t>()->default_value(sampling_defaults.top_k),
"Top-k sampling parameter (higher = more candidate tokens)");
opt("n-ctx",
prog_opts::value<uint32_t>()->default_value(sampling_defaults.n_ctx),
"Context window size in tokens");
opt("seed", prog_opts::value<int>()->default_value(sampling_defaults.seed),
"Sampler seed: -1 for random, otherwise non-negative integer");
opt("n-gpu-layers", prog_opts::value<int>()->default_value(0),
"Number of layers to offload to GPU");
};
// --mocked and --model are mutually exclusive; validation is enforced below
// rather than at registration to produce a clear diagnostic message.
auto add_generator_options = [&]() -> void {
opt("mocked", prog_opts::bool_switch(),
"Use mocked generator for brewery/user data");
opt("model,m", prog_opts::value<std::string>()->default_value(""),
"Path to LLM model (gguf)");
};
auto add_pipeline_options = [&]() -> void {
opt("output,o", prog_opts::value<std::string>()->default_value("output"),
"Directory for generated artifacts");
opt("log-path",
prog_opts::value<std::string>()->default_value("pipeline.log"),
"Path for application logs");
opt("prompt-dir", prog_opts::value<std::string>()->default_value(""),
"Directory containing named prompt files (e.g. BREWERY_GENERATION.md)."
" Required when not using --mocked.");
opt("location-count", prog_opts::value<uint32_t>()->default_value(10),
"Maximum number of locations to sample");
};
add_sampling_options();
add_generator_options();
add_pipeline_options();
// No flags provided — treat as a help request rather than an error.
if (argc == 1) {
spdlog::info("Biergarten Pipeline");
std::stringstream usage_stream;
usage_stream << "\nUsage: biergarten-pipeline [options]\n\n" << desc;
spdlog::info(usage_stream.str());
return std::nullopt;
}
try {
prog_opts::variables_map var_map;
prog_opts::store(prog_opts::parse_command_line(argc, argv, desc), var_map);
prog_opts::notify(var_map);
if (var_map.contains("help")) {
std::stringstream help_stream;
help_stream << "\n" << desc;
spdlog::info(help_stream.str());
return std::nullopt;
}
ApplicationOptions options;
options.pipeline.output_path = var_map["output"].as<std::string>();
options.pipeline.log_path = var_map["log-path"].as<std::string>();
options.pipeline.prompt_dir = var_map["prompt-dir"].as<std::string>();
options.pipeline.location_count =
var_map["location-count"].as<uint32_t>();
const bool use_mocked = var_map["mocked"].as<bool>();
const std::string model_path = var_map["model"].as<std::string>();
const int n_gpu_layers = var_map["n-gpu-layers"].as<int>();
// Enforce mutual exclusivity before any further configuration is applied.
if (use_mocked && !model_path.empty()) {
spdlog::error(
"Invalid arguments: --mocked and --model are mutually exclusive");
return std::nullopt;
}
if (!use_mocked && model_path.empty()) {
spdlog::error(
"Invalid arguments: either --mocked or --model must be specified");
return std::nullopt;
}
// Prompt directory is only meaningful for live inference — the mock
// generator has no use for it and should not require it to be present.
if (!use_mocked && options.pipeline.prompt_dir.empty()) {
spdlog::error(
"Invalid arguments: --prompt-dir is required when not using "
"--mocked");
return std::nullopt;
}
options.generator.use_mocked = use_mocked;
options.generator.model_path = model_path;
// options.generator.n_gpu_layers = n_gpu_layers;
// Only populate sampling config when the user explicitly overrides at
// least one value. Leaving it as std::nullopt lets LlamaGenerator fall
// back to its own SamplingOptions{} defaults, keeping the two paths
// consistent without redundant copies.
const bool user_provided_sampling =
!var_map["temperature"].defaulted() || !var_map["top-p"].defaulted() ||
!var_map["top-k"].defaulted() || !var_map["n-ctx"].defaulted() ||
!var_map["seed"].defaulted() || !var_map["n-gpu-layers"].defaulted();
if (user_provided_sampling) {
// Warn but do not fail — the run is still valid, the flags are just
// silently irrelevant when no model is loaded.
if (use_mocked) {
spdlog::warn("Sampling parameters are ignored when using --mocked");
} else {
SamplingOptions sampling;
sampling.temperature = var_map["temperature"].as<float>();
sampling.top_p = var_map["top-p"].as<float>();
sampling.top_k = var_map["top-k"].as<uint32_t>();
sampling.n_ctx = var_map["n-ctx"].as<uint32_t>();
sampling.seed = var_map["seed"].as<int>();
sampling.n_gpu_layers = var_map["n-gpu-layers"].as<int>();
options.generator.sampling = sampling;
}
}
return options;
} catch (const std::exception& exception) {
spdlog::error("Failed to parse command-line arguments: {}",
exception.what());
return std::nullopt;
} catch (...) {
spdlog::error("Failed to parse command-line arguments: unknown error");
return std::nullopt;
}
}
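The sampling block above leaves `options.generator.sampling` unset unless at least one flag was overridden. That decision can be isolated as a small pure function, sketched here without Boost. `ParsedFlag`, `ResolveSampling`, the trimmed `SamplingOptions`, and its default values are all assumptions for illustration, not the pipeline's definitions.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, trimmed-down mirror of SamplingOptions; the default values
// here are illustrative only.
struct SamplingOptions {
  float temperature = 1.0F;
  float top_p = 0.95F;
  std::uint32_t top_k = 64;
};

// Hypothetical parsed flag: the value plus whether it came from the default,
// mirroring what variable_value::defaulted() reports.
template <typename T>
struct ParsedFlag {
  T value;
  bool defaulted;
};

// Returns std::nullopt when every flag is defaulted (the generator then uses
// its own defaults) or when the mocked generator is selected.
inline std::optional<SamplingOptions> ResolveSampling(
    const ParsedFlag<float>& temperature, const ParsedFlag<float>& top_p,
    const ParsedFlag<std::uint32_t>& top_k, bool use_mocked) {
  const bool user_provided =
      !temperature.defaulted || !top_p.defaulted || !top_k.defaulted;
  if (!user_provided || use_mocked) {
    return std::nullopt;
  }
  return SamplingOptions{temperature.value, top_p.value, top_k.value};
}
```

Keeping the check in one function makes it straightforward to test that every sampling flag takes part in the "user provided" decision.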

View File

@@ -10,7 +10,9 @@
BiergartenDataGenerator::BiergartenDataGenerator( BiergartenDataGenerator::BiergartenDataGenerator(
std::unique_ptr<IEnrichmentService> context_service, std::unique_ptr<IEnrichmentService> context_service,
std::unique_ptr<DataGenerator> generator, std::unique_ptr<DataGenerator> generator,
std::unique_ptr<IExportService> exporter) std::unique_ptr<IExportService> exporter,
const ApplicationOptions &app_options)
: context_service_(std::move(context_service)), : context_service_(std::move(context_service)),
generator_(std::move(generator)), generator_(std::move(generator)),
exporter_(std::move(exporter)) {} exporter_(std::move(exporter)),
application_options_(app_options) {}

View File

@@ -13,8 +13,6 @@
#include "biergarten_data_generator.h" #include "biergarten_data_generator.h"
#include "json_handling/json_loader.h" #include "json_handling/json_loader.h"
static constexpr size_t kBreweryAmount = 50;
std::vector<Location> BiergartenDataGenerator::QueryCitiesWithCountries() { std::vector<Location> BiergartenDataGenerator::QueryCitiesWithCountries() {
spdlog::info("\n=== GEOGRAPHIC DATA OVERVIEW ==="); spdlog::info("\n=== GEOGRAPHIC DATA OVERVIEW ===");
@@ -23,7 +21,9 @@ std::vector<Location> BiergartenDataGenerator::QueryCitiesWithCountries() {
auto all_locations = JsonLoader::LoadLocations(locations_path); auto all_locations = JsonLoader::LoadLocations(locations_path);
spdlog::info(" Locations available: {}", all_locations.size()); spdlog::info(" Locations available: {}", all_locations.size());
const size_t sample_count = std::min(kBreweryAmount, all_locations.size()); const size_t sample_count = std::min(
static_cast<size_t>(application_options_.pipeline.location_count),
all_locations.size());
const auto sample_count_signed = const auto sample_count_signed =
static_cast<std::iter_difference_t<decltype(all_locations.cbegin())>>( static_cast<std::iter_difference_t<decltype(all_locations.cbegin())>>(

View File

@@ -21,8 +21,8 @@ bool BiergartenDataGenerator::Run() {
for (auto& city : cities) { for (auto& city : cities) {
try { try {
std::string region_context = context_service_->GetLocationContext(city); std::string region_context = context_service_->GetLocationContext(city);
spdlog::debug("[Pipeline] Context for '{}' ({}) gathered:\n{}", // spdlog::debug("[Pipeline] Context for '{}' ({}) gathered:\n{}",
city.city, city.country, region_context); // city.city, city.iso3166_2, region_context);
enriched.push_back( enriched.push_back(
EnrichedCity{.location = std::move(city), EnrichedCity{.location = std::move(city),

View File

@@ -33,6 +33,9 @@ static std::string FormatLocalLanguageCodes(
return formatted; return formatted;
} }
// GBNF grammar for structured brewery JSON output.
// @TODO move to a separate gbnf file if it grows in complexity or is shared
// across modules.
static constexpr std::string_view kBreweryJsonGrammar = R"json_brewery( static constexpr std::string_view kBreweryJsonGrammar = R"json_brewery(
root ::= thought-block "{" ws "\"name_en\"" ws ":" ws string ws "," ws "\"description_en\"" ws ":" ws string ws "," ws "\"name_local\"" ws ":" ws string ws "," ws "\"description_local\"" ws ":" ws string ws "}" ws root ::= thought-block "{" ws "\"name_en\"" ws ":" ws string ws "," ws "\"description_en\"" ws ":" ws string ws "," ws "\"name_local\"" ws ":" ws string ws "," ws "\"description_local\"" ws ":" ws string ws "}" ws
thought-block ::= [^{]* thought-block ::= [^{]*
@@ -59,11 +62,12 @@ BreweryResult LlamaGenerator::GenerateBrewery(
location.country.empty() ? std::string{} location.country.empty() ? std::string{}
: std::format(", {}", location.country); : std::format(", {}", location.country);
/** /**
* Load brewery system prompt from file * Load brewery system prompt via the injected prompt directory.
* Falls back to minimal inline prompt if file not found * The key "BREWERY_GENERATION" resolves to BREWERY_GENERATION.md inside
* the configured --prompt-dir. Throws on missing or empty file.
*/ */
const std::string system_prompt = const std::string system_prompt =
LoadBrewerySystemPrompt("prompts/system.md"); prompt_directory_->Load("BREWERY_GENERATION");
std::string user_prompt = std::format( std::string user_prompt = std::format(
"## CITY:\n{}\n\n## COUNTRY:\n{}\n\n## LOCAL LANGUAGE CODES:\n{}\n\n## " "## CITY:\n{}\n\n## COUNTRY:\n{}\n\n## LOCAL LANGUAGE CODES:\n{}\n\n## "

View File

@@ -12,6 +12,13 @@
#include "data_generation/llama_generator.h" #include "data_generation/llama_generator.h"
#include "data_generation/llama_generator_helpers.h" #include "data_generation/llama_generator_helpers.h"
// TODO: Implement locale-aware user profile generation.
// Current implementation returns a hardcoded test value and ignores the
// locale parameter. Future implementation should:
// 1. Load a USER_GENERATION.md prompt template with locale context
// 2. Perform LLM inference with locale-specific username/bio generation
// 3. Parse and validate JSON output with retry handling (similar to brewery)
// 4. Return locale-aware username and biography
UserResult LlamaGenerator::GenerateUser(const std::string& locale) { UserResult LlamaGenerator::GenerateUser(const std::string& locale) {
return {.username = "test_user", return {.username = "test_user",
.bio = "This is a test user profile from " + locale + "."}; .bio = "This is a test user profile from " + locale + "."};

View File

@@ -58,6 +58,11 @@ static std::string CondenseWhitespace(std::string_view text) {
return out; return out;
} }
// Guard against snapping back to a space in the first half of the string.
// Truncating that early would discard more than half of the allowed
// context just to end on a word boundary.
static constexpr size_t kTruncationGuardDivisor = 2;
/** /**
* Truncate region context to fit within max length while preserving word * Truncate region context to fit within max length while preserving word
* boundaries * boundaries
@@ -71,7 +76,8 @@ std::string PrepareRegionContext(std::string_view region_context,
normalized.resize(max_chars); normalized.resize(max_chars);
const size_t last_space = normalized.find_last_of(' '); const size_t last_space = normalized.find_last_of(' ');
if (last_space != std::string::npos && last_space > max_chars / 2) { if (last_space != std::string::npos &&
last_space > max_chars / kTruncationGuardDivisor) {
normalized.resize(last_space); normalized.resize(last_space);
} }
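
The guarded truncation in this hunk can be illustrated in isolation. What follows is a minimal sketch, not the pipeline's `PrepareRegionContext` (which also condenses whitespace first). The helper name `TruncateAtWord` and the local constant `kGuardDivisor` are hypothetical stand-ins for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in mirroring kTruncationGuardDivisor from the diff.
constexpr std::size_t kGuardDivisor = 2;

// Truncates text to at most max_chars, preferring to end on a word boundary.
// The cut only moves back to the last space when that space lies past
// max_chars / kGuardDivisor, so at most half of the budget is given up.
std::string TruncateAtWord(std::string_view text, std::size_t max_chars) {
  assert(max_chars > 0);
  std::string out(text);
  if (out.size() <= max_chars) {
    return out;
  }
  out.resize(max_chars);
  const std::size_t last_space = out.find_last_of(' ');
  if (last_space != std::string::npos &&
      last_space > max_chars / kGuardDivisor) {
    out.resize(last_space);
  }
  return out;
}
```

With a limit of 12, `"alpha beta gamma"` is cut back to `"alpha beta"` because the last space falls in the second half. With a limit of 10, `"a bcdefghijkl"` keeps its hard cut because the only space is too early.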

View File

@@ -19,6 +19,9 @@
#include "llama.h" #include "llama.h"
static constexpr size_t kPromptTokenSlack = 8; static constexpr size_t kPromptTokenSlack = 8;
// Minimum tokens to keep when using top-p sampling. Ensures at least one
// candidate token remains available even with very restrictive top-p values.
static constexpr size_t kTopPMinKeep = 1;
namespace { namespace {
@@ -62,7 +65,7 @@ SamplerHandle MakeSamplerChain(const llama_vocab* vocab,
"LlamaGenerator: failed to initialize temperature sampler"); "LlamaGenerator: failed to initialize temperature sampler");
add_sampler(llama_sampler_init_top_k(static_cast<int32_t>(config.top_k)), add_sampler(llama_sampler_init_top_k(static_cast<int32_t>(config.top_k)),
"LlamaGenerator: failed to initialize top-k sampler"); "LlamaGenerator: failed to initialize top-k sampler");
add_sampler(llama_sampler_init_top_p(config.top_p, 1), add_sampler(llama_sampler_init_top_p(config.top_p, kTopPMinKeep),
"LlamaGenerator: failed to initialize top-p sampler"); "LlamaGenerator: failed to initialize top-p sampler");
add_sampler(llama_sampler_init_dist(config.seed), add_sampler(llama_sampler_init_dist(config.seed),
"LlamaGenerator: failed to initialize distribution sampler"); "LlamaGenerator: failed to initialize distribution sampler");

View File

@@ -11,7 +11,7 @@
#include <stdexcept> #include <stdexcept>
#include <string> #include <string>
#include "data_model/application_options.h" #include "data_model/models.h"
#include "llama.h" #include "llama.h"
static constexpr uint32_t kMaxContextSize = 32768U; static constexpr uint32_t kMaxContextSize = 32768U;
@@ -32,9 +32,11 @@ void LlamaGenerator::ContextDeleter::operator()(
LlamaGenerator::LlamaGenerator( LlamaGenerator::LlamaGenerator(
const ApplicationOptions& options, const std::string& model_path, const ApplicationOptions& options, const std::string& model_path,
std::unique_ptr<IPromptFormatter> prompt_formatter) std::unique_ptr<IPromptFormatter> prompt_formatter,
std::unique_ptr<IPromptDirectory> prompt_directory)
: rng_(std::random_device{}()), : rng_(std::random_device{}()),
prompt_formatter_(std::move(prompt_formatter)) { prompt_formatter_(std::move(prompt_formatter)),
prompt_directory_(std::move(prompt_directory)) {
if (model_path.empty()) { if (model_path.empty()) {
throw std::runtime_error("LlamaGenerator: model path must not be empty"); throw std::runtime_error("LlamaGenerator: model path must not be empty");
} }
@@ -44,41 +46,50 @@ LlamaGenerator::LlamaGenerator(
"LlamaGenerator: prompt formatter dependency must not be null"); "LlamaGenerator: prompt formatter dependency must not be null");
} }
if (options.temperature < 0.0F) { if (!prompt_directory_) {
throw std::runtime_error(
"LlamaGenerator: prompt directory dependency must not be null");
}
const auto sampling = options.generator.sampling.value_or(SamplingOptions{});
if (sampling.temperature < 0.0F) {
throw std::runtime_error( throw std::runtime_error(
"LlamaGenerator: sampling temperature must be >= 0"); "LlamaGenerator: sampling temperature must be >= 0");
} }
if (options.top_p <= 0.0F || options.top_p > 1.0F) { if (sampling.top_p <= 0.0F || sampling.top_p > 1.0F) {
throw std::runtime_error( throw std::runtime_error(
"LlamaGenerator: sampling top-p must be in (0, 1]"); "LlamaGenerator: sampling top-p must be in (0, 1]");
} }
if (options.top_k == 0U) { if (sampling.top_k == 0U) {
throw std::runtime_error("LlamaGenerator: sampling top-k must be > 0"); throw std::runtime_error("LlamaGenerator: sampling top-k must be > 0");
} }
if (options.seed < -1) { if (sampling.seed < -1) {
throw std::runtime_error( throw std::runtime_error(
"LlamaGenerator: seed must be >= 0, or -1 for random"); "LlamaGenerator: seed must be >= 0, or -1 for random");
} }
if (options.n_ctx == 0 || options.n_ctx > kMaxContextSize) { if (sampling.n_ctx == 0 || sampling.n_ctx > kMaxContextSize) {
throw std::runtime_error( throw std::runtime_error(
"LlamaGenerator: context size must be in range [1, 32768]"); "LlamaGenerator: context size must be in range [1, 32768]");
} }
sampling_temperature_ = options.temperature; sampling_temperature_ = sampling.temperature;
sampling_top_p_ = options.top_p; sampling_top_p_ = sampling.top_p;
sampling_top_k_ = options.top_k; sampling_top_k_ = sampling.top_k;
if (options.seed == -1) { if (sampling.seed == -1) {
std::random_device random_device; std::random_device random_device;
rng_.seed(random_device()); rng_.seed(random_device());
} else { } else {
rng_.seed(static_cast<uint32_t>(options.seed)); rng_.seed(static_cast<uint32_t>(sampling.seed));
} }
n_ctx_ = options.n_ctx;
n_ctx_ = sampling.n_ctx;
n_gpu_layers_ = sampling.n_gpu_layers;
this->Load(model_path); this->Load(model_path);
} }

View File

@@ -12,13 +12,23 @@
#include <utility> #include <utility>
#include "data_generation/llama_generator.h" #include "data_generation/llama_generator.h"
#include "ggml-backend.h"
#include "llama.h" #include "llama.h"
// Maximum batch size for decode operations. Capping the batch prevents
// excessive memory allocation while maintaining inference performance.
static constexpr uint32_t kMaxBatchSize = 5000U;
void LlamaGenerator::Load(const std::string& model_path) { void LlamaGenerator::Load(const std::string& model_path) {
context_.reset(); context_.reset();
model_.reset(); model_.reset();
const llama_model_params model_params = llama_model_default_params(); // Specifically load dynamic ggml backends (like CUDA) that are provided
// externally before attempting to load a model.
ggml_backend_load_all();
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = n_gpu_layers_;
LlamaGenerator::ModelHandle loaded_model( LlamaGenerator::ModelHandle loaded_model(
llama_model_load_from_file(model_path.c_str(), model_params)); llama_model_load_from_file(model_path.c_str(), model_params));
if (!loaded_model) { if (!loaded_model) {
@@ -28,7 +38,7 @@ void LlamaGenerator::Load(const std::string& model_path) {
llama_context_params context_params = llama_context_default_params(); llama_context_params context_params = llama_context_default_params();
context_params.n_ctx = n_ctx_; context_params.n_ctx = n_ctx_;
context_params.n_batch = std::min(n_ctx_, static_cast<uint32_t>(5000)); context_params.n_batch = std::min(n_ctx_, kMaxBatchSize);
LlamaGenerator::ContextHandle loaded_context( LlamaGenerator::ContextHandle loaded_context(
llama_init_from_model(loaded_model.get(), context_params)); llama_init_from_model(loaded_model.get(), context_params));

View File

@@ -1,55 +0,0 @@
/**
* @file data_generation/llama/load_brewery_prompt.cc
* @brief Resolves brewery system prompt content from cache or a configured
* filesystem path and provides a robust inline fallback prompt when absent.
*/
#include <spdlog/spdlog.h>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include "data_generation/llama_generator.h"
/**
* @brief Loads brewery system prompt from disk or cache.
*
* @param prompt_file_path Preferred prompt file location.
* @return Prompt text loaded from disk.
*/
std::string LlamaGenerator::LoadBrewerySystemPrompt(
const std::filesystem::path& prompt_file_path) {
// Return cached version if already loaded
if (!brewery_system_prompt_.empty()) {
return brewery_system_prompt_;
}
std::ifstream prompt_file(prompt_file_path);
if (!prompt_file.is_open()) {
spdlog::error(
"LlamaGenerator: Failed to open brewery system prompt file '{}'",
prompt_file_path.string());
throw std::runtime_error(
"LlamaGenerator: missing brewery system prompt file: " +
prompt_file_path.string());
}
const std::string prompt((std::istreambuf_iterator(prompt_file)),
std::istreambuf_iterator<char>());
prompt_file.close();
if (prompt.empty()) {
spdlog::error("LlamaGenerator: Brewery system prompt file '{}' is empty",
prompt_file_path.string());
throw std::runtime_error(
"LlamaGenerator: empty brewery system prompt file: " +
prompt_file_path.string());
}
spdlog::info(
"LlamaGenerator: Loaded brewery system prompt from '{}' ({} chars)",
prompt_file_path.string(), prompt.length());
brewery_system_prompt_ = prompt;
return brewery_system_prompt_;
}

View File

@@ -17,9 +17,9 @@ BreweryResult MockGenerator::GenerateBrewery(
const std::string_view adjective = const std::string_view adjective =
kBreweryAdjectives.at(hash % kBreweryAdjectives.size()); kBreweryAdjectives.at(hash % kBreweryAdjectives.size());
const std::string_view noun = const std::string_view noun =
kBreweryNouns.at(hash / 7 % kBreweryNouns.size()); kBreweryNouns.at(hash / kNounHashStride % kBreweryNouns.size());
const std::string_view base_description = const std::string_view base_description = kBreweryDescriptions.at(
kBreweryDescriptions.at((hash / 13) % kBreweryDescriptions.size()); (hash / kDescriptionHashStride) % kBreweryDescriptions.size());
const std::string name = const std::string name =
std::format("{} {} {}", location.city, adjective, noun); std::format("{} {} {}", location.city, adjective, noun);

View File

@@ -15,7 +15,7 @@ UserResult MockGenerator::GenerateUser(const std::string& locale) {
UserResult result; UserResult result;
const std::string_view username = kUsernames[hash % kUsernames.size()]; const std::string_view username = kUsernames[hash % kUsernames.size()];
const std::string_view bio = kBios[hash / 11 % kBios.size()]; const std::string_view bio = kBios[hash / kBioHashStride % kBios.size()];
result.username = username; result.username = username;
result.bio = bio; result.bio = bio;
return result; return result;
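
The stride constants in the mock generator hunks replace the literals 7, 13 and 11 that divided the hash before the modulo. The sketch below shows the idea under stated assumptions: the word lists, `PickIndex` and `PickName` are hypothetical, and only the stride value 7 is taken from the replaced literal in the diff.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>

// Mirrors the literal 7 that kNounHashStride replaces in the diff.
constexpr std::size_t kNounStride = 7;

// Hypothetical word lists; the real tables live in the mock generator.
constexpr std::array<std::string_view, 3> kAdjectives = {"Golden", "Hoppy",
                                                         "Rustic"};
constexpr std::array<std::string_view, 4> kNouns = {"Barrel", "Kettle",
                                                    "Cellar", "Tap"};

// Dividing by a stride before the modulo makes this index change more slowly
// than hash % size, so two fields drawn from one hash do not move in lockstep.
std::size_t PickIndex(std::size_t hash, std::size_t stride, std::size_t size) {
  assert(stride > 0 && size > 0);
  return hash / stride % size;
}

// Deterministically picks an adjective and a noun from a single hash value.
std::pair<std::string_view, std::string_view> PickName(std::size_t hash) {
  return {kAdjectives.at(PickIndex(hash, 1, kAdjectives.size())),
          kNouns.at(PickIndex(hash, kNounStride, kNouns.size()))};
}
```

The same hash always yields the same pair, which is what keeps the mock output reproducible between runs.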

View File

@@ -8,166 +8,82 @@
#include <boost/di.hpp> #include <boost/di.hpp>
#include <boost/program_options.hpp> #include <boost/program_options.hpp>
#include <chrono>
#include <exception> #include <exception>
#include <memory> #include <memory>
#include <optional> #include <optional>
#include <sstream>
#include <string> #include <string>
#include "biergarten_data_generator.h" #include "biergarten_data_generator.h"
#include "data_generation/llama_generator.h" #include "data_generation/llama_generator.h"
#include "data_generation/mock_generator.h" #include "data_generation/mock_generator.h"
#include "data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.h" #include "data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.h"
#include "data_model/application_options.h" #include "data_model/models.h"
#include "llama_backend_state.h" #include "llama_backend_state.h"
#include "services/enrichment_service.h" #include "services/database/export_service.h"
#include "services/export_service.h" #include "services/database/sqlite_export_service.h"
#include "services/sqlite_export_service.h" #include "services/datetime/timer.h"
#include "services/wikipedia_service.h" #include "services/enrichment/enrichment_service.h"
#include "web_client/curl_web_client.h" #include "services/enrichment/mock_enrichment.h"
#include "services/enrichment/wikipedia_service.h"
#include "services/prompting/prompt_directory.h"
#include "web_client/http_web_client.h"
namespace prog_opts = boost::program_options;
namespace di = boost::di; namespace di = boost::di;
/**
* @brief Parse command-line arguments into ApplicationOptions.
*
* @param argc Command-line argument count.
* @param argv Command-line arguments.
* @return Parsed ApplicationOptions if parsing succeeded, std::nullopt
* otherwise.
*/
std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv) {
prog_opts::options_description desc("Pipeline Options");
auto opt = desc.add_options();
opt("help,h", "Produce help message");
opt("mocked", prog_opts::bool_switch(),
"Use mocked generator for brewery/user data");
opt("model,m", prog_opts::value<std::string>()->default_value(""),
"Path to LLM model (gguf)");
opt("temperature", prog_opts::value<float>()->default_value(1.0F),
"Sampling temperature (higher = more random)");
opt("top-p", prog_opts::value<float>()->default_value(0.95F),
"Nucleus sampling top-p in (0,1] (higher = more random)");
opt("top-k", prog_opts::value<uint32_t>()->default_value(64),
"Top-k sampling parameter (higher = more candidate tokens)");
opt("n-ctx", prog_opts::value<uint32_t>()->default_value(8192),
"Context window size in tokens (1-32768)");
opt("seed", prog_opts::value<int>()->default_value(-1),
"Sampler seed: -1 for random, otherwise non-negative integer");
// Handle the "no arguments" or "help" case
if (argc == 1) {
spdlog::info("Biergarten Pipeline");
std::stringstream usage_stream;
usage_stream << "\nUsage: biergarten-pipeline [options]\n\n" << desc;
spdlog::info(usage_stream.str());
return std::nullopt;
}
try {
prog_opts::variables_map variables_map;
prog_opts::store(prog_opts::parse_command_line(argc, argv, desc),
variables_map);
prog_opts::notify(variables_map);
if (variables_map.contains("help")) {
std::stringstream help_stream;
help_stream << "\n" << desc;
spdlog::info(help_stream.str());
return std::nullopt;
}
const auto use_mocked = variables_map["mocked"].as<bool>();
const auto model_path = variables_map["model"].as<std::string>();
if (use_mocked && !model_path.empty()) {
spdlog::error(
"Invalid arguments: --mocked and --model are mutually exclusive");
return std::nullopt;
}
if (!use_mocked && model_path.empty()) {
spdlog::error(
"Invalid arguments: Either --mocked or --model must be specified");
return std::nullopt;
}
const bool has_llm_params = !variables_map["temperature"].defaulted() ||
!variables_map["top-p"].defaulted() ||
!variables_map["top-k"].defaulted() ||
!variables_map["seed"].defaulted();
if (use_mocked && has_llm_params) {
spdlog::warn(
"Sampling parameters (--temperature, --top-p, --top-k, --seed) are"
" ignored when using --mocked");
}
ApplicationOptions options;
options.use_mocked = use_mocked;
options.model_path = model_path;
options.temperature = variables_map["temperature"].as<float>();
options.top_p = variables_map["top-p"].as<float>();
options.top_k = variables_map["top-k"].as<uint32_t>();
options.n_ctx = variables_map["n-ctx"].as<uint32_t>();
options.seed = variables_map["seed"].as<int>();
return options;
} catch (const std::exception& exception) {
spdlog::error("Failed to parse command-line arguments: {}",
exception.what());
return std::nullopt;
} catch (...) {
spdlog::error("Failed to parse command-line arguments: unknown error");
return std::nullopt;
}
}
struct Timer {
std::chrono::steady_clock::time_point start_time =
std::chrono::steady_clock::now();
[[nodiscard]] int64_t Elapsed() const {
return std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::steady_clock::now() - start_time)
.count();
}
};
int main(const int argc, char** argv) { int main(const int argc, char** argv) {
try { try {
Timer timer; Timer timer;
const CurlGlobalState curl_state;
const LlamaBackendState llama_backend_state;
spdlog::set_pattern("[%Y-%m-%d %H:%M:%S.%e] [%^%l%$] %v"); spdlog::set_pattern("[%Y-%m-%d %H:%M:%S.%e] [%^%l%$] %v");
const auto parsed_options = ParseArguments(argc, argv); #ifndef BIERGARTEN_MOCK_ONLY
const LlamaBackendState llama_backend_state;
#endif
#ifdef DEBUG
spdlog::set_level(spdlog::level::debug);
#endif
const std::optional<ApplicationOptions> parsed_options =
ParseArguments(argc, argv);
if (!parsed_options.has_value()) { if (!parsed_options.has_value()) {
return 0; return 0;
} }
const auto options = *parsed_options; const auto options = *parsed_options;
const std::string model_path = options.generator.model_path.string();
const auto sampling =
options.generator.sampling.value_or(SamplingOptions{});
std::unique_ptr<IPromptDirectory> prompt_directory;
if (!options.generator.use_mocked) {
try {
prompt_directory =
std::make_unique<PromptDirectory>(options.pipeline.prompt_dir);
} catch (const std::exception& dir_error) {
spdlog::error("[Startup] Invalid --prompt-dir: {}", dir_error.what());
return 1;
}
}
const auto injector = di::make_injector( const auto injector = di::make_injector(
di::bind<WebClient>().to<CURLWebClient>(),
di::bind<ApplicationOptions>().to(options), di::bind<ApplicationOptions>().to(options),
di::bind<IEnrichmentService>().to<WikipediaService>(), di::bind<std::string>().to(model_path),
di::bind<WebClient>().to<HttpWebClient>(),
di::bind<IExportService>().to<SqliteExportService>(), di::bind<IExportService>().to<SqliteExportService>(),
di::bind<IPromptFormatter>().to<Gemma4JinjaPromptFormatter>(), di::bind<IPromptFormatter>().to<Gemma4JinjaPromptFormatter>(),
di::bind<std::string>().to(options.model_path), di::bind<IEnrichmentService>().to(
[options](const auto& inj) -> std::unique_ptr<IEnrichmentService> {
if (options.generator.use_mocked) {
return std::make_unique<MockEnrichmentService>();
}
return std::make_unique<WikipediaEnrichmentService>(
inj.template create<std::unique_ptr<WebClient>>());
}),
di::bind<DataGenerator>().to( di::bind<DataGenerator>().to(
[options](const auto& inj) -> std::unique_ptr<DataGenerator> { [options, model_path, sampling, &prompt_directory](
if (options.use_mocked) { const auto& inj) -> std::unique_ptr<DataGenerator> {
if (options.generator.use_mocked) {
spdlog::info( spdlog::info(
"[Generator] Using MockGenerator (no model path provided)"); "[Generator] Using MockGenerator (no model path provided)");
return std::make_unique<MockGenerator>(); return std::make_unique<MockGenerator>();
@@ -176,12 +92,17 @@ int main(const int argc, char** argv) {
spdlog::info( spdlog::info(
"[Generator] Using LlamaGenerator: {} (temperature={}, " "[Generator] Using LlamaGenerator: {} (temperature={}, "
"top-p={}, top-k={}, n_ctx={}, seed={})", "top-p={}, top-k={}, n_ctx={}, seed={})",
options.model_path, options.temperature, options.top_p, model_path, sampling.temperature, sampling.top_p,
options.top_k, options.n_ctx, options.seed); sampling.top_k, sampling.n_ctx, sampling.seed);
return inj.template create<std::unique_ptr<LlamaGenerator>>(); return std::make_unique<LlamaGenerator>(
})); options, model_path,
inj.template create<std::unique_ptr<IPromptFormatter>>(),
std::move(prompt_directory));
})
auto generator = );
const auto generator =
injector.create<std::unique_ptr<BiergartenDataGenerator>>(); injector.create<std::unique_ptr<BiergartenDataGenerator>>();
if (!generator->Run()) { if (!generator->Run()) {

View File

@@ -0,0 +1,112 @@
/**
* @file wikipedia/fetch_extract.cc
*/
#include <spdlog/spdlog.h>
#include <boost/json.hpp>
#include <chrono>
#include <format>
#include <string>
#include <string_view>
#include <thread>
#include "services/enrichment/wikipedia_service.h"
using namespace boost;
std::string WikipediaEnrichmentService::FetchExtract(std::string_view query) {
const std::string cache_key(query);
// 1. Cache Lookup
if (const auto cache_it = this->extract_cache_.find(cache_key);
cache_it != this->extract_cache_.end()) {
spdlog::debug("WikipediaService: Cache hit for '{}'", cache_key);
return cache_it->second;
}
const std::string encoded = this->client_->EncodeURL(cache_key);
const std::string url = std::format(
"https://en.wikipedia.org/w/"
"api.php?action=query&titles={}&prop=extracts&explaintext=1&format=json",
encoded);
const std::string body = this->client_->Get(url);
{
using namespace std::literals::chrono_literals;
std::this_thread::sleep_for(1s);
}
// 2. Parse JSON
system::error_code ec;
json::value doc = json::parse(body, ec);
if (ec) {
spdlog::warn("WikipediaService: JSON parse error for '{}': {}", query,
ec.message());
return {};
}
// 3. Safe Extraction
const json::object* obj = doc.if_object();
if (obj == nullptr) {
spdlog::warn("WikipediaService: Expected root object for '{}'", query);
return {};
}
const json::value* query_ptr = obj->if_contains("query");
const json::value* pages_ptr =
((query_ptr != nullptr) && query_ptr->is_object())
? query_ptr->get_object().if_contains("pages")
: nullptr;
if ((pages_ptr == nullptr) || !pages_ptr->is_object()) {
spdlog::warn("WikipediaService: Missing query.pages for '{}'", query);
return {};
}
const json::object& pages = pages_ptr->get_object();
if (pages.empty()) {
spdlog::warn("WikipediaService: No pages returned for '{}'", query);
this->extract_cache_.emplace(cache_key, "");
return {};
}
// Wikipedia returns the page under a dynamic ID key; we just want the first
// one
const json::value& page_val = pages.begin()->value();
if (!page_val.is_object()) {
spdlog::warn("WikipediaService: Unexpected page format for '{}'", query);
return {};
}
const json::object& page = page_val.get_object();
// Handle 404/Missing status
if (page.contains("missing")) {
spdlog::warn("WikipediaService: Page '{}' does not exist", query);
this->extract_cache_.emplace(cache_key, "");
return {};
}
const json::value* extract_ptr = page.if_contains("extract");
if ((extract_ptr == nullptr) || !extract_ptr->is_string()) {
spdlog::warn("WikipediaService: No extract string found for '{}'", query);
this->extract_cache_.emplace(cache_key, "");
return {};
}
// 4. Success
std::string extract(extract_ptr->as_string());
spdlog::info("WikipediaService: Fetched {} chars for '{}'", extract.size(),
query);
this->extract_cache_.insert_or_assign(cache_key, extract);
return extract;
}
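
`FetchExtract` stores an empty string in the cache for missing pages, so a known-absent title is not requested again. The sketch below isolates that cache-aside pattern. It is a simplified assumption rather than the service itself: `ExtractCache` and its `Fetcher` callback are hypothetical, and unlike the code above it caches every result, including ones the service would leave uncached so they can be retried (parse errors, for example).

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>

// Hypothetical cache-aside wrapper around an arbitrary fetch callback.
class ExtractCache {
 public:
  using Fetcher = std::function<std::string(const std::string&)>;

  explicit ExtractCache(Fetcher fetch) : fetch_(std::move(fetch)) {
    assert(fetch_ != nullptr);
  }

  // Returns the cached value when present; otherwise fetches and stores it.
  // Empty results are stored too, which acts as a negative cache entry.
  std::string Get(std::string_view query) {
    const std::string key(query);
    if (const auto it = cache_.find(key); it != cache_.end()) {
      return it->second;
    }
    std::string extract = fetch_(key);
    cache_.insert_or_assign(key, extract);
    return extract;
  }

 private:
  Fetcher fetch_;
  std::unordered_map<std::string, std::string> cache_;
};
```

Injecting the fetch step as a callback also makes the cache testable without network access, in the same spirit as binding `WebClient` through the injector.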

View File

@@ -0,0 +1,58 @@
/**
* @file wikipedia/get_summary.cc
* @brief WikipediaEnrichmentService::GetLocationContext() implementation.
*/
#include <spdlog/spdlog.h>
#include <chrono>
#include <format>
#include <string>
#include <thread>
#include "services/enrichment/wikipedia_service.h"
std::string WikipediaEnrichmentService::GetLocationContext(const Location& loc) {
using namespace std::literals::chrono_literals;
if (!this->client_) {
spdlog::warn("Client is nullptr.");
return {};
}
std::string result;
// std::string region_query(loc.city);
// if (!loc.country.empty()) {
// region_query += loc.state_province;
// region_query += ", ";
// region_query += loc.country;
// }
constexpr std::string_view brewing_query = "brewing";
const std::string location_query =
std::format("{}, {}", loc.city, loc.iso3166_2);
const std::string beer_query = std::format("beer in {}", loc.country);
auto append_extract = [&result](const std::string& extract) -> void {
if (extract.empty()) {
return;
}
if (!result.empty()) {
result += "\n\n";
}
result += extract;
};
try {
append_extract(FetchExtract(brewing_query));
append_extract(FetchExtract(beer_query));
spdlog::info("Done fetching for {}. Sleeping for 10 seconds.",
location_query);
std::this_thread::sleep_for(10s);
} catch (const std::runtime_error& e) {
spdlog::debug("WikipediaService lookup failed for '{}': {}", location_query,
e.what());
}
return result;
}

View File

@@ -3,9 +3,10 @@
* @brief WikipediaService constructor implementation. * @brief WikipediaService constructor implementation.
*/ */
#include "services/wikipedia_service.h" #include "services/enrichment/wikipedia_service.h"
#include <utility> #include <utility>
WikipediaService::WikipediaService(std::unique_ptr<WebClient> client) WikipediaEnrichmentService::WikipediaEnrichmentService(
std::unique_ptr<WebClient> client)
: client_(std::move(client)) {} : client_(std::move(client)) {}

View File

@@ -0,0 +1,85 @@
/**
* @file services/prompt_directory.cc
* @brief PromptDirectory implementation: validates the directory at
* construction and loads named prompt files on demand with in-process caching.
*/
#include "services/prompting/prompt_directory.h"
#include <spdlog/spdlog.h>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>
#include <string_view>
// ---------------------------------------------------------------------------
// PromptDirectory
// ---------------------------------------------------------------------------
PromptDirectory::PromptDirectory(const std::filesystem::path& prompt_dir)
: prompt_dir_(prompt_dir) {
std::error_code ec;
// Scenario 4: directory must exist.
if (!std::filesystem::exists(prompt_dir_, ec) || ec) {
throw std::runtime_error(
"PromptDirectory: prompt directory does not exist: " +
prompt_dir_.string());
}
// Scenario 4: path must be a directory, not a file.
if (!std::filesystem::is_directory(prompt_dir_, ec) || ec) {
throw std::runtime_error(
"PromptDirectory: prompt directory path is not a directory: " +
prompt_dir_.string());
}
// Scenario 4: directory must be readable (probe with directory_iterator).
std::filesystem::directory_iterator probe(prompt_dir_, ec);
if (ec) {
throw std::runtime_error(
"PromptDirectory: prompt directory is not readable: " +
prompt_dir_.string() + " (" + ec.message() + ")");
}
spdlog::info("[PromptDirectory] Resolved prompt directory: {}",
prompt_dir_.string());
}
std::string PromptDirectory::Load(std::string_view key) {
const std::string key_str(key);
// Return cached content if already loaded during this run.
const auto cache_it = cache_.find(key_str);
if (cache_it != cache_.end()) {
return cache_it->second;
}
// Scenario 3: resolve <prompt_dir>/<key>.md and require it to exist.
const std::filesystem::path file_path =
prompt_dir_ / std::filesystem::path(key_str + ".md");
std::ifstream file(file_path);
if (!file.is_open()) {
throw std::runtime_error(
"PromptDirectory: prompt file not found for key '" + key_str +
"': " + file_path.string());
}
std::string content((std::istreambuf_iterator<char>(file)),
std::istreambuf_iterator<char>());
file.close();
if (content.empty()) {
throw std::runtime_error("PromptDirectory: prompt file for key '" +
key_str + "' is empty: " + file_path.string());
}
spdlog::info("[PromptDirectory] Loaded prompt '{}' from '{}' ({} chars)",
key_str, file_path.string(), content.size());
cache_.emplace(key_str, content);
return content;
}
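
The load-once behavior of `PromptDirectory::Load` can be reduced to a small standalone sketch. The class name `CachedPromptLoader` is hypothetical, and the sketch omits logging and the readability probe. It assumes the same `<dir>/<key>.md` layout as the code above.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>

// Hypothetical reduction of the prompt loader: validate once, cache per key.
class CachedPromptLoader {
 public:
  explicit CachedPromptLoader(std::filesystem::path dir)
      : dir_(std::move(dir)) {
    if (!std::filesystem::is_directory(dir_)) {
      throw std::runtime_error("not a directory: " + dir_.string());
    }
  }

  // Reads <dir>/<key>.md on first use and serves later calls from memory.
  std::string Load(std::string_view key) {
    assert(!key.empty());
    const std::string key_str(key);
    if (const auto it = cache_.find(key_str); it != cache_.end()) {
      return it->second;
    }
    const std::filesystem::path file_path = dir_ / (key_str + ".md");
    std::ifstream file(file_path);
    if (!file.is_open()) {
      throw std::runtime_error("prompt file not found: " + file_path.string());
    }
    std::string content((std::istreambuf_iterator<char>(file)),
                        std::istreambuf_iterator<char>());
    if (content.empty()) {
      throw std::runtime_error("prompt file is empty: " + file_path.string());
    }
    cache_.emplace(key_str, content);
    return content;
  }

 private:
  std::filesystem::path dir_;
  std::unordered_map<std::string, std::string> cache_;
};
```

One consequence of this design is that edits to a prompt file during a run are not picked up once the key has been loaded, which is usually what a batch pipeline wants.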

View File

@@ -1,24 +0,0 @@
/**
* @file services/sqlite/build_database_path.cc
* @brief SqliteExportService::BuildDatabasePath() implementation.
*/
#include <filesystem>
#include <string>
#include "services/sqlite_export_service.h"
std::filesystem::path SqliteExportService::BuildDatabasePath() const {
std::filesystem::path base_filename("biergarten_seed_" + run_timestamp_utc_ +
".sqlite");
std::filesystem::path candidate =
std::filesystem::current_path() / base_filename;
for (int suffix = 1; std::filesystem::exists(candidate); ++suffix) {
candidate = std::filesystem::current_path() /
std::filesystem::path("biergarten_seed_" + run_timestamp_utc_ +
"-" + std::to_string(suffix) + ".sqlite");
}
return candidate;
}

View File

@@ -1,28 +0,0 @@
/**
* @file services/sqlite/build_location_key.cc
* @brief SqliteExportService::BuildLocationKey() implementation.
*/
#include <iomanip>
#include <sstream>
#include "services/sqlite_export_service.h"
#include "services/sqlite_export_service_helpers.h"
constexpr int kLocationPrecision = 17;
std::string SqliteExportService::BuildLocationKey(const Location& location) {
std::ostringstream key_stream;
key_stream << location.city << '\n'
<< location.state_province << '\n'
<< location.iso3166_2 << '\n'
<< location.country << '\n'
<< location.iso3166_1 << '\n'
<< std::setprecision(kLocationPrecision) << location.latitude
<< '\n'
<< std::setprecision(kLocationPrecision) << location.longitude
<< '\n'
<< sqlite_export_service_internal::SerializeLocalLanguages(
location.local_languages);
return key_stream.str();
}

View File

@@ -5,8 +5,8 @@
#include <stdexcept> #include <stdexcept>
#include "services/sqlite_export_service.h" #include "services/database/sqlite_export_service.h"
#include "services/sqlite_export_service_helpers.h" #include "services/database/sqlite_export_service_helpers.h"
void SqliteExportService::Finalize() { void SqliteExportService::Finalize() {
if (db_handle_ == nullptr) { if (db_handle_ == nullptr) {
@@ -14,7 +14,8 @@ void SqliteExportService::Finalize() {
} }
try { try {
FinalizeStatements(); insert_brewery_stmt_.reset();
insert_location_stmt_.reset();
if (transaction_open_) { if (transaction_open_) {
sqlite_export_service_internal::ExecSql( sqlite_export_service_internal::ExecSql(
db_handle_, "COMMIT;", "Failed to commit SQLite transaction"); db_handle_, "COMMIT;", "Failed to commit SQLite transaction");

View File

@@ -1,11 +0,0 @@
/**
* @file services/sqlite/finalize_statements.cc
* @brief SqliteExportService::FinalizeStatements() implementation.
*/
#include "services/sqlite_export_service.h"
void SqliteExportService::FinalizeStatements() noexcept {
insert_brewery_stmt_.reset();
insert_location_stmt_.reset();
}

View File

@@ -0,0 +1,66 @@
#include "services/database/sqlite_connection_helpers.h"
#include <stdexcept>
namespace sqlite_export_service_internal {
void SqliteDatabaseDeleter::operator()(sqlite3* handle) const noexcept {
if (handle != nullptr) {
sqlite3_close(handle);
}
}
void SqliteStatementDeleter::operator()(
sqlite3_stmt* statement) const noexcept {
if (statement != nullptr) {
sqlite3_finalize(statement);
}
}
void ThrowSqliteError(sqlite3* db_handle, std::string_view action) {
const std::string message =
db_handle != nullptr ? sqlite3_errmsg(db_handle) : "unknown SQLite error";
throw std::runtime_error(std::string(action) + ": " + message);
}
SqliteDatabaseHandle OpenDatabase(const std::filesystem::path& path) {
sqlite3* raw_handle = nullptr;
const int result = sqlite3_open(path.string().c_str(), &raw_handle);
SqliteDatabaseHandle handle(raw_handle);
if (result != SQLITE_OK) {
const std::string message = raw_handle != nullptr
? sqlite3_errmsg(raw_handle)
: "unknown SQLite error";
throw std::runtime_error("Failed to open SQLite export database: " +
message);
}
return handle;
}
void ExecSql(const SqliteDatabaseHandle& db_handle, std::string_view sql,
const char* action) {
char* error_message = nullptr;
const std::string sql_text(sql);
const int result = sqlite3_exec(db_handle.get(), sql_text.c_str(), nullptr,
nullptr, &error_message);
if (result != SQLITE_OK) {
const std::string message = error_message != nullptr
? error_message
: sqlite3_errmsg(db_handle.get());
sqlite3_free(error_message);
throw std::runtime_error(std::string(action) + ": " + message);
}
}
void RollbackTransactionNoThrow(
const SqliteDatabaseHandle& db_handle) noexcept {
if (!db_handle) {
return;
}
sqlite3_exec(db_handle.get(), "ROLLBACK;", nullptr, nullptr, nullptr);
}
} // namespace sqlite_export_service_internal

View File

@@ -0,0 +1,98 @@
#include "services/database/sqlite_statement_helpers.h"
#include <boost/json.hpp>
#include <cstring>
#include <limits>
#include <memory>
#include <stdexcept>
#include "services/database/sqlite_connection_helpers.h"
namespace sqlite_export_service_internal {
SqliteStatementHandle PrepareStatement(const SqliteDatabaseHandle& db_handle,
std::string_view sql,
const char* action) {
sqlite3_stmt* raw_statement = nullptr;
const std::string sql_text(sql);
const int result = sqlite3_prepare_v2(db_handle.get(), sql_text.c_str(), -1,
&raw_statement, nullptr);
SqliteStatementHandle statement(raw_statement);
if (result != SQLITE_OK) {
ThrowSqliteError(db_handle.get(), action);
}
return statement;
}
void ResetStatement(SqliteStatementHandle& statement) {
if (statement != nullptr) {
sqlite3_reset(statement.get());
sqlite3_clear_bindings(statement.get());
}
}
void Bind(const SqliteStatementHandle& statement,
const BindParam<std::string_view>& param) {
const auto byte_count = param.value.size();
if (byte_count > static_cast<std::size_t>(std::numeric_limits<int>::max())) {
ThrowSqliteError(sqlite3_db_handle(statement.get()), param.action);
}
// Ownership of the buffer passes to SQLite, which frees it via this callback.
auto delete_char_array = [](void* data) noexcept {
// NOLINTNEXTLINE(cppcoreguidelines-owning-memory)
delete[] static_cast<char*>(data);
};
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays, modernize-avoid-c-arrays)
auto buffer = std::make_unique<char[]>(byte_count + 1);
std::memcpy(buffer.get(), param.value.data(), byte_count);
buffer[byte_count] = '\0';
char* raw_buffer = buffer.release();
// SQLite invokes the destructor callback even when the bind call fails, so
// the buffer must not be freed a second time on the error path.
if (sqlite3_bind_text(statement.get(), param.index, raw_buffer,
static_cast<int>(byte_count),
delete_char_array) != SQLITE_OK) {
ThrowSqliteError(sqlite3_db_handle(statement.get()), param.action);
}
}
void Bind(const SqliteStatementHandle& statement,
const BindParam<double>& param) {
if (sqlite3_bind_double(statement.get(), param.index, param.value) !=
SQLITE_OK) {
ThrowSqliteError(sqlite3_db_handle(statement.get()), param.action);
}
}
void Bind(const SqliteStatementHandle& statement,
const BindParam<sqlite3_int64>& param) {
if (sqlite3_bind_int64(statement.get(), param.index, param.value) !=
SQLITE_OK) {
ThrowSqliteError(sqlite3_db_handle(statement.get()), param.action);
}
}
void StepStatement(const SqliteDatabaseHandle& db_handle,
const SqliteStatementHandle& statement,
std::string_view action) {
if (sqlite3_step(statement.get()) != SQLITE_DONE) {
ThrowSqliteError(db_handle.get(), action);
}
}
sqlite3_int64 LastInsertRowId(const SqliteDatabaseHandle& db_handle) {
return sqlite3_last_insert_rowid(db_handle.get());
}
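std::string SerializeVector(const std::vector<std::string>& str_vec) {
// Reserve capacity only: array(n) would pre-fill the array with n nulls,
// which would then precede the appended strings in the output.
boost::json::array array;
array.reserve(str_vec.size());
for (const auto& s : str_vec) {
array.emplace_back(s);
}
return boost::json::serialize(array);
}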
} // namespace sqlite_export_service_internal


@@ -8,8 +8,56 @@
 #include <stdexcept>
 #include <string>
-#include "services/sqlite_export_service.h"
-#include "services/sqlite_export_service_helpers.h"
+#include "services/database/sqlite_export_service.h"
+#include "services/database/sqlite_export_service_helpers.h"
+std::filesystem::path SqliteExportService::BuildDatabasePath() const {
+std::filesystem::path base_filename("biergarten_seed_" + run_timestamp_utc_ +
+".sqlite");
+std::filesystem::path candidate = output_path_ / base_filename;
+for (int suffix = 1; std::filesystem::exists(candidate); ++suffix) {
+candidate = output_path_ /
+std::filesystem::path("biergarten_seed_" + run_timestamp_utc_ +
+"-" + std::to_string(suffix) + ".sqlite");
+}
+return candidate;
+}
+void SqliteExportService::InitializeSchema() const {
+sqlite_export_service_internal::ExecSql(
+db_handle_, sqlite_export_service_internal::kCreateLocationsTableSql,
+"Failed to create SQLite locations table");
+sqlite_export_service_internal::ExecSql(
+db_handle_, sqlite_export_service_internal::kCreateBreweriesTableSql,
+"Failed to create SQLite breweries table");
+}
+void SqliteExportService::PrepareStatements() {
+insert_location_stmt_ = sqlite_export_service_internal::PrepareStatement(
+db_handle_, sqlite_export_service_internal::kInsertLocationSql,
+"Failed to prepare SQLite location insert statement");
+insert_brewery_stmt_ = sqlite_export_service_internal::PrepareStatement(
+db_handle_, sqlite_export_service_internal::kInsertBrewerySql,
+"Failed to prepare SQLite brewery insert statement");
+}
+void SqliteExportService::RollbackAndCloseNoThrow() noexcept {
+if (db_handle_ == nullptr) {
+return;
+}
+if (transaction_open_) {
+sqlite_export_service_internal::RollbackTransactionNoThrow(db_handle_);
+transaction_open_ = false;
+}
+insert_brewery_stmt_.reset();
+insert_location_stmt_.reset();
+db_handle_.reset();
+location_cache_.clear();
+}
 void SqliteExportService::Initialize() {
 if (db_handle_ != nullptr) {


@@ -1,16 +0,0 @@
/**
* @file services/sqlite/initialize_schema.cc
* @brief SqliteExportService::InitializeSchema() implementation.
*/
#include "services/sqlite_export_service.h"
#include "services/sqlite_export_service_helpers.h"
void SqliteExportService::InitializeSchema() {
sqlite_export_service_internal::ExecSql(
db_handle_, sqlite_export_service_internal::kCreateLocationsTableSql,
"Failed to create SQLite locations table");
sqlite_export_service_internal::ExecSql(
db_handle_, sqlite_export_service_internal::kCreateBreweriesTableSql,
"Failed to create SQLite breweries table");
}


@@ -1,16 +0,0 @@
/**
* @file services/sqlite/prepare_statements.cc
* @brief SqliteExportService::PrepareStatements() implementation.
*/
#include "services/sqlite_export_service.h"
#include "services/sqlite_export_service_helpers.h"
void SqliteExportService::PrepareStatements() {
insert_location_stmt_ = sqlite_export_service_internal::PrepareStatement(
db_handle_, sqlite_export_service_internal::kInsertLocationSql,
"Failed to prepare SQLite location insert statement");
insert_brewery_stmt_ = sqlite_export_service_internal::PrepareStatement(
db_handle_, sqlite_export_service_internal::kInsertBrewerySql,
"Failed to prepare SQLite brewery insert statement");
}


@@ -3,13 +3,33 @@
 * @brief SqliteExportService::ProcessRecord() implementation.
 */
+#include <iomanip>
+#include <sstream>
 #include <stdexcept>
 #include <string>
-#include "services/sqlite_export_service.h"
-#include "services/sqlite_export_service_helpers.h"
+#include "services/database/sqlite_export_service.h"
+#include "services/database/sqlite_export_service_helpers.h"
-void SqliteExportService::ProcessRecord(const GeneratedBrewery& brewery) {
+constexpr int kLocationPrecision = 17;
+std::string SqliteExportService::BuildLocationKey(const Location& location) {
+std::ostringstream key_stream;
+key_stream << location.city << '\n'
+<< location.state_province << '\n'
+<< location.iso3166_2 << '\n'
+<< location.country << '\n'
+<< location.iso3166_1 << '\n'
+<< std::setprecision(kLocationPrecision) << location.latitude
+<< '\n'
+<< std::setprecision(kLocationPrecision) << location.longitude
+<< '\n'
+<< sqlite_export_service_internal::SerializeVector(
+location.local_languages);
+return key_stream.str();
+}
+uint64_t SqliteExportService::ProcessRecord(const GeneratedBrewery& brewery) {
 if (db_handle_ == nullptr || !transaction_open_) {
 throw std::runtime_error("SQLite export service is not initialized");
 }
@@ -22,44 +42,60 @@ void SqliteExportService::ProcessRecord(const GeneratedBrewery& brewery) {
 location_id = cached_location->second;
 } else {
 const std::string local_languages_json =
-sqlite_export_service_internal::SerializeLocalLanguages(
+sqlite_export_service_internal::SerializeVector(
 brewery.location.local_languages);
-sqlite_export_service_internal::BindText(
+sqlite_export_service_internal::Bind(
 insert_location_stmt_,
-sqlite_export_service_internal::kLocationCityBindIndex,
-brewery.location.city, "Failed to bind SQLite location city");
-sqlite_export_service_internal::BindText(
+sqlite_export_service_internal::BindParam<std::string_view>{
+.index = sqlite_export_service_internal::kLocationCityBindIndex,
+.value = brewery.location.city,
+.action = "Failed to bind SQLite location city"});
+sqlite_export_service_internal::Bind(
 insert_location_stmt_,
-sqlite_export_service_internal::kLocationStateProvinceBindIndex,
-brewery.location.state_province,
-"Failed to bind SQLite location state/province");
-sqlite_export_service_internal::BindText(
+sqlite_export_service_internal::BindParam<std::string_view>{
+.index =
+sqlite_export_service_internal::kLocationStateProvinceBindIndex,
+.value = brewery.location.state_province,
+.action = "Failed to bind SQLite location state/province"});
+sqlite_export_service_internal::Bind(
 insert_location_stmt_,
-sqlite_export_service_internal::kLocationIso31662BindIndex,
-brewery.location.iso3166_2,
-"Failed to bind SQLite location ISO 3166-2 code");
-sqlite_export_service_internal::BindText(
+sqlite_export_service_internal::BindParam<std::string_view>{
+.index = sqlite_export_service_internal::kLocationIso31662BindIndex,
+.value = brewery.location.iso3166_2,
+.action = "Failed to bind SQLite location ISO 3166-2 code"});
+sqlite_export_service_internal::Bind(
 insert_location_stmt_,
-sqlite_export_service_internal::kLocationCountryBindIndex,
-brewery.location.country, "Failed to bind SQLite location country");
-sqlite_export_service_internal::BindText(
+sqlite_export_service_internal::BindParam<std::string_view>{
+.index = sqlite_export_service_internal::kLocationCountryBindIndex,
+.value = brewery.location.country,
+.action = "Failed to bind SQLite location country"});
+sqlite_export_service_internal::Bind(
 insert_location_stmt_,
-sqlite_export_service_internal::kLocationIso31661BindIndex,
-brewery.location.iso3166_1,
-"Failed to bind SQLite location ISO 3166-1 code");
-sqlite_export_service_internal::BindText(
+sqlite_export_service_internal::BindParam<std::string_view>{
+.index = sqlite_export_service_internal::kLocationIso31661BindIndex,
+.value = brewery.location.iso3166_1,
+.action = "Failed to bind SQLite location ISO 3166-1 code"});
+sqlite_export_service_internal::Bind(
 insert_location_stmt_,
-sqlite_export_service_internal::kLocationLanguagesBindIndex,
-local_languages_json, "Failed to bind SQLite location languages");
-sqlite_export_service_internal::BindDouble(
+sqlite_export_service_internal::BindParam<std::string_view>{
+.index =
+sqlite_export_service_internal::kLocationLanguagesBindIndex,
+.value = local_languages_json,
+.action = "Failed to bind SQLite location languages"});
+sqlite_export_service_internal::Bind(
 insert_location_stmt_,
-sqlite_export_service_internal::kLocationLatitudeBindIndex,
-brewery.location.latitude, "Failed to bind SQLite location latitude");
-sqlite_export_service_internal::BindDouble(
+sqlite_export_service_internal::BindParam{
+.index = sqlite_export_service_internal::kLocationLatitudeBindIndex,
+.value = brewery.location.latitude,
+.action = "Failed to bind SQLite location latitude"});
+sqlite_export_service_internal::Bind(
 insert_location_stmt_,
-sqlite_export_service_internal::kLocationLongitudeBindIndex,
-brewery.location.longitude, "Failed to bind SQLite location longitude");
+sqlite_export_service_internal::BindParam{
+.index =
+sqlite_export_service_internal::kLocationLongitudeBindIndex,
+.value = brewery.location.longitude,
+.action = "Failed to bind SQLite location longitude"});
 sqlite_export_service_internal::StepStatement(
 db_handle_, insert_location_stmt_,
@@ -70,31 +106,43 @@ void SqliteExportService::ProcessRecord(const GeneratedBrewery& brewery) {
 sqlite_export_service_internal::ResetStatement(insert_location_stmt_);
 }
-sqlite_export_service_internal::BindInt64(
+sqlite_export_service_internal::Bind(
 insert_brewery_stmt_,
-sqlite_export_service_internal::kBreweryLocationIdBindIndex, location_id,
-"Failed to bind SQLite brewery location id");
-sqlite_export_service_internal::BindText(
+sqlite_export_service_internal::BindParam<sqlite3_int64>{
+.index = sqlite_export_service_internal::kBreweryLocationIdBindIndex,
+.value = location_id,
+.action = "Failed to bind SQLite brewery location id"});
+sqlite_export_service_internal::Bind(
 insert_brewery_stmt_,
-sqlite_export_service_internal::kBreweryEnglishNameBindIndex,
-brewery.brewery.name_en, "Failed to bind SQLite brewery English name");
-sqlite_export_service_internal::BindText(
+sqlite_export_service_internal::BindParam<std::string_view>{
+.index = sqlite_export_service_internal::kBreweryEnglishNameBindIndex,
+.value = brewery.brewery.name_en,
+.action = "Failed to bind SQLite brewery English name"});
+sqlite_export_service_internal::Bind(
 insert_brewery_stmt_,
-sqlite_export_service_internal::kBreweryEnglishDescriptionBindIndex,
-brewery.brewery.description_en,
-"Failed to bind SQLite brewery English description");
-sqlite_export_service_internal::BindText(
+sqlite_export_service_internal::BindParam<std::string_view>{
+.index = sqlite_export_service_internal::
+kBreweryEnglishDescriptionBindIndex,
+.value = brewery.brewery.description_en,
+.action = "Failed to bind SQLite brewery English description"});
+sqlite_export_service_internal::Bind(
 insert_brewery_stmt_,
-sqlite_export_service_internal::kBreweryLocalNameBindIndex,
-brewery.brewery.name_local, "Failed to bind SQLite brewery local name");
-sqlite_export_service_internal::BindText(
+sqlite_export_service_internal::BindParam<std::string_view>{
+.index = sqlite_export_service_internal::kBreweryLocalNameBindIndex,
+.value = brewery.brewery.name_local,
+.action = "Failed to bind SQLite brewery local name"});
+sqlite_export_service_internal::Bind(
 insert_brewery_stmt_,
-sqlite_export_service_internal::kBreweryLocalDescriptionBindIndex,
-brewery.brewery.description_local,
-"Failed to bind SQLite brewery local description");
+sqlite_export_service_internal::BindParam<std::string_view>{
+.index =
+sqlite_export_service_internal::kBreweryLocalDescriptionBindIndex,
+.value = brewery.brewery.description_local,
+.action = "Failed to bind SQLite brewery local description"});
 sqlite_export_service_internal::StepStatement(
 db_handle_, insert_brewery_stmt_, "Failed to insert SQLite brewery row");
 sqlite_export_service_internal::ResetStatement(insert_brewery_stmt_);
+return sqlite_export_service_internal::LastInsertRowId(db_handle_);
 }
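
Review note on `kLocationPrecision = 17`: the value matches `std::numeric_limits<double>::max_digits10`, the number of significant digits needed for a `double` to survive a round trip through text. The sketch below is not part of the diff. It uses two hypothetical helpers, `FormatCoordinate` and `NextDouble`, to show why a lower precision could make two distinct coordinates collapse into the same location cache key.

```cpp
#include <cmath>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Hypothetical helper (not in the diff): formats a double with the given
// number of significant digits, the same way BuildLocationKey streams
// latitude and longitude into the key.
std::string FormatCoordinate(double value, int precision) {
  std::ostringstream stream;
  stream << std::setprecision(precision) << value;
  return stream.str();
}

// Hypothetical helper: the next representable double above `value`.
double NextDouble(double value) {
  return std::nextafter(value, std::numeric_limits<double>::infinity());
}
```

With these helpers, two adjacent doubles format to different strings at precision 17 but to the same string at the stream default of 6, which is the collision the constant appears intended to avoid.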


@@ -1,21 +0,0 @@
/**
* @file services/sqlite/rollback_and_close_no_throw.cc
* @brief SqliteExportService::RollbackAndCloseNoThrow() implementation.
*/
#include "services/sqlite_export_service.h"
void SqliteExportService::RollbackAndCloseNoThrow() noexcept {
if (db_handle_ == nullptr) {
return;
}
if (transaction_open_) {
sqlite_export_service_internal::RollbackTransactionNoThrow(db_handle_);
transaction_open_ = false;
}
FinalizeStatements();
db_handle_.reset();
location_cache_.clear();
}


@@ -3,12 +3,13 @@
 * @brief SqliteExportService constructor and destructor implementation.
 */
-#include "services/sqlite_export_service.h"
+#include "services/database/sqlite_export_service.h"
 #include <memory>
-SqliteExportService::SqliteExportService()
-: date_time_provider_(std::make_unique<SystemDateTimeProvider>()) {}
+SqliteExportService::SqliteExportService(const ApplicationOptions& options)
+: date_time_provider_(std::make_unique<SystemDateTimeProvider>()),
+output_path_(options.pipeline.output_path) {}
 SqliteExportService::~SqliteExportService() {
 if (db_handle_ != nullptr) {


@@ -1,61 +0,0 @@
/**
* @file wikipedia/fetch_extract.cc
* @brief WikipediaService::FetchExtract() implementation.
*/
#include <spdlog/spdlog.h>
#include <boost/json.hpp>
#include <string>
#include <string_view>
#include "services/wikipedia_service.h"
std::string WikipediaService::FetchExtract(std::string_view query) {
const std::string cache_key(query);
const auto cache_it = this->extract_cache_.find(cache_key);
if (cache_it != this->extract_cache_.end()) {
return cache_it->second;
}
const std::string encoded = this->client_->UrlEncode(cache_key);
const std::string url =
"https://en.wikipedia.org/w/api.php?action=query&titles=" + encoded +
"&prop=extracts&explaintext=1&format=json";
const std::string body = this->client_->Get(url);
boost::system::error_code parse_error;
boost::json::value doc = boost::json::parse(body, parse_error);
if (!parse_error && doc.is_object()) {
try {
auto& pages = doc.at("query").at("pages").get_object();
if (!pages.empty()) {
auto& page = pages.begin()->value().get_object();
if (page.contains("extract") && page.at("extract").is_string()) {
const std::string_view extract_view = page.at("extract").as_string();
std::string extract(extract_view);
spdlog::debug("WikipediaService fetched {} chars for '{}'",
extract.size(), query);
this->extract_cache_.emplace(cache_key, extract);
return extract;
}
}
this->extract_cache_.emplace(cache_key, std::string{});
} catch (const std::exception& e) {
spdlog::warn(
"WikipediaService: failed to parse response structure for '{}': "
"{}",
query, e.what());
return {};
}
} else if (parse_error) {
spdlog::warn("WikipediaService: JSON parse error for '{}': {}", query,
parse_error.message());
}
return {};
}


@@ -1,47 +0,0 @@
/**
* @file wikipedia/get_summary.cc
* @brief WikipediaService::GetLocationContext() implementation.
*/
#include <spdlog/spdlog.h>
#include <string>
#include "services/wikipedia_service.h"
std::string WikipediaService::GetLocationContext(const Location& loc) {
if (!client_) {
return {};
}
std::string result;
std::string region_query(loc.city);
if (!loc.country.empty()) {
region_query += ", ";
region_query += loc.country;
}
const std::string beer_query = "beer in " + loc.country;
const std::string city_beer_query = "beer in " + loc.city;
auto append_extract = [&result](const std::string& extract) -> void {
if (extract.empty()) {
return;
}
if (!result.empty()) {
result += "\n\n";
}
result += extract;
};
try {
append_extract(FetchExtract(region_query));
append_extract(FetchExtract(beer_query));
append_extract(FetchExtract(city_beer_query));
} catch (const std::runtime_error& e) {
spdlog::debug("WikipediaService lookup failed for '{}': {}", region_query,
e.what());
}
return result;
}


@@ -1,19 +0,0 @@
/**
* @file web_client/curl_global_state.cc
* @brief CurlGlobalState constructor and destructor implementation.
*/
#include <curl/curl.h>
#include <stdexcept>
#include "web_client/curl_web_client.h"
CurlGlobalState::CurlGlobalState() {
if (curl_global_init(CURL_GLOBAL_DEFAULT) != CURLE_OK) {
throw std::runtime_error(
"[CURLWebClient] Failed to initialize libcurl globally");
}
}
CurlGlobalState::~CurlGlobalState() { curl_global_cleanup(); }


@@ -1,86 +0,0 @@
/**
* @file web_client/curl_web_client_get.cc
* @brief CURLWebClient::Get() implementation.
*/
#include <curl/curl.h>
#include <cstdint>
#include <limits>
#include <memory>
#include <stdexcept>
#include <string>
#include "web_client/curl_web_client.h"
using CurlHandle = std::unique_ptr<CURL, decltype(&curl_easy_cleanup)>;
static constexpr long kConnectionTimeout = 10;
static constexpr long kRequestTimeout = 30;
static constexpr int32_t kOkHttpStatus = 200;
static CurlHandle CreateHandle() {
CURL* handle = curl_easy_init();
if (handle == nullptr) {
throw std::runtime_error(
"[CURLWebClient] Failed to initialize libcurl handle");
}
return {handle, &curl_easy_cleanup};
}
static void SetCommonGetOptions(CURL* curl, const std::string& url) {
curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
curl_easy_setopt(curl, CURLOPT_USERAGENT, "biergarten-pipeline/0.1.0");
curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
curl_easy_setopt(curl, CURLOPT_MAXREDIRS, 5L);
curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT, kConnectionTimeout);
curl_easy_setopt(curl, CURLOPT_TIMEOUT, kRequestTimeout);
curl_easy_setopt(curl, CURLOPT_ACCEPT_ENCODING, "gzip");
}
// curl write callback that appends response data into a std::string
static size_t WriteCallbackString(void* contents, const size_t size,
const size_t nmemb, void* userp) {
const size_t real_size = size * nmemb;
auto* str = static_cast<std::string*>(userp);
str->append(static_cast<char*>(contents), real_size);
return real_size;
}
std::string CURLWebClient::Get(const std::string& url) {
const CurlHandle curl = CreateHandle();
std::string response_string;
SetCommonGetOptions(curl.get(), url);
curl_easy_setopt(curl.get(), CURLOPT_WRITEFUNCTION, WriteCallbackString);
curl_easy_setopt(curl.get(), CURLOPT_WRITEDATA, &response_string);
CURLcode curl_result = curl_easy_perform(curl.get());
if (curl_result != CURLE_OK) {
const auto error = std::string("[CURLWebClient] GET failed: ") +
curl_easy_strerror(curl_result);
throw std::runtime_error(error);
}
long curl_http_code = 0;
curl_easy_getinfo(curl.get(), CURLINFO_RESPONSE_CODE, &curl_http_code);
if (curl_http_code < std::numeric_limits<int32_t>::min() ||
curl_http_code > std::numeric_limits<int32_t>::max()) {
throw std::runtime_error("[CURLWebClient] Invalid HTTP status code: " +
std::to_string(curl_http_code));
}
const int32_t http_code = static_cast<int32_t>(curl_http_code);
if (http_code != kOkHttpStatus) {
const std::string error = "[CURLWebClient] HTTP error " +
std::to_string(http_code) + " for URL " + url;
throw std::runtime_error(error);
}
return response_string;
}


@@ -1,24 +0,0 @@
/**
* @file web_client/curl_web_client_url_encode.cc
* @brief CURLWebClient::UrlEncode() implementation.
*/
#include <curl/curl.h>
#include <stdexcept>
#include <string>
#include "web_client/curl_web_client.h"
std::string CURLWebClient::UrlEncode(const std::string& value) {
// A NULL handle is fine for UTF-8 encoding according to libcurl docs.
char* output = curl_easy_escape(nullptr, value.c_str(), 0);
if (!output) {
throw std::runtime_error("[CURLWebClient] curl_easy_escape failed");
}
std::string result(output);
curl_free(output);
return result;
}


@@ -0,0 +1,68 @@
/**
* @file web_client/http_web_client.cc
* @brief cpp-httplib implementation of WebClient.
*/
#include "web_client/http_web_client.h"
#include <httplib.h>
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include "spdlog/spdlog.h"
namespace {
constexpr time_t kConnectionTimeoutSeconds = 5;
constexpr time_t kReadTimeoutSeconds = 10;
constexpr int kSuccessMin = 200;
constexpr int kSuccessMax = 300;
const std::regex kUrlRegex(
R"(^(https?://[^/?#]+)(/[^?#]*(?:\?[^#]*)?(?:#.*)?)?)");
std::pair<std::string, std::string> SplitUrl(const std::string& url) {
std::smatch match;
if (!std::regex_match(url, match, kUrlRegex)) {
throw std::invalid_argument("[HttpWebClient] Malformed URL: " + url);
}
return {match[1].str(), match[2].matched ? match[2].str() : "/"};
}
} // namespace
std::string HttpWebClient::Get(const std::string& url) {
const auto [origin, path] = SplitUrl(url);
httplib::Client client(origin);
client.set_follow_location(true);
client.set_connection_timeout(kConnectionTimeoutSeconds);
client.set_read_timeout(kReadTimeoutSeconds);
client.set_default_headers({
{"Accept", "application/json"},
{"User-Agent", "biergarten-pipeline/1.0"}
});
const httplib::Result result = client.Get(path);
if (!result) {
throw std::runtime_error(
"[HttpWebClient] Request failed for URL: " + url +
": " + httplib::to_string(result.error()));
}
if (result->status < kSuccessMin || result->status >= kSuccessMax) {
spdlog::error("[HttpWebClient] HTTP {} for URL: {}", result->status, url);
throw std::runtime_error(
"[HttpWebClient] HTTP " + std::to_string(result->status) +
" for URL: " + url);
}
return result->body;
}
std::string HttpWebClient::EncodeURL(const std::string& value) {
return httplib::encode_uri_component(value);
}
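
Review note on `SplitUrl`: the standalone sketch below copies the regex from this file to make its behavior easier to check in isolation. `kSketchUrlRegex`, `SplitUrlSketch`, and `IsRejected` are names introduced here for illustration only and are not part of the diff.

```cpp
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>

// Same pattern as kUrlRegex in the diff: group 1 is scheme plus authority,
// group 2 is the optional path together with any query string and fragment.
const std::regex kSketchUrlRegex(
    R"(^(https?://[^/?#]+)(/[^?#]*(?:\?[^#]*)?(?:#.*)?)?)");

// Hypothetical standalone copy of SplitUrl. Returns {origin, path}, with the
// path defaulting to "/" when the URL has none.
std::pair<std::string, std::string> SplitUrlSketch(const std::string& url) {
  std::smatch match;
  if (!std::regex_match(url, match, kSketchUrlRegex)) {
    throw std::invalid_argument("Malformed URL: " + url);
  }
  return {match[1].str(), match[2].matched ? match[2].str() : "/"};
}

// Hypothetical helper: reports whether the sketch rejects a URL.
bool IsRejected(const std::string& url) {
  try {
    SplitUrlSketch(url);
    return false;
  } catch (const std::invalid_argument&) {
    return true;
  }
}
```

Two properties of the pattern may be worth a second look in review. A URL with a query string but no path, such as `https://example.com?x=1`, is rejected as malformed. Any fragment is kept in the second group, so it would be passed along as part of the request path.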