5 Commits

| Author   | SHA1       | Message                                                   | Date                       |
| -------- | ---------- | --------------------------------------------------------- | -------------------------- |
| Aaron Po | 271c6fa99f | update docs                                               | 2026-05-01 18:30:38 -04:00 |
| Aaron Po | 316fda1775 | codebase formatting                                       | 2026-05-01 17:40:37 -04:00 |
| Aaron Po | 91e18888fe | readability updates: remove magic numbers, update comments | 2026-05-01 17:38:16 -04:00 |
| Aaron Po | 9051f55114 | add prompt dir app option                                 | 2026-05-01 12:25:05 -04:00 |
| Aaron Po | 01849062d5 | Refactor ApplicationOptions to separate config concerns   | 2026-05-01 00:40:21 -04:00 |

94 changed files with 1395 additions and 2814 deletions

2
.gitattributes vendored

@@ -1 +1 @@
archive/** linguist-vendored archive/* linguist-vendored


@@ -18,7 +18,6 @@ descriptions via a local GGUF model or a deterministic mock.
- [Build](#build) - [Build](#build)
- [Model](#model) - [Model](#model)
- [Run](#run) - [Run](#run)
- [Docker / RunPod](#docker--runpod)
- [Architecture](#architecture) - [Architecture](#architecture)
- [Pipeline Stages](#pipeline-stages) - [Pipeline Stages](#pipeline-stages)
- [Key Components](#key-components) - [Key Components](#key-components)
@@ -52,7 +51,7 @@ step.
### Build ### Build
Requirements: C++20 compiler, CMake 3.31+, OpenSSL, Boost (JSON and Requirements: C++20 compiler, CMake 3.24+, libcurl, Boost (JSON and
ProgramOptions). SQLite is fetched from the upstream amalgamation, so no system ProgramOptions). SQLite is fetched from the upstream amalgamation, so no system
SQLite package is required. SQLite package is required.
@@ -61,16 +60,6 @@ cmake -S . -B build
cmake --build build cmake --build build
``` ```
CMake automatically detects whether a compatible llama.cpp installation is
present on the system (`libllama`, `libggml`, `libggml-base`, and `llama.h`
visible on the default search paths). If found, it links against those
libraries and skips the FetchContent build. If not found, it fetches and builds
llama.cpp from source at tag `b9012`. No additional flags are required in
either case.
Metal is enabled automatically on Apple Silicon. CUDA or HIP/ROCm is detected
automatically on Linux when the relevant toolkit is present.
### Model ### Model
> Skip this step if you only need `--mocked`. > Skip this step if you only need `--mocked`.
@@ -85,27 +74,20 @@ curl -L \
### Run ### Run
Run from `build/` so the copied `locations.json` and `prompts/` are available. Run from `build/` so the copied `locations.json` and `prompts/` are available.
Each run writes a fresh dated SQLite file such as Each run also writes a fresh dated SQLite file such as
`biergarten_seed_2026-04-19T15-30-45.123456Z.sqlite` into the working directory. `biergarten_seed_2026-04-19T15-30-45.123456Z.sqlite` into the working directory.
```bash ```bash
./biergarten-pipeline --mocked ./biergarten-pipeline --mocked
./biergarten-pipeline --model models/google_gemma-4-E4B-it-Q6_K.gguf --temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1
./biergarten-pipeline \
--model ../models/google_gemma-4-E4B-it-Q6_K.gguf \
--prompt-dir prompts \
--temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1
``` ```
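The README text in this hunk says each run writes a dated file such as `biergarten_seed_2026-04-19T15-30-45.123456Z.sqlite`. The following is a minimal sketch of how a name in that shape could be produced, assuming UTC time with microsecond precision. The function name `BuildSeedFileName` is hypothetical and is not taken from the repository; the project's own `build_database_path.cc` may work differently.

```cpp
#include <chrono>
#include <cstdio>
#include <ctime>
#include <string>

// Hypothetical helper (not from the repository): formats a UTC time point as
// biergarten_seed_YYYY-MM-DDTHH-MM-SS.ffffffZ.sqlite.
// Assumes a time point at or after the Unix epoch and a POSIX platform
// (gmtime_r), which matches the Linux/macOS targets named in the build file.
std::string BuildSeedFileName(std::chrono::system_clock::time_point tp) {
    using namespace std::chrono;

    // Split the time point into whole seconds and the microsecond remainder.
    const auto since_epoch = duration_cast<microseconds>(tp.time_since_epoch());
    const auto whole_seconds = duration_cast<seconds>(since_epoch);
    const long long micros =
        static_cast<long long>((since_epoch - whole_seconds).count());

    // Break the seconds down into calendar fields in UTC.
    const std::time_t tt = static_cast<std::time_t>(whole_seconds.count());
    std::tm utc{};
    gmtime_r(&tt, &utc);

    // Colons are replaced by dashes so the name is safe on common filesystems.
    char buffer[96];
    std::snprintf(buffer, sizeof(buffer),
                  "biergarten_seed_%04d-%02d-%02dT%02d-%02d-%02d.%06lldZ.sqlite",
                  utc.tm_year + 1900, utc.tm_mon + 1, utc.tm_mday,
                  utc.tm_hour, utc.tm_min, utc.tm_sec, micros);
    return std::string(buffer);
}
```

Passing the time point in, rather than reading the clock inside the function, keeps the sketch deterministic under test; this mirrors the `DateTimeProvider` dependency shown in the class diagram, though the real wiring is not visible in this diff.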
#### CLI Flags #### CLI Flags
| Flag | Purpose | | Flag | Purpose |
| --------------- | ---------------------------------------------------------------------------------------------------- | | --------------- | ------------------------------------------------------- |
| `--mocked` | Deterministic mock generator, no model required. | | `--mocked` | Deterministic mock generator, no model required. |
| `--model, -m` | Path to a GGUF file. Required unless `--mocked` is set. | | `--model, -m` | Path to a GGUF file. Required unless `--mocked` is set. |
| `--prompt-dir` | Directory containing prompt files (e.g. `BREWERY_GENERATION.md`). Required unless `--mocked` is set. |
| `--output, -o` | Directory for generated SQLite artifacts. Default: `output`. |
| `--log-path` | Path for application logs. Default: `pipeline.log`. |
| `--temperature` | Sampling temperature. Default: `1.0`. | | `--temperature` | Sampling temperature. Default: `1.0`. |
| `--top-p` | Nucleus sampling. Default: `0.95`. | | `--top-p` | Nucleus sampling. Default: `0.95`. |
| `--top-k` | Top-k sampling. Default: `64`. | | `--top-k` | Top-k sampling. Default: `64`. |
@@ -118,91 +100,7 @@ error before the pipeline starts. Sampling flags are ignored when `--mocked` is
set. set.
The post-build step copies `prompts/` into `build/prompts/`. Rebuild after The post-build step copies `prompts/` into `build/prompts/`. Rebuild after
editing any prompt file. editing `prompts/system.md`.
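The `--prompt-dir` flag and the `PromptDirectory` class that appear in this diff (a `prompt_dir_` path, a `cache_` map, and a `Load(key)` method) suggest prompts are read from a directory and cached by key. A minimal sketch of that idea follows. It assumes a key such as `BREWERY_GENERATION` maps to a file named `BREWERY_GENERATION.md`, and that a missing file is reported by throwing; both are assumptions, since the real implementation in `src/services/prompt_directory.cc` is not shown here.

```cpp
#include <filesystem>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_map>

// Sketch of a caching prompt loader shaped like the diagram's PromptDirectory.
// The key-to-filename rule (key + ".md") and the exception on a missing file
// are assumptions for illustration, not confirmed project behaviour.
class PromptDirectory {
public:
    explicit PromptDirectory(const std::filesystem::path& prompt_dir)
        : prompt_dir_(prompt_dir) {}

    // Returns the prompt text for `key`, reading the file only on first use.
    std::string Load(std::string_view key) {
        const std::string name(key);
        if (auto it = cache_.find(name); it != cache_.end()) {
            return it->second;  // served from the cache, no disk access
        }

        const std::filesystem::path file = prompt_dir_ / (name + ".md");
        std::ifstream in(file, std::ios::binary);
        if (!in) {
            throw std::runtime_error("prompt file not found: " + file.string());
        }

        std::ostringstream contents;
        contents << in.rdbuf();  // read the whole file into memory
        return cache_.emplace(name, contents.str()).first->second;
    }

private:
    std::filesystem::path prompt_dir_;
    std::unordered_map<std::string, std::string> cache_;
};
```

With a cache like this, edits to a prompt file are not picked up by an already constructed loader; a fresh process reads the file again.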
---
## Docker / RunPod
The `tooling/pipeline/runpod/` directory contains a GPU-ready container
configuration for running the pipeline on RunPod or any Docker host with an
NVIDIA GPU.
### How it works
The container uses a two-stage build. The first stage pulls prebuilt
`libllama`, `libggml`, and backend plugin libraries (including `libggml-cuda.so`
and the CPU variant plugins) from `ghcr.io/ggml-org/llama.cpp:full-cuda`. The
second stage copies those libraries into `/usr/local/lib` and runs `ldconfig` so
the dynamic linker and `dlopen` calls from `ggml_backend_load_all()` can resolve
the CUDA backend plugin at runtime. llama.cpp headers are cloned at the matching
tag and installed into `/usr/local/include`. CMake auto-detects both and skips
the FetchContent source build entirely, keeping image build times short.
`GGML_BACKEND_PATH` is set to `/usr/local/lib` so llama.cpp knows where to scan
for backend plugins.
### Build the image
Run from the `tooling/pipeline/` directory (the CMake project root), not from
inside `runpod/`, so the `COPY . .` step picks up the full project context.
```bash
docker build -t biergarten-pipeline:latest -f runpod/Dockerfile .
```
To monitor the full build output and confirm CMake selects the system llama.cpp:
```bash
docker build \
--progress=plain \
--no-cache \
-t biergarten-pipeline:latest \
-f runpod/Dockerfile \
. 2>&1 | tee build.log
```
Look for `[biergarten] Found system llama.cpp — skipping FetchContent` in the
output to confirm the fast path was taken.
### Run in mocked mode
No model or GPU required. Useful for validating the pipeline logic and SQLite
export path.
```bash
docker run --rm \
-e BIERGARTEN_MODE=mocked \
-v "$PWD/output:/workspace/output" \
-v "$PWD/logs:/workspace/logs" \
biergarten-pipeline:latest
```
### Run in live mode
Mount your GGUF model before starting. The container validates the model path
before launching the binary.
```bash
docker run --rm \
--runtime=nvidia \
-e BIERGARTEN_MODE=live \
-e GGML_BACKEND_PATH="/usr/local/lib/libggml-cuda.so" \
-v "$PWD/models:/workspace/models" \
-v "$PWD/output:/workspace/output" \
-v "$PWD/logs:/workspace/logs" \
biergarten-pipeline:latest
```
The model must be present at `./models/google_gemma-4-E4B-it-Q6_K.gguf` on the
host. See [Model](#model) above for the download command.
### RunPod deployment
Use a GPU pod template. Mount persistent storage for `/workspace/models`,
`/workspace/output`, and `/workspace/logs`. Set `BIERGARTEN_MODE=live` in the
template environment. See `tooling/pipeline/runpod/pod-template.yaml` for a
starter template.
--- ---
@@ -299,18 +197,16 @@ code, latitude, and longitude for each entry.
## Tech Stack ## Tech Stack
- C++20 - C++20
- CMake 3.31+ - CMake 3.24+
- Boost.JSON, Boost.ProgramOptions, Boost.DI - Boost.JSON, Boost.ProgramOptions, Boost.DI
- spdlog - spdlog
- cpp-httplib (with OpenSSL) - libcurl
- SQLite amalgamation fetched and compiled via CMake FetchContent - SQLite amalgamation fetched and compiled via CMake FetchContent
- llama.cpp (auto-detected from system install or fetched via FetchContent) - llama.cpp
- Docker with NVIDIA CUDA 12.6 base image for GPU container builds
- RunPod for cloud GPU inference
The build fetches Boost.DI, spdlog, and SQLite via CMake. llama.cpp is fetched The build fetches Boost.DI, spdlog, llama.cpp, and SQLite via CMake. Metal is
only when a system installation is not detected. Metal is enabled on Apple enabled on Apple Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit
Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present. is present.
> **Code Style:** Modern C++20 throughout — RAII for ownership, > **Code Style:** Modern C++20 throughout — RAII for ownership,
> `std::unique_ptr` for injected dependencies, `std::optional` for parse > `std::unique_ptr` for injected dependencies, `std::optional` for parse
@@ -322,7 +218,7 @@ Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
## Tested Hardware ## Tested Hardware
### ARM macOS M1 Pro ### ARM macOS - M1 Pro
| | | | | |
| --------- | --------------------------------- | | --------- | --------------------------------- |
@@ -333,7 +229,7 @@ Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
| Model | Gemma 4 E4B | | Model | Gemma 4 E4B |
| Inference | llama.cpp with Metal | | Inference | llama.cpp with Metal |
### x86_64 Linux NVIDIA RTX 2000 ### x86_64 Linux - NVIDIA RTX 2000
| | | | | |
| --------- | ------------------------------ | | --------- | ------------------------------ |
@@ -344,15 +240,6 @@ Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
| Model | Gemma 4 E4B | | Model | Gemma 4 E4B |
| Inference | llama.cpp with CUDA 12.x | | Inference | llama.cpp with CUDA 12.x |
### x86_64 Linux — Docker / RunPod (NVIDIA CUDA)
| | |
| --------- | ------------------------------------------- |
| Host | RunPod GPU pod |
| Base | nvidia/cuda:12.6.3-devel-ubuntu24.04 |
| Model | Gemma 4 E4B Q6_K |
| Inference | llama.cpp prebuilt CUDA backends via dlopen |
--- ---
## Fixture Strategy ## Fixture Strategy
@@ -373,9 +260,8 @@ Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
| `includes/` | Public headers and shared models. | | `includes/` | Public headers and shared models. |
| `src/` | Implementation files. | | `src/` | Implementation files. |
| `locations.json` | Curated city input copied into the build tree. | | `locations.json` | Curated city input copied into the build tree. |
| `prompts/` | System prompts used by the model-backed path. | | `prompts/` | System prompt used by the model-backed path. |
| `diagrams/` | Architecture and pipeline diagrams. | | `diagrams/` | Architecture and pipeline diagrams. |
| `tooling/pipeline/runpod/` | Dockerfile, launcher, and RunPod pod template. |
| `ETHICS-AND-KNOWN-ISSUES.md` | Ethics, bias, hallucination analysis, mitigations. | | `ETHICS-AND-KNOWN-ISSUES.md` | Ethics, bias, hallucination analysis, mitigations. |
--- ---
@@ -390,7 +276,6 @@ Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
- `src/data_generation/llama/` — local inference, prompt loading, output - `src/data_generation/llama/` — local inference, prompt loading, output
validation. validation.
- `src/data_generation/mock/` — deterministic fallback. - `src/data_generation/mock/` — deterministic fallback.
- `tooling/pipeline/runpod/` — container build and runtime launcher.
--- ---


@@ -29,7 +29,7 @@ if (Are arguments valid?) then (no)
else (yes) else (yes)
endif endif
:Init OpenSSL global state & LlamaBackendState; :Init CurlGlobalState & LlamaBackendState;
:di::make_injector(...); :di::make_injector(...);
:injector.create<std::unique_ptr<BiergartenDataGenerator>>(); :injector.create<std::unique_ptr<BiergartenDataGenerator>>();
:BiergartenDataGenerator::Run(); :BiergartenDataGenerator::Run();


@@ -26,7 +26,6 @@ skinparam note {
title The Biergarten Data Pipeline - Class Diagram title The Biergarten Data Pipeline - Class Diagram
class BiergartenDataGenerator { class BiergartenDataGenerator {
- logger_ : std::shared_ptr<ILogger>
- context_service_ : std::unique_ptr<IEnrichmentService> - context_service_ : std::unique_ptr<IEnrichmentService>
- generator_ : std::unique_ptr<DataGenerator> - generator_ : std::unique_ptr<DataGenerator>
- exporter_ : std::unique_ptr<IExportService> - exporter_ : std::unique_ptr<IExportService>
@@ -37,46 +36,6 @@ class BiergartenDataGenerator {
- LogResults() : void - LogResults() : void
} }
class LogLevel <<enumeration>> {
Debug
Info
Warn
Error
}
class PipelinePhase <<enumeration>> {
Startup
UserGeneration
BreweryAndBeerGeneration
CheckinGeneration
RatingGeneration
FollowGeneration
Teardown
}
struct LogEntry {
+ timestamp : std::chrono::system_clock::time_point
+ level : LogLevel
+ phase : PipelinePhase
+ message : std::string
+ worker : std::optional<std::string>
}
interface ILogger <<interface>> {
+ Log(entry : const LogEntry&) : void
}
class LogProducer {
- channel_ : BoundedChannel<LogEntry>&
+ Log(entry : const LogEntry&) : void
}
class LogDispatcher {
- channel_ : BoundedChannel<LogEntry>&
+ Run() : void
- ToSpdlogLevel(level) : spdlog::level::level_enum
}
interface IEnrichmentService <<interface>> { interface IEnrichmentService <<interface>> {
+ GetLocationContext(loc : const Location&) : std::string + GetLocationContext(loc : const Location&) : std::string
} }
@@ -93,7 +52,7 @@ interface WebClient <<interface>> {
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
class HttpWebClient { class CURLWebClient {
+ Get(url : const std::string&) : std::string + Get(url : const std::string&) : std::string
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
@@ -164,21 +123,14 @@ class SystemDateTimeProvider {
} }
' Structural Relationships / Dependency Injection ' Structural Relationships / Dependency Injection
BiergartenDataGenerator *-- ILogger : owns
BiergartenDataGenerator *-- IEnrichmentService : owns BiergartenDataGenerator *-- IEnrichmentService : owns
BiergartenDataGenerator *-- DataGenerator : owns BiergartenDataGenerator *-- DataGenerator : owns
BiergartenDataGenerator *-- IExportService : owns BiergartenDataGenerator *-- IExportService : owns
LogEntry *-- LogLevel
LogEntry *-- PipelinePhase
ILogger <|.. LogProducer : implements
LogProducer ..> LogEntry : emits
LogDispatcher ..> LogEntry : consumes
IEnrichmentService <|.. WikipediaService : implements IEnrichmentService <|.. WikipediaService : implements
WikipediaService *-- WebClient : owns WikipediaService *-- WebClient : owns
WebClient <|.. HttpWebClient : implements WebClient <|.. CURLWebClient : implements
DataGenerator <|.. MockGenerator : implements DataGenerator <|.. MockGenerator : implements
DataGenerator <|.. LlamaGenerator : implements DataGenerator <|.. LlamaGenerator : implements


@@ -13,7 +13,7 @@ if (Invalid args?) then (yes)
stop stop
else (no) else (no)
endif endif
:Init OpenSSL global state & LlamaBackendState; :Init CurlGlobalState & LlamaBackendState;
:Build DI injector; :Build DI injector;
:Initialize SqliteExportService; :Initialize SqliteExportService;


@@ -1,4 +1,4 @@
@startuml class_diagram @startuml
' ========================================== ' ==========================================
' CONFIGURATION & STYLING ' CONFIGURATION & STYLING
@@ -8,8 +8,6 @@ skinparam classAttributeFontSize 9
skinparam defaultFontSize 25 skinparam defaultFontSize 25
skinparam titleFontSize 30 skinparam titleFontSize 30
title Biergarten Data Pipeline — Class Diagram
package "Domain: Models" { package "Domain: Models" {
class Location { class Location {
@@ -143,7 +141,7 @@ package "Domain: Models" {
LocationContext *-- Completeness LocationContext *-- Completeness
} }
@startuml
package "Domain: Application Configuration" { package "Domain: Application Configuration" {
class SamplingOptions { class SamplingOptions {
+ temperature: float = 1.0F + temperature: float = 1.0F
@@ -169,10 +167,12 @@ package "Domain: Application Configuration" {
+ pipeline: PipelineOptions + pipeline: PipelineOptions
} }
' --- Domain Model Relationships ---
ApplicationOptions *-- GeneratorOptions ApplicationOptions *-- GeneratorOptions
ApplicationOptions *-- PipelineOptions ApplicationOptions *-- PipelineOptions
GeneratorOptions o-- SamplingOptions GeneratorOptions o-- SamplingOptions
} }
@enduml
package "Domain: Policy" { package "Domain: Policy" {
@@ -275,29 +275,33 @@ package "Infrastructure: Logging" {
+ level : LogLevel + level : LogLevel
+ phase : PipelinePhase + phase : PipelinePhase
+ message : std::string + message : std::string
+ city : std::optional<std::string>
+ entity_id : std::optional<std::string>
+ worker : std::optional<std::string> + worker : std::optional<std::string>
} }
interface ILogger <<interface>> { interface Logger <<interface>> {
+ Log(entry : const LogEntry&) : void + Log(level, phase, message,\n city, entity_id, worker) : void
} }
class LogProducer { class PipelineLogger {
- channel_ : BoundedChannel<LogEntry>& - log_ch_ : BoundedChannel<LogEntry>&
+ Log(entry : const LogEntry&) : void + Log(level, phase, message,\n city, entity_id, worker) : void
} }
class LogDispatcher { class LogWorker {
- channel_ : BoundedChannel<LogEntry>& - log_ch_ : BoundedChannel<LogEntry>&
+ Run() : void + Run() : void
- FormatTimestamp(tp) : std::string
- ToSpdlogLevel(level) : spdlog::level::level_enum - ToSpdlogLevel(level) : spdlog::level::level_enum
- ToString(phase) : std::string
} }
' --- Logging Relationships ---
LogEntry *-- LogLevel LogEntry *-- LogLevel
LogEntry *-- PipelinePhase LogEntry *-- PipelinePhase
ILogger <|.. LogProducer PipelineLogger ..> LogEntry : emits
LogProducer ..> LogEntry : emits LogWorker ..> LogEntry : consumes
LogDispatcher ..> LogEntry : consumes
} }
package "Infrastructure: Pipeline Channel" { package "Infrastructure: Pipeline Channel" {
@@ -352,29 +356,13 @@ package "Infrastructure: Enrichment" {
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
class HttpWebClient { class CURLWebClient {
+ Get(url : const std::string&) : std::string + Get(url : const std::string&) : std::string
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
} }
package "Infrastructure: Prompting" {
interface IPromptDirectory <<interface>> {
+ Load(key : std::string_view) : std::string
}
class PromptDirectory {
- prompt_dir_ : std::filesystem::path
- cache_ : std::unordered_map<std::string, std::string>
+ PromptDirectory(prompt_dir : const std::filesystem::path&)
+ Load(key : std::string_view) : std::string
}
IPromptDirectory <|.. PromptDirectory
}
package "Infrastructure: Data Generation" { package "Infrastructure: Data Generation" {
interface DataGenerator <<interface>> { interface DataGenerator <<interface>> {
@@ -398,7 +386,6 @@ package "Infrastructure: Data Generation" {
- model_ : ModelHandle - model_ : ModelHandle
- context_ : ContextHandle - context_ : ContextHandle
- prompt_formatter_ : std::unique_ptr<PromptFormatter> - prompt_formatter_ : std::unique_ptr<PromptFormatter>
- prompt_directory_ : std::unique_ptr<IPromptDirectory>
- rng_ : std::mt19937 - rng_ : std::mt19937
+ GenerateBrewery(...) : BreweryResult + GenerateBrewery(...) : BreweryResult
+ GenerateBeer(...) : BeerResult + GenerateBeer(...) : BeerResult
@@ -472,6 +459,8 @@ package "Infrastructure: Data Export" {
} }
class BiergartenPipelineOrchestrator { class BiergartenPipelineOrchestrator {
- preloader_ : std::unique_ptr<DataPreloader> - preloader_ : std::unique_ptr<DataPreloader>
- enrichment_service_ : std::unique_ptr<EnrichmentService> - enrichment_service_ : std::unique_ptr<EnrichmentService>
@@ -531,7 +520,7 @@ CheckinDistributionStrategy <|.. RandomCheckinStrategy
FollowGenerationStrategy <|.. RandomFollowStrategy FollowGenerationStrategy <|.. RandomFollowStrategy
FollowGenerationStrategy <|.. ActivityWeightedFollowStrategy FollowGenerationStrategy <|.. ActivityWeightedFollowStrategy
EnrichmentService <|.. WikipediaService EnrichmentService <|.. WikipediaService
WebClient <|.. HttpWebClient WebClient <|.. CURLWebClient
DataGenerator <|.. MockGenerator DataGenerator <|.. MockGenerator
DataGenerator <|.. LlamaGenerator DataGenerator <|.. LlamaGenerator
PromptFormatter <|.. Gemma4JinjaPromptFormatter PromptFormatter <|.. Gemma4JinjaPromptFormatter
@@ -542,7 +531,6 @@ DateTimeProvider <|.. SystemDateTimeProvider
WikipediaService *-- WebClient WikipediaService *-- WebClient
WikipediaService ..> ContextStrategy WikipediaService ..> ContextStrategy
LlamaGenerator *-- PromptFormatter LlamaGenerator *-- PromptFormatter
LlamaGenerator *-- IPromptDirectory
LlamaGenerator ..> GeneratorOptions LlamaGenerator ..> GeneratorOptions
SqliteExportService *-- DateTimeProvider SqliteExportService *-- DateTimeProvider
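
The logging classes in the class diagram above hold a `BoundedChannel<LogEntry>&`: one side emits entries and the other consumes them in `Run()`. The channel's real interface is not part of this diff, so the following is only a sketch of one common way to build such a channel with a mutex and two condition variables. The method names `Send`, `Receive`, and `Close` are hypothetical.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical stand-in for the diagram's BoundedChannel<T>.
// Send blocks while the queue is full; Receive blocks while it is empty.
template <typename T>
class BoundedChannel {
public:
    explicit BoundedChannel(std::size_t capacity) : capacity_(capacity) {}

    // Returns false if the channel was closed before the value could be queued.
    bool Send(T value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return closed_ || queue_.size() < capacity_; });
        if (closed_) {
            return false;
        }
        queue_.push_back(std::move(value));
        not_empty_.notify_one();
        return true;
    }

    // Returns std::nullopt once the channel is closed and fully drained.
    std::optional<T> Receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return closed_ || !queue_.empty(); });
        if (queue_.empty()) {
            return std::nullopt;
        }
        T value = std::move(queue_.front());
        queue_.pop_front();
        not_full_.notify_one();
        return value;
    }

    // Wakes all waiters; queued values can still be received after closing.
    void Close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        not_full_.notify_all();
        not_empty_.notify_all();
    }

private:
    const std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
    std::deque<T> queue_;
    bool closed_ = false;
};
```

In this shape a consumer loop can be written as `while (auto entry = channel.Receive()) { ... }`, which exits cleanly after `Close()` once the backlog is drained. The bounded capacity applies back-pressure to producers instead of letting the log queue grow without limit.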


@@ -1,9 +0,0 @@
build/
cmake-build-debug/
.git/
.idea/
**/*.sqlite
**/*.log
**/*.sqlite3
**/*.db


@@ -1,20 +1,13 @@
cmake_minimum_required(VERSION 3.31) cmake_minimum_required(VERSION 3.24)
project(biergarten-pipeline) project(biergarten-pipeline)
# Set policy to allow FetchContent_Populate for header-only libraries set(CMAKE_POLICY_VERSION_MINIMUM 3.5 CACHE STRING "" FORCE)
# that have outdated CMakeLists.txt files
cmake_policy(SET CMP0169 OLD)
# 1. Build Options # =============================================================================
# 1. Platform & GPU Detection
option(BIERGARTEN_MOCK_ONLY "Build with mock data generators only — skips llama.cpp" OFF) # =============================================================================
if(BIERGARTEN_MOCK_ONLY) if(WIN32)
message(STATUS "[biergarten] MOCK_ONLY build — llama.cpp will not be compiled.") message(FATAL_ERROR "[biergarten] Windows is currently not supported. Please use Linux (Fedora 43) or macOS (M1 Pro).")
endif()
# 2. Platform & GPU Detection
if(NOT UNIX)
message(FATAL_ERROR "[biergarten] Windows is not supported. Please use Linux (Fedora 43) or macOS (M1 Pro).")
endif() endif()
if(APPLE) if(APPLE)
@@ -25,15 +18,15 @@ if(APPLE)
message(STATUS "[biergarten] Intel Mac detected — using CPU / Accelerate framework.") message(STATUS "[biergarten] Intel Mac detected — using CPU / Accelerate framework.")
set(GGML_METAL OFF CACHE BOOL "Disable Metal for Intel Macs" FORCE) set(GGML_METAL OFF CACHE BOOL "Disable Metal for Intel Macs" FORCE)
endif() endif()
else() elseif(UNIX AND NOT APPLE)
find_package(CUDAToolkit QUIET) find_package(CUDAToolkit QUIET)
find_package(hip CONFIG QUIET) find_package(HIP QUIET)
if(CUDAToolkit_FOUND) if(CUDAToolkit_FOUND)
message(STATUS "[biergarten] NVIDIA GPU detected — enabling CUDA acceleration.") message(STATUS "[biergarten] NVIDIA GPU detected — enabling CUDA acceleration.")
set(GGML_CUDA ON CACHE BOOL "Enable CUDA for NVIDIA GPUs" FORCE) set(GGML_CUDA ON CACHE BOOL "Enable CUDA for NVIDIA GPUs" FORCE)
set(CMAKE_CUDA_ARCHITECTURES native) set(CMAKE_CUDA_ARCHITECTURES native)
elseif(hip_FOUND OR DEFINED ENV{ROCM_PATH} OR EXISTS "/opt/rocm") elseif(HIP_FOUND OR EXISTS "/opt/rocm")
message(STATUS "[biergarten] AMD GPU detected — enabling HIP/ROCm acceleration.") message(STATUS "[biergarten] AMD GPU detected — enabling HIP/ROCm acceleration.")
set(GGML_HIPBLAS ON CACHE BOOL "Enable HIP for AMD GPUs" FORCE) set(GGML_HIPBLAS ON CACHE BOOL "Enable HIP for AMD GPUs" FORCE)
else() else()
@@ -41,79 +34,71 @@ else()
endif() endif()
endif() endif()
# 3. Project-wide Settings # =============================================================================
# 2. Project-wide Settings (Standard & Optimization)
# =============================================================================
set(CMAKE_CXX_STANDARD 20) set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON) set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON) set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
# Release Build Optimization: Aggressive (-O3), Arch-specific, and LTO
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3 -march=native -flto") set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3 -march=native -flto")
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -Og -g") set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -Og -g")
# 4. Dependencies # =============================================================================
# 3. Dependencies
# =============================================================================
include(FetchContent) include(FetchContent)
# Boost (system install — via dnf/brew) find_package(CURL QUIET)
find_package(Boost REQUIRED COMPONENTS json program_options) if(NOT CURL_FOUND)
message(FATAL_ERROR "[biergarten] libcurl not found. Install it (e.g. 'sudo dnf install libcurl-devel').")
# Boost.DI (unofficial Boost extension, must declare separately from main Boost dependency)
# Header-only library, so we only fetch without invoking its CMakeLists.txt
FetchContent_Declare(
boost-di
GIT_REPOSITORY https://github.com/boost-ext/di.git
GIT_TAG v1.3.0
GIT_SHALLOW TRUE
)
FetchContent_GetProperties(boost-di)
if(NOT boost-di_POPULATED)
FetchContent_Populate(boost-di)
endif() endif()
add_library(boost_di INTERFACE) # Require system Boost for JSON and Program Options to speed up build times
add_library(boost::di ALIAS boost_di) find_package(Boost REQUIRED COMPONENTS json program_options)
target_include_directories(boost_di INTERFACE
$<BUILD_INTERFACE:${boost-di_SOURCE_DIR}/include>
)
# SQLite amalgamation
FetchContent_Declare( FetchContent_Declare(
sqlite_amalgamation sqlite_amalgamation
URL https://www.sqlite.org/2026/sqlite-amalgamation-3530000.zip URL https://www.sqlite.org/2026/sqlite-amalgamation-3530000.zip
URL_HASH SHA3_256=c2325c53b3b41761469f91cfb078e96882ac5d85bac10c11b0bd8f253b031e5b URL_HASH SHA3_256=c2325c53b3b41761469f91cfb078e96882ac5d85bac10c11b0bd8f253b031e5b
EXCLUDE_FROM_ALL
) )
FetchContent_MakeAvailable(sqlite_amalgamation) FetchContent_GetProperties(sqlite_amalgamation)
if(NOT TARGET sqlite3) if(NOT sqlite_amalgamation_POPULATED)
add_library(sqlite3 STATIC ${sqlite_amalgamation_SOURCE_DIR}/sqlite3.c) FetchContent_Populate(sqlite_amalgamation)
target_include_directories(sqlite3 PUBLIC ${sqlite_amalgamation_SOURCE_DIR})
target_compile_definitions(sqlite3 PUBLIC SQLITE_THREADSAFE=1)
endif() endif()
# llama.cpp — skipped for mock-only builds if(NOT TARGET sqlite3)
if(NOT BIERGARTEN_MOCK_ONLY) add_library(sqlite3 STATIC
find_library(LLAMA_LIB NAMES llama) ${sqlite_amalgamation_SOURCE_DIR}/sqlite3.c
find_library(GGML_LIB NAMES ggml)
find_library(GGML_BASE_LIB NAMES ggml-base)
find_path(LLAMA_INC_DIR NAMES llama.h PATH_SUFFIXES include)
if(LLAMA_LIB AND GGML_LIB AND GGML_BASE_LIB AND LLAMA_INC_DIR)
message(STATUS "[biergarten] Found system llama.cpp — skipping FetchContent")
add_library(llama SHARED IMPORTED)
set_target_properties(llama PROPERTIES
IMPORTED_LOCATION "${LLAMA_LIB}"
INTERFACE_INCLUDE_DIRECTORIES "${LLAMA_INC_DIR}"
INTERFACE_LINK_LIBRARIES "${GGML_LIB};${GGML_BASE_LIB}"
) )
else() target_include_directories(sqlite3 PUBLIC
message(STATUS "[biergarten] System llama.cpp not found — fetching via FetchContent") ${sqlite_amalgamation_SOURCE_DIR}
FetchContent_Declare( )
target_compile_definitions(sqlite3 PUBLIC
SQLITE_THREADSAFE=1
)
endif()
FetchContent_Declare(
llama-cpp llama-cpp
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp.git GIT_REPOSITORY https://github.com/ggml-org/llama.cpp.git
GIT_TAG b9012 GIT_TAG b8742
) )
FetchContent_MakeAvailable(llama-cpp) FetchContent_MakeAvailable(llama-cpp)
endif()
FetchContent_Declare(
boost-di
GIT_REPOSITORY https://github.com/boost-ext/di.git
GIT_TAG v1.3.0
)
FetchContent_MakeAvailable(boost-di)
if(TARGET Boost.DI AND NOT TARGET boost::di)
add_library(boost::di ALIAS Boost.DI)
endif() endif()
# spdlog
FetchContent_Declare( FetchContent_Declare(
spdlog spdlog
GIT_REPOSITORY https://github.com/gabime/spdlog.git GIT_REPOSITORY https://github.com/gabime/spdlog.git
@@ -121,148 +106,73 @@ FetchContent_Declare(
) )
FetchContent_MakeAvailable(spdlog) FetchContent_MakeAvailable(spdlog)
# cpp-httplib — header-only HTTP/HTTPS client replacing libcurl. # =============================================================================
# OpenSSL is required for HTTPS (Wikipedia API). find_package locates # 4. Sources
# libssl/libcrypto; HTTPLIB_REQUIRE_OPENSSL causes a hard build failure # =============================================================================
# if OpenSSL is absent rather than silently producing an HTTP-only binary. set(SOURCES
find_package(OpenSSL REQUIRED)
FetchContent_Declare(
cpp-httplib
GIT_REPOSITORY https://github.com/yhirose/cpp-httplib.git
GIT_TAG v0.43.2
GIT_SHALLOW TRUE
SYSTEM
)
set(HTTPLIB_REQUIRE_OPENSSL ON CACHE BOOL "Require OpenSSL for cpp-httplib" FORCE)
FetchContent_MakeAvailable(cpp-httplib)
# 5. Executable & Sources
add_executable(${PROJECT_NAME}
includes/services/enrichment/mock_enrichment.h
includes/json_handling/pretty_print.h)
# --- Entry point ---
target_sources(${PROJECT_NAME} PRIVATE
src/main.cc src/main.cc
) src/biergarten_data_generator/biergarten_data_generator.cc
src/biergarten_data_generator/run.cc
# --- json_handling --- src/biergarten_data_generator/query_cities_with_countries.cc
target_sources(${PROJECT_NAME} PRIVATE src/biergarten_data_generator/generate_breweries.cc
src/json_handling/json_loader.cc src/biergarten_data_generator/log_results.cc
) src/services/wikipedia/wikipedia_service.cc
src/services/wikipedia/get_summary.cc
# --- application_options --- src/services/wikipedia/fetch_extract.cc
target_sources(${PROJECT_NAME} PRIVATE src/services/sqlite/sqlite_export_service.cc
src/application_options/parse_arguments.cc src/services/sqlite/build_database_path.cc
) src/services/sqlite/process_record.cc
src/services/sqlite/initialize.cc
# --- biergarten_pipeline_orchestrator --- src/services/sqlite/finalize.cc
target_sources(${PROJECT_NAME} PRIVATE src/web_client/curl_global_state.cc
src/biergarten_pipeline_orchestrator/log_results.cc src/web_client/curl_web_client_get.cc
src/biergarten_pipeline_orchestrator/biergarten_pipeline_orchestrator.cc src/web_client/curl_web_client_url_encode.cc
src/biergarten_pipeline_orchestrator/generate_breweries.cc src/data_generation/llama/llama_generator.cc
src/biergarten_pipeline_orchestrator/run.cc src/data_generation/llama/generate_brewery.cc
src/biergarten_pipeline_orchestrator/query_cities_with_countries.cc src/data_generation/llama/generate_user.cc
) src/data_generation/llama/helpers.cc
src/data_generation/llama/infer.cc
# --- web_client --- src/data_generation/llama/load.cc
target_sources(${PROJECT_NAME} PRIVATE src/services/prompt_directory.cc
src/web_client/http_web_client.cc
)
# --- data_generation: prompt_formatting ---
target_sources(${PROJECT_NAME} PRIVATE
src/data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.cc src/data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.cc
) src/data_generation/mock/deterministic_hash.cc
# --- data_generation: mock ---
target_sources(${PROJECT_NAME} PRIVATE
src/data_generation/mock/generate_brewery.cc src/data_generation/mock/generate_brewery.cc
src/data_generation/mock/generate_user.cc src/data_generation/mock/generate_user.cc
src/data_generation/mock/deterministic_hash.cc src/json_handling/json_loader.cc
src/services/sqlite/helpers/sqlite_connection_helpers.cpp
src/services/sqlite/helpers/sqlite_statement_helpers.cpp
) )
# --- data_generation: llama (skipped for mock-only builds) --- # =============================================================================
if(NOT BIERGARTEN_MOCK_ONLY) # 5. Target
target_sources(${PROJECT_NAME} PRIVATE # =============================================================================
src/data_generation/llama/load.cc add_executable(${PROJECT_NAME} ${SOURCES})
src/data_generation/llama/helpers.cc
src/data_generation/llama/generate_brewery.cc
src/data_generation/llama/infer.cc
src/data_generation/llama/llama_generator.cc
src/data_generation/llama/generate_user.cc
)
endif()
# --- services: wikipedia ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/enrichment/wikipedia/wikipedia_service.cc
src/services/enrichment/wikipedia/fetch_extract.cc
src/services/enrichment/wikipedia/get_summary.cc
)
# --- services: sqlite ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/sqlite/process_record.cc
src/services/sqlite/sqlite_export_service.cc
src/services/sqlite/finalize.cc
src/services/sqlite/initialize.cc
src/services/sqlite/helpers/sqlite_connection_helpers.cc
src/services/sqlite/helpers/sqlite_statement_helpers.cc
)
# --- services: logging ---
target_sources(${PROJECT_NAME} PRIVATE
"src/services/logging/log_producer.cc"
src/services/logging/log_dispatcher.cc
)
# --- services (top-level) ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/prompt_directory.cc
)
# 6. Include Directories, Link Libraries & Compile Definitions
target_include_directories(${PROJECT_NAME} PRIVATE target_include_directories(${PROJECT_NAME} PRIVATE
includes includes
${llama-cpp_SOURCE_DIR}/include
${llama-cpp_SOURCE_DIR}/common
) )
target_link_libraries(${PROJECT_NAME} PRIVATE target_link_libraries(${PROJECT_NAME} PRIVATE
$<$<NOT:$<BOOL:${BIERGARTEN_MOCK_ONLY}>>:llama> llama
boost::di boost::di
Boost::json Boost::json
Boost::program_options Boost::program_options
spdlog::spdlog spdlog::spdlog
sqlite3 sqlite3
httplib::httplib CURL::libcurl
OpenSSL::SSL
OpenSSL::Crypto
) )
target_compile_definitions(${PROJECT_NAME} PRIVATE # =============================================================================
# Defined when -DBIERGARTEN_MOCK_ONLY=ON — skips llama.cpp entirely. # 6. Runtime Assets
# Use #ifdef BIERGARTEN_MOCK_ONLY in source to guard llama-specific code. # =============================================================================
$<$<BOOL:${BIERGARTEN_MOCK_ONLY}>:BIERGARTEN_MOCK_ONLY>
# Defined for Debug configuration builds.
# Use #ifdef DEBUG in source to enable debug-only behaviour (e.g. verbose logging).
$<$<CONFIG:Debug>:DEBUG>
)
target_compile_options(biergarten-pipeline PRIVATE
-fmacro-prefix-map=${CMAKE_SOURCE_DIR}/tooling/pipeline/src/=
)
# 7. Runtime Assets
configure_file( configure_file(
${CMAKE_SOURCE_DIR}/locations.json ${CMAKE_SOURCE_DIR}/locations.json
${CMAKE_BINARY_DIR}/locations.json ${CMAKE_BINARY_DIR}/locations.json
COPYONLY COPYONLY
) )
add_custom_command(TARGET ${PROJECT_NAME} POST_BUILD add_custom_command(TARGET ${PROJECT_NAME} POST_BUILD
COMMAND ${CMAKE_COMMAND} -E copy_directory COMMAND ${CMAKE_COMMAND} -E copy_directory
${CMAKE_SOURCE_DIR}/prompts ${CMAKE_SOURCE_DIR}/prompts
${CMAKE_BINARY_DIR}/prompts ${CMAKE_BINARY_DIR}/prompts
) )

View File

@@ -0,0 +1,83 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_BIERGARTEN_DATA_GENERATOR_H_
#define BIERGARTEN_PIPELINE_INCLUDES_BIERGARTEN_DATA_GENERATOR_H_
/**
* @file biergarten_data_generator.h
* @brief Core orchestration class for pipeline data generation.
*/
#include <memory>
#include <span>
#include <vector>
#include "data_generation/data_generator.h"
#include "data_model/enriched_city.h"
#include "data_model/generated_brewery.h"
#include "data_model/location.h"
#include "services/enrichment_service.h"
#include "services/export_service.h"
/**
* @brief Main data generator class for the Biergarten pipeline.
*
* This class encapsulates the core logic for generating brewery data.
* It handles location loading, city enrichment, and brewery generation.
*/
class BiergartenDataGenerator {
public:
/**
* @brief Construct a BiergartenDataGenerator with injected dependencies.
*
* @param context_service Context provider for sampled locations.
* @param generator Brewery and user data generator.
* @param exporter Storage backend for generated brewery data.
*/
BiergartenDataGenerator(std::unique_ptr<IEnrichmentService> context_service,
std::unique_ptr<DataGenerator> generator,
std::unique_ptr<IExportService> exporter);
/**
* @brief Run the data generation pipeline.
*
* Performs the following steps:
* 1. Load curated locations from JSON
* 2. Resolve context for each city using the injected context service
* 3. Generate brewery data for sampled cities
*
* @return true if successful, false if not
*/
bool Run();
private:
/// @brief Owning context provider dependency.
std::unique_ptr<IEnrichmentService> context_service_;
/// @brief Generator dependency selected in the composition root.
std::unique_ptr<DataGenerator> generator_;
/// @brief Storage backend for generated brewery records.
std::unique_ptr<IExportService> exporter_;
/**
* @brief Load locations from JSON and sample cities.
*
* @return Vector of sampled locations capped at 50 entries.
*/
static std::vector<Location> QueryCitiesWithCountries();
/**
* @brief Generate breweries for enriched cities.
*
* @param cities Span of enriched city data.
*/
void GenerateBreweries(std::span<const EnrichedCity> cities);
/**
* @brief Log the generated brewery results.
*/
void LogResults() const;
/// @brief Stores generated brewery data.
std::vector<GeneratedBrewery> generated_breweries_;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_BIERGARTEN_DATA_GENERATOR_H_
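The steps documented for `Run()` can be sketched roughly as follows. This is a minimal, self-contained illustration and not the project's implementation: the structs are trimmed stand-ins, the JSON loading step is replaced by a caller-supplied vector, and `StubEnrichment`, `StubGenerator`, `GenerateBrewery`, and `RunPipelineSketch` are hypothetical names. Only `GetLocationContext` mirrors a signature that appears elsewhere in this diff.

```cpp
#include <string>
#include <vector>

// Trimmed stand-ins for the data_model structs.
struct Location { std::string city; std::string country; };
struct EnrichedCity { Location location; std::string region_context; };
struct BreweryResult { std::string name_en; };
struct GeneratedBrewery { Location location; BreweryResult brewery; };

// Assumed interface shape; the virtual destructor is an assumption.
class IEnrichmentService {
 public:
  virtual ~IEnrichmentService() = default;
  virtual std::string GetLocationContext(const Location& loc) = 0;
};

// Hypothetical enrichment stub that fabricates a context string.
class StubEnrichment final : public IEnrichmentService {
 public:
  std::string GetLocationContext(const Location& loc) override {
    return "context for " + loc.city;
  }
};

// Hypothetical stand-in for a DataGenerator implementation.
class StubGenerator {
 public:
  BreweryResult GenerateBrewery(const EnrichedCity& city) const {
    return BreweryResult{city.location.city + " Brewing Co."};
  }
};

// Step 2 (resolve context) and step 3 (generate) for each location.
std::vector<GeneratedBrewery> RunPipelineSketch(
    const std::vector<Location>& locations, IEnrichmentService& enrichment,
    const StubGenerator& generator) {
  std::vector<GeneratedBrewery> breweries;
  breweries.reserve(locations.size());
  for (const Location& location : locations) {
    const EnrichedCity city{location, enrichment.GetLocationContext(location)};
    breweries.push_back(
        GeneratedBrewery{location, generator.GenerateBrewery(city)});
  }
  return breweries;
}
```

Passing the enrichment service by interface reference keeps the loop independent of whether a Wikipedia-backed or mock service is injected.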

View File

@@ -1,102 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_BIERGARTEN_DATA_GENERATOR_H_
#define BIERGARTEN_PIPELINE_INCLUDES_BIERGARTEN_DATA_GENERATOR_H_
/**
* @file biergarten_data_generator.h
* @brief Orchestration for end-to-end brewery data generation pipeline.
*
* Intent: Coordinates location loading, enrichment, and generation phases
* to produce a complete dataset. Coordinates dependencies via composition root.
*/
#include <memory>
#include <span>
#include <vector>
#include "data_generation/data_generator.h"
#include "data_model/generated_models.h"
#include "services/database/export_service.h"
#include "services/enrichment/enrichment_service.h"
#include "services/logging/logger.h"
/**
* @brief Main data generator class for the Biergarten pipeline.
*
* This class encapsulates the core logic for generating brewery data.
* It handles location loading, city enrichment, and brewery generation.
*/
class BiergartenPipelineOrchestrator {
public:
/**
* @brief Constructs the orchestrator with injected pipeline dependencies.
*
* @param logger Shared logger used to emit pipeline messages.
* @param context_service Provides regional context for locations.
* @param generator Implementation (Llama or Mock) for brewery/user generation.
* @param exporter Database backend for persisting generated records.
* @param application_options CLI configuration and paths.
*/
BiergartenPipelineOrchestrator(
std::shared_ptr<ILogger> logger,
std::unique_ptr<IEnrichmentService> context_service,
std::unique_ptr<DataGenerator> generator,
std::unique_ptr<IExportService> exporter,
const ApplicationOptions& application_options);
/**
* @brief Run the data generation pipeline.
*
* Performs the following steps:
* 1. Load curated locations from JSON
* 2. Resolve context for each city using the injected context service
* 3. Generate brewery data for sampled cities
*
* @note STRUCTURED CONCURRENCY REQUIREMENT:
* When this method moves to a multithreaded design, it MUST join every
* worker thread it spawns before returning (e.g. by using std::jthread or
* another structured concurrency primitive). This ensures workers do not
* attempt to log to a closed channel during application teardown.
*
* @return true if successful, false if not
*/
bool Run();
private:
/// @brief Logger instance for emitting pipeline messages.
std::shared_ptr<ILogger> logger_;
/// @brief Owning context provider dependency.
std::unique_ptr<IEnrichmentService> context_service_;
/// @brief Generator dependency selected in the composition root.
std::unique_ptr<DataGenerator> generator_;
/// @brief Storage backend for generated brewery records.
std::unique_ptr<IExportService> exporter_;
/// @brief CLI configuration: paths, model settings, generation parameters.
ApplicationOptions application_options_;
/**
* @brief Load locations from JSON and sample cities.
*
* @return Vector of sampled locations capped at 50 entries.
*/
std::vector<Location> QueryCitiesWithCountries();
/**
* @brief Generate breweries for enriched cities.
*
* @param cities Span of enriched city data.
*/
void GenerateBreweries(std::span<const EnrichedCity> cities);
/**
* @brief Log the generated brewery results.
*/
void LogResults() const;
/// @brief Stores generated brewery data.
std::vector<GeneratedBrewery> generated_breweries_;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_BIERGARTEN_DATA_GENERATOR_H_

View File

@@ -1,73 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_CONCURRENCY_BOUNDED_CHANNEL_H_
#define BIERGARTEN_PIPELINE_INCLUDES_CONCURRENCY_BOUNDED_CHANNEL_H_
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
/**
* @file bounded_channel.h
* @brief Thread-safe, bounded multi-producer/multi-consumer blocking channel.
*
* Intent: Enables inter-thread communication with backpressure. Producers
* block while the buffer is full and consumers block while it is empty.
*/
/**
* @class BoundedChannel
* @brief MPMC channel with fixed capacity and blocking semantics.
*
* Producers block when buffer is full; consumers block when empty.
* Close() unblocks all waiters and signals channel exhaustion.
*/
template <typename T>
class BoundedChannel {
// -------------------------------------------------------------------------
// Internal state — all access must be guarded by mutex_.
// -------------------------------------------------------------------------
std::queue<T> queue_;
std::mutex mutex_;
std::condition_variable not_full_;
std::condition_variable not_empty_;
std::size_t capacity_;
bool closed_ = false;
public:
/**
* @brief Construct a bounded channel with the given capacity.
* @param capacity Maximum number of items the channel may hold.
*/
explicit BoundedChannel(std::size_t capacity) : capacity_(capacity) {}
/**
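* @brief Send an item into the channel. Blocks while the channel is full.
* If the channel is closed, the item is discarded and the call returns.
* @param item Item to enqueue; taken by value and moved into the queue.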
*/
void Send(T item);
/**
* @brief Receive an item from the channel. Blocks when the channel is
* empty.
* @return std::optional<T> containing the item, or std::nullopt when the
* channel is closed and drained.
*/
std::optional<T> Receive();
/**
* @brief Close the channel and unblock all waiting threads. Idempotent.
*/
void Close();
};
// Include the template implementation
#include "bounded_channel.tcc"
#endif // BIERGARTEN_PIPELINE_INCLUDES_CONCURRENCY_BOUNDED_CHANNEL_H_

View File

@@ -1,57 +0,0 @@
#include "bounded_channel.h"
#include <utility>
template <typename T>
void BoundedChannel<T>::Send(T item) {
// Acquire exclusive ownership of the mutex; released automatically on scope exit.
std::unique_lock lock(mutex_);
// Block until there is space in the queue or the channel has been closed.
// The predicate guards against spurious wakeups.
not_full_.wait(lock, [&] { return queue_.size() < capacity_ || closed_; });
// If the channel was closed while waiting, discard the item and return.
if (closed_) return;
// Move the item into the queue to avoid an unnecessary copy.
queue_.push(std::move(item));
// Wake one blocked Receive() call to signal that data is now available.
not_empty_.notify_one();
}
template <typename T>
std::optional<T> BoundedChannel<T>::Receive() {
// Acquire exclusive ownership of the mutex.
std::unique_lock lock(mutex_);
// Block until the queue is non-empty or the channel has been closed.
// The predicate guards against spurious wakeups.
not_empty_.wait(lock, [&] { return !queue_.empty() || closed_; });
// If woken due to closure and no items remain, signal exhaustion via nullopt.
if (queue_.empty()) return std::nullopt;
// Move the front item out of the queue to avoid an unnecessary copy.
T item = std::move(queue_.front());
queue_.pop();
// Wake one blocked Send() call to signal that a slot has opened.
not_full_.notify_one();
return item;
}
template <typename T>
void BoundedChannel<T>::Close() {
// Acquire exclusive ownership of the mutex to ensure visibility of the flag.
std::unique_lock lock(mutex_);
// Mark the channel as closed; subsequent Send() calls will be dropped.
closed_ = true;
// Wake all blocked Send() callers so they can observe the closed flag and exit.
not_full_.notify_all();
// Wake all blocked Receive() callers so they can drain remaining items or return nullopt.
not_empty_.notify_all();
}
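The close-and-drain behaviour described in the comments above can be checked with a small single-threaded sketch. `MiniChannel` is a condensed, hypothetical stand-in for `BoundedChannel` so the example compiles on its own, and `DrainAfterClose` is an illustrative helper, not part of the codebase.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical, condensed stand-in for BoundedChannel<T>.
template <typename T>
class MiniChannel {
 public:
  explicit MiniChannel(std::size_t capacity) : capacity_(capacity) {}

  // Blocks while full; discards the item if the channel is closed.
  void Send(T item) {
    std::unique_lock lock(mutex_);
    not_full_.wait(lock, [&] { return queue_.size() < capacity_ || closed_; });
    if (closed_) return;
    queue_.push(std::move(item));
    not_empty_.notify_one();
  }

  // Blocks while empty; returns nullopt once closed and drained.
  std::optional<T> Receive() {
    std::unique_lock lock(mutex_);
    not_empty_.wait(lock, [&] { return !queue_.empty() || closed_; });
    if (queue_.empty()) return std::nullopt;
    T item = std::move(queue_.front());
    queue_.pop();
    not_full_.notify_one();
    return item;
  }

  // Marks the channel closed and wakes every waiter.
  void Close() {
    std::unique_lock lock(mutex_);
    closed_ = true;
    not_full_.notify_all();
    not_empty_.notify_all();
  }

 private:
  std::queue<T> queue_;
  std::mutex mutex_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
  std::size_t capacity_;
  bool closed_ = false;
};

// Sends two items, closes the channel, then drains it on the same thread.
// Nothing blocks: the queue never exceeds its capacity, and Receive() is
// only called while an item is queued or the channel is already closed.
std::vector<int> DrainAfterClose() {
  MiniChannel<int> channel(2);
  channel.Send(1);
  channel.Send(2);
  channel.Close();
  channel.Send(3);  // Discarded: the channel is already closed.
  std::vector<int> received;
  while (auto value = channel.Receive()) {
    received.push_back(*value);
  }
  return received;
}
```

In a multithreaded setting the same loop shape applies: consumers call `Receive()` until it yields `std::nullopt`, and the producer side calls `Close()` once it has no more items to send.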

View File

@@ -8,7 +8,9 @@
 #include <string>
-#include "data_model/generated_models.h"
+#include "data_model/brewery_result.h"
+#include "data_model/location.h"
+#include "data_model/user_result.h"
 /**
 * @brief Interface for data generator implementations.

View File

@@ -14,11 +14,10 @@
 #include <string>
 #include <string_view>
-#include "../services/prompting/prompt_directory.h"
 #include "data_generation/data_generator.h"
 #include "data_generation/prompt_formatting/prompt_formatter.h"
-#include "data_model/models.h"
-#include "services/logging/logger.h"
+#include "data_model/application_options.h"
+#include "services/prompt_directory.h"
 struct llama_model;
 struct llama_context;
@@ -38,7 +37,7 @@ class LlamaGenerator final : public DataGenerator {
 * @param prompt_directory Directory service for loading named prompt files.
 */
 LlamaGenerator(const ApplicationOptions& options,
-const std::string& model_path, std::shared_ptr<ILogger> logger,
+const std::string& model_path,
 std::unique_ptr<IPromptFormatter> prompt_formatter,
 std::unique_ptr<IPromptDirectory> prompt_directory);
@@ -130,8 +129,6 @@ class LlamaGenerator final : public DataGenerator {
 uint32_t sampling_top_k_ = kDefaultSamplingTopK;
 std::mt19937 rng_;
 uint32_t n_ctx_ = kDefaultContextSize;
-int n_gpu_layers_ = 0;
-std::shared_ptr<ILogger> logger_;
 std::unique_ptr<IPromptFormatter> prompt_formatter_;
 std::unique_ptr<IPromptDirectory> prompt_directory_;
 };

View File

@@ -12,7 +12,7 @@
 #include <string>
 #include <string_view>
-#include "data_model/generated_models.h"
+#include "data_model/brewery_result.h"
 struct llama_vocab;
 using llama_token = int32_t;

View File

@@ -1,5 +1,4 @@
-#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_GEMMA4_JINJA_PROMPT_FORMATTER_H_
-#define BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_GEMMA4_JINJA_PROMPT_FORMATTER_H_
+#pragma once
 #include <string>
 #include <string_view>
@@ -14,5 +13,3 @@ class Gemma4JinjaPromptFormatter final : public IPromptFormatter {
 [[nodiscard]] std::string Format(std::string_view system_prompt,
 std::string_view user_prompt) const override;
 };
-#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_GEMMA4_JINJA_PROMPT_FORMATTER_H_

View File

@@ -1,5 +1,4 @@
-#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_PROMPT_FORMATTER_H_
-#define BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_PROMPT_FORMATTER_H_
+#pragma once
 #include <string>
 #include <string_view>
@@ -16,5 +15,3 @@ class IPromptFormatter {
 [[nodiscard]] virtual std::string Format(
 std::string_view system_prompt, std::string_view user_prompt) const = 0;
 };
-#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_PROMPT_FORMATTER_H_
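To illustrate how an alternative formatter could plug into the `IPromptFormatter` interface, here is a minimal sketch. The `Format` signature is taken from the diff; the virtual destructor and the `PlainPromptFormatter` class are assumptions made for illustration, and the blank-line layout it produces is not the Gemma 4 template.

```cpp
#include <string>
#include <string_view>

// Interface shape as shown in the diff; the virtual destructor is assumed.
class IPromptFormatter {
 public:
  virtual ~IPromptFormatter() = default;
  [[nodiscard]] virtual std::string Format(
      std::string_view system_prompt, std::string_view user_prompt) const = 0;
};

// Hypothetical formatter: joins the two prompts with a blank line.
class PlainPromptFormatter final : public IPromptFormatter {
 public:
  [[nodiscard]] std::string Format(
      std::string_view system_prompt,
      std::string_view user_prompt) const override {
    std::string prompt;
    prompt.reserve(system_prompt.size() + user_prompt.size() + 2);
    prompt.append(system_prompt);
    prompt.append("\n\n");
    prompt.append(user_prompt);
    return prompt;
  }
};
```

Because `LlamaGenerator` receives its formatter as a `std::unique_ptr<IPromptFormatter>`, a class like this could be swapped in at the composition root without touching the generator.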

View File

@@ -0,0 +1,72 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_APPLICATION_OPTIONS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_APPLICATION_OPTIONS_H_
/**
* @file data_model/application_options.h
* @brief Program options for the Biergarten pipeline application.
*/
#include <cstdint>
#include <filesystem>
#include <optional>
#include <string>
/**
* @brief LLM sampling parameters.
*/
struct SamplingOptions {
/// @brief LLM sampling temperature (0.0 to 1.0, higher = more random).
float temperature = 1.0F;
/// @brief LLM nucleus sampling top-p parameter.
float top_p = 0.95F;
/// @brief LLM top-k sampling parameter.
uint32_t top_k = 64;
/// @brief Context window size (tokens).
uint32_t n_ctx = 8192;
/// @brief Random seed (-1 for random, otherwise non-negative).
int seed = -1;
};
/**
* @brief Configuration for the LLM generator component.
*/
struct GeneratorOptions {
/// @brief Path to the LLM model file (gguf format).
std::filesystem::path model_path;
/// @brief Use mocked generator instead of actual LLM inference.
bool use_mocked = false;
/// @brief Specific sampling parameters for this generator.
/// If nullopt, the application should use global defaults.
std::optional<SamplingOptions> sampling;
};
/**
* @brief Configuration for the pipeline execution and output.
*/
struct PipelineOptions {
/// @brief Directory for generated artifacts.
std::filesystem::path output_path;
/// @brief Directory that contains named prompt files (e.g.
/// BREWERY_GENERATION.md).
std::filesystem::path prompt_dir;
/// @brief Path for application logs.
std::filesystem::path log_path;
};
/**
* @brief Root configuration object for the Biergarten pipeline.
*/
struct ApplicationOptions {
GeneratorOptions generator;
PipelineOptions pipeline;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_APPLICATION_OPTIONS_H_
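The comments on `GeneratorOptions::sampling` and `SamplingOptions::seed` describe two fallbacks: use defaults when `sampling` is `nullopt`, and pick a random seed when `seed` is `-1`. A minimal sketch of both follows. The structs are trimmed copies, and `ResolveSampling` and `MakeRng` are hypothetical helpers; how the application actually resolves these values is not shown in this diff.

```cpp
#include <cstdint>
#include <optional>
#include <random>

// Trimmed copies of the option structs from application_options.h.
struct SamplingOptions {
  float temperature = 1.0F;
  float top_p = 0.95F;
  std::uint32_t top_k = 64;
  std::uint32_t n_ctx = 8192;
  int seed = -1;
};

struct GeneratorOptions {
  bool use_mocked = false;
  std::optional<SamplingOptions> sampling;
};

// Hypothetical helper: fall back to default-constructed sampling options
// when the caller supplied none.
SamplingOptions ResolveSampling(const GeneratorOptions& generator) {
  return generator.sampling.value_or(SamplingOptions{});
}

// Hypothetical helper: a negative seed requests a nondeterministic seed,
// any other value seeds the engine reproducibly.
std::mt19937 MakeRng(int seed) {
  if (seed < 0) {
    return std::mt19937(std::random_device{}());
  }
  return std::mt19937(static_cast<std::mt19937::result_type>(seed));
}
```

Keeping the defaults in the struct's member initializers means `value_or(SamplingOptions{})` and an explicitly supplied but unmodified `SamplingOptions` behave identically.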

View File

@@ -0,0 +1,22 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_LOCATION_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_LOCATION_H_
/**
* @file data_model/brewery_location.h
* @brief Non-owning brewery location input.
*/
#include <string_view>
/**
* @brief Non-owning brewery location input.
*/
struct BreweryLocation {
/// @brief City name.
std::string_view city_name;
/// @brief Country name.
std::string_view country_name;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_LOCATION_H_
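Because `BreweryLocation` holds `std::string_view` members, it borrows its text rather than copying it. The sketch below shows the resulting lifetime constraint. `Location` is a trimmed stand-in and `ToBreweryLocation` is a hypothetical helper, not a function from the codebase.

```cpp
#include <string>
#include <string_view>

// Trimmed stand-in for the owning Location record.
struct Location {
  std::string city;
  std::string country;
};

// Non-owning view type, as declared in brewery_location.h.
struct BreweryLocation {
  std::string_view city_name;
  std::string_view country_name;
};

// Hypothetical helper: the returned views point into `location`, so
// `location` must outlive the BreweryLocation built from it.
BreweryLocation ToBreweryLocation(const Location& location) {
  return BreweryLocation{location.city, location.country};
}
```

Building a `BreweryLocation` from a temporary `Location` would leave the views dangling, which is why the helper takes a reference to an object the caller keeps alive.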

View File

@@ -0,0 +1,28 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_RESULT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_RESULT_H_
/**
* @file data_model/brewery_result.h
* @brief Generated brewery payload.
*/
#include <string>
/**
* @brief Generated brewery payload.
*/
struct BreweryResult {
/// @brief Brewery display name in English.
std::string name_en;
/// @brief Brewery description text in English.
std::string description_en;
/// @brief Brewery display name in the local language.
std::string name_local;
/// @brief Brewery description text in the local language.
std::string description_local;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_RESULT_H_

View File

@@ -0,0 +1,21 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_ENRICHED_CITY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_ENRICHED_CITY_H_
/**
* @file data_model/enriched_city.h
* @brief Enriched city data with Wikipedia context.
*/
#include <string>
#include "data_model/location.h"
/**
* @brief Enriched city data with Wikipedia context.
*/
struct EnrichedCity {
Location location;
std::string region_context{};
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_ENRICHED_CITY_H_

View File

@@ -0,0 +1,20 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_BREWERY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_BREWERY_H_
/**
* @file data_model/generated_brewery.h
* @brief Helper struct to store generated brewery data.
*/
#include "data_model/brewery_result.h"
#include "data_model/location.h"
/**
* @brief Helper struct to store generated brewery data.
*/
struct GeneratedBrewery {
Location location;
BreweryResult brewery;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_BREWERY_H_

View File

@@ -1,66 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_MODELS_H_
/**
* @file data_model/generated_models.h
* @brief Generated output models from the pipeline: brewery/user results, enriched data,
* and complete generation results.
*/
#include <string>
#include "data_model/models.h"
// ============================================================================
// Generation Output Models
// ============================================================================
/**
* @brief Generated brewery payload.
*/
struct BreweryResult {
/// @brief Brewery display name in English.
std::string name_en;
/// @brief Brewery description text in English.
std::string description_en;
/// @brief Brewery display name in the local language.
std::string name_local;
/// @brief Brewery description text in the local language.
std::string description_local;
};
/**
* @brief Generated user profile payload.
*/
struct UserResult {
/// @brief Username handle.
std::string username{};
/// @brief Short user biography.
std::string bio{};
};
// ============================================================================
// Pipeline Data Models
// ============================================================================
/**
* @brief Enriched city data with Wikipedia context.
*/
struct EnrichedCity {
Location location;
std::string region_context{};
};
/**
* @brief Helper struct to store generated brewery data.
*/
struct GeneratedBrewery {
Location location;
BreweryResult brewery;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_MODELS_H_

View File

@@ -0,0 +1,13 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATION_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATION_MODELS_H_
/**
* @file data_model/generation_models.h
* @brief Convenience include for shared generation payload models.
*/
#include "data_model/brewery_location.h"
#include "data_model/brewery_result.h"
#include "data_model/user_result.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATION_MODELS_H_

View File

@@ -0,0 +1,41 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_LOCATION_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_LOCATION_H_
/**
* @file data_model/location.h
* @brief Location data model used throughout generation pipeline.
*/
#include <string>
#include <vector>
/**
* @brief Canonical location record for city-level generation.
*/
struct Location {
/// @brief City name.
std::string city{};
/// @brief State or province name.
std::string state_province{};
/// @brief ISO 3166-2 subdivision code.
std::string iso3166_2{};
/// @brief Country name.
std::string country{};
/// @brief ISO 3166-1 country code.
std::string iso3166_1{};
/// @brief Local language codes in priority order.
std::vector<std::string> local_languages{};
/// @brief Latitude in decimal degrees.
double latitude{};
/// @brief Longitude in decimal degrees.
double longitude{};
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_LOCATION_H_
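Since `local_languages` is documented as being in priority order, a consumer can treat the first entry as the preferred language. A minimal sketch follows, assuming a fallback of `"en"` when the list is empty. `PrimaryLanguage` and its fallback are hypothetical, and `Location` is trimmed to the fields used here.

```cpp
#include <string>
#include <vector>

// Trimmed stand-in for Location with only the fields used below.
struct Location {
  std::string city{};
  std::vector<std::string> local_languages{};
};

// Hypothetical helper: the highest-priority local language, or `fallback`
// when the location lists none.
std::string PrimaryLanguage(const Location& location,
                            const std::string& fallback = "en") {
  return location.local_languages.empty() ? fallback
                                          : location.local_languages.front();
}
```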

View File

@@ -1,145 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_MODELS_H_
/**
* @file data_model/models.h
* @brief Core data models: locations, application configuration, and generation
* inputs.
*/
#include <boost/program_options.hpp>
#include <cstdint>
#include <filesystem>
#include <memory>
#include <optional>
#include <string>
#include <string_view>
#include <vector>
class ILogger;
namespace prog_opts = boost::program_options;
// ============================================================================
// Location Models
// ============================================================================
/**
* @brief Canonical location record for city-level generation.
*/
struct Location {
/// @brief City name.
std::string city{};
/// @brief State or province name.
std::string state_province{};
/// @brief ISO 3166-2 subdivision code.
std::string iso3166_2{};
/// @brief Country name.
std::string country{};
/// @brief ISO 3166-1 country code.
std::string iso3166_1{};
/// @brief Local language codes in priority order.
std::vector<std::string> local_languages{};
/// @brief Latitude in decimal degrees.
double latitude{};
/// @brief Longitude in decimal degrees.
double longitude{};
};
/**
* @brief Non-owning brewery location input.
*/
struct BreweryLocation {
/// @brief City name.
std::string_view city_name;
/// @brief Country name.
std::string_view country_name;
};
// ============================================================================
// Configuration Models
// ============================================================================
/**
* @brief LLM sampling parameters.
*/
struct SamplingOptions {
/// @brief LLM sampling temperature (0.0 to 1.0, higher = more random).
float temperature = 1.0F;
/// @brief LLM nucleus sampling top-p parameter.
float top_p = 0.95F;
/// @brief LLM top-k sampling parameter.
uint32_t top_k = 64;
/// @brief Context window size (tokens).
uint32_t n_ctx = 8192;
/// @brief Random seed (-1 for random, otherwise non-negative).
int seed = -1;
/// @brief Number of layers to offload to GPU.
int n_gpu_layers = 0;
};
/**
* @brief Configuration for the LLM generator component.
*/
struct GeneratorOptions {
/// @brief Path to the LLM model file (gguf format).
std::filesystem::path model_path;
/// @brief Use mocked generator instead of actual LLM inference.
bool use_mocked = false;
/// @brief Specific sampling parameters for this generator.
/// If nullopt, the application should use global defaults.
std::optional<SamplingOptions> sampling;
};
/**
* @brief Configuration for the pipeline execution and output.
*/
struct PipelineOptions {
/// @brief Directory for generated artifacts.
std::filesystem::path output_path;
/// @brief Directory that contains named prompt files (e.g.
/// BREWERY_GENERATION.md).
std::filesystem::path prompt_dir;
/// @brief Path for application logs.
std::filesystem::path log_path;
/// @brief Number of locations to sample from the dataset.
/// More locations yield more users and more breweries.
uint32_t location_count;
};
/**
* @brief Root configuration object for the Biergarten pipeline.
*/
struct ApplicationOptions {
GeneratorOptions generator;
PipelineOptions pipeline;
};
// ============================================================================
// Function Declarations
// ============================================================================
std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv,
std::shared_ptr<ILogger> logger = nullptr);
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_MODELS_H_

View File

@@ -0,0 +1,12 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_PIPELINE_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_PIPELINE_MODELS_H_
/**
* @file data_model/pipeline_models.h
* @brief Convenience include for pipeline-specific data models.
*/
#include "data_model/enriched_city.h"
#include "data_model/generated_brewery.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_PIPELINE_MODELS_H_

View File

@@ -0,0 +1,22 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_USER_RESULT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_USER_RESULT_H_
/**
* @file data_model/user_result.h
* @brief Generated user profile payload.
*/
#include <string>
/**
* @brief Generated user profile payload.
*/
struct UserResult {
/// @brief Username handle.
std::string username{};
/// @brief Short user biography.
std::string bio{};
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_USER_RESULT_H_

View File

@@ -7,19 +7,16 @@
 */
 #include <filesystem>
-#include <memory>
 #include <vector>
-#include "data_model/models.h"
-#include "services/logging/logger.h"
+#include "data_model/location.h"
 /// @brief Loads curated world locations from a JSON file into memory.
 class JsonLoader {
 public:
 /// @brief Parses a JSON array file and returns all location records.
 static std::vector<Location> LoadLocations(
-const std::filesystem::path& filepath,
-std::shared_ptr<ILogger> logger = nullptr);
+const std::filesystem::path& filepath);
 };
 #endif // BIERGARTEN_PIPELINE_INCLUDES_JSON_HANDLING_JSON_LOADER_H_

View File

@@ -1,109 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_JSON_HANDLING_PRETTY_PRINT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_JSON_HANDLING_PRETTY_PRINT_H_
/**
* @file json_handling/pretty_print.h
* @brief Pretty-printing utilities for JSON values.
*
* Provides formatting capability for boost::json::value with indentation and
* readable output. Adapted from Boost JSON library examples.
*/
#include <boost/json.hpp>
#include <iterator>
#include <ostream>
#include <string>
/**
* @brief Pretty-prints a JSON value to an output stream with indentation.
*
* Recursively formats JSON objects and arrays with consistent 4-space
* indentation. Adapted from:
* https://raw.githubusercontent.com/boostorg/json/refs/heads/develop/example/pretty.cpp
*
* @param outstream Output stream to write formatted JSON.
* @param json_val JSON value to format.
* @param indent Optional indentation string (managed internally on first call).
*/
inline void PrettyPrint(std::ostream& outstream,
boost::json::value const& json_val,
std::string* indent = nullptr) {
std::string str;
if (indent == nullptr) {
indent = &str;
}
switch (json_val.kind()) {
case boost::json::kind::object: {
outstream << "{\n";
indent->append(4, ' ');
auto const& obj = json_val.get_object();
if (!obj.empty()) {
const auto* iter = obj.begin();
for (;;) {
outstream << *indent << boost::json::serialize(iter->key()) << " : ";
PrettyPrint(outstream, iter->value(), indent);
iter = std::next(iter);
if (iter == obj.end()) {
break;
}
outstream << ",\n";
}
}
outstream << "\n";
indent->resize(indent->size() - 4);
outstream << *indent << "}";
break;
}
case boost::json::kind::array: {
outstream << "[\n";
indent->append(4, ' ');
auto const& arr = json_val.get_array();
if (!arr.empty()) {
const auto* iter = arr.begin();
for (;;) {
outstream << *indent;
PrettyPrint(outstream, *iter, indent);
iter = std::next(iter);
if (iter == arr.end()) {
break;
}
outstream << ",\n";
}
}
outstream << "\n";
indent->resize(indent->size() - 4);
outstream << *indent << "]";
break;
}
case boost::json::kind::string: {
outstream << serialize(json_val.get_string());
break;
}
case boost::json::kind::uint64:
case boost::json::kind::int64:
case boost::json::kind::double_:
outstream << json_val;
break;
case boost::json::kind::bool_:
if (json_val.get_bool()) {
outstream << "true";
} else {
outstream << "false";
}
break;
case boost::json::kind::null:
outstream << "null";
break;
}
if (indent->empty()) {
outstream << "\n";
}
}
#endif // BIERGARTEN_PIPELINE_INCLUDES_JSON_HANDLING_PRETTY_PRINT_H_

View File

@@ -1,10 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_HELPERS_H_
/* Umbrella header for backward compatibility. */
#include "sqlite_connection_helpers.h"
#include "sqlite_handle_types.h"
#include "sqlite_statement_helpers.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_HELPERS_H_

View File

@@ -1,5 +1,5 @@
-#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_DATE_TIME_PROVIDER_H_
-#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_DATE_TIME_PROVIDER_H_
+#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATE_TIME_PROVIDER_H_
+#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATE_TIME_PROVIDER_H_
 /**
 * @file services/date_time_provider.h
@@ -63,4 +63,4 @@ class SystemDateTimeProvider final : public IDateTimeProvider {
 }
 };
-#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_DATE_TIME_PROVIDER_H_
+#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATE_TIME_PROVIDER_H_

View File

@@ -1,35 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_TIMER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_TIMER_H_
#include <chrono>
#include <cstdint>
/**
* @file services/timer.h
* @brief Simple timer utility for measuring elapsed time.
*/
class Timer {
std::chrono::steady_clock::time_point start_time =
std::chrono::steady_clock::now();
public:
Timer(const Timer&) = delete;
Timer& operator=(const Timer&) = delete;
Timer(Timer&&) = delete;
Timer& operator=(Timer&&) = delete;
Timer() = default;
~Timer() = default;
[[nodiscard]] int64_t Elapsed() const {
return std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::steady_clock::now() - start_time)
.count();
}
[[nodiscard]] int64_t Reset() {
auto previous_elapsed = Elapsed();
start_time = std::chrono::steady_clock::now();
return previous_elapsed;
}
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_TIMER_H_
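
A short usage sketch for the removed `Timer` may help when reviewing call sites that depended on it. The class is copied from the header above, with `<cstdint>` added for the fixed-width return type. `MeasureMs` is a hypothetical helper introduced only for illustration and is not part of the repository.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <utility>

// Copy of the Timer from the header above, using std::int64_t explicitly.
class Timer {
  std::chrono::steady_clock::time_point start_time =
      std::chrono::steady_clock::now();

 public:
  Timer(const Timer&) = delete;
  Timer& operator=(const Timer&) = delete;
  Timer(Timer&&) = delete;
  Timer& operator=(Timer&&) = delete;
  Timer() = default;
  ~Timer() = default;

  // Milliseconds since construction or the last Reset().
  [[nodiscard]] std::int64_t Elapsed() const {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - start_time)
        .count();
  }

  // Returns the elapsed milliseconds, then restarts the measurement.
  [[nodiscard]] std::int64_t Reset() {
    auto previous_elapsed = Elapsed();
    start_time = std::chrono::steady_clock::now();
    return previous_elapsed;
  }
};

// Hypothetical helper: runs `work` once and returns its duration in ms.
template <typename Fn>
std::int64_t MeasureMs(Fn&& work) {
  Timer timer;
  std::forward<Fn>(work)();
  return timer.Elapsed();
}
```

Because the timer reads `std::chrono::steady_clock`, the measured values cannot go negative when the system clock is adjusted. That is presumably why it was preferred over `system_clock` for elapsed-time measurement.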

View File

@@ -1,17 +0,0 @@
//
// Created by aaronpo on 13/05/2026.
//
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_MOCK_ENRICHMENT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_MOCK_ENRICHMENT_H_
#include <string>
#include "enrichment_service.h"
class MockEnrichmentService final : public IEnrichmentService {
public:
std::string GetLocationContext(const Location& /*loc*/) override {
return {};
}
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_MOCK_ENRICHMENT_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_ENRICHMENT_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_ENRICHMENT_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_SERVICE_H_
/** /**
* @file services/enrichment_service.h * @file services/enrichment_service.h
@@ -8,7 +8,7 @@
#include <string> #include <string>
#include "data_model/models.h" #include "data_model/location.h"
/** /**
* @brief Interface for services that can enrich a location with context. * @brief Interface for services that can enrich a location with context.
@@ -27,4 +27,4 @@ class IEnrichmentService {
virtual std::string GetLocationContext(const Location& loc) = 0; virtual std::string GetLocationContext(const Location& loc) = 0;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_ENRICHMENT_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_SERVICE_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_EXPORT_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_EXPORT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_EXPORT_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_EXPORT_SERVICE_H_
/** /**
* @file services/export_service.h * @file services/export_service.h
@@ -8,7 +8,7 @@
#include <cstdint> #include <cstdint>
#include "data_model/generated_models.h" #include "data_model/generated_brewery.h"
/** /**
* @brief Interface for services that persist generated brewery records. * @brief Interface for services that persist generated brewery records.
@@ -39,4 +39,4 @@ class IExportService {
virtual void Finalize() = 0; virtual void Finalize() = 0;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_EXPORT_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_EXPORT_SERVICE_H_

View File

@@ -1,53 +0,0 @@
/**
* @file services/logging/log_dispatcher.h
* @brief Dedicated log dispatcher for asynchronous pipeline logging.
*
* The dispatcher drains LogEntry values from a bounded channel and forwards
* them to spdlog on a dedicated thread.
*/
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOG_DISPATCHER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOG_DISPATCHER_H_
#include <spdlog/spdlog.h>
#include "concurrency/bounded_channel.h"
#include "services/logging/log_entry.h"
/**
* @class LogDispatcher
* @brief Consumes log entries from a channel and forwards them to spdlog.
*
* Non-copyable and non-movable. Intended to run on its own dedicated thread
* and exit once the channel has been closed and drained.
*/
class LogDispatcher {
public:
/**
* @brief Construct a log dispatcher.
*
* @param channel Reference to the bounded channel used for log retrieval.
*/
explicit LogDispatcher(BoundedChannel<LogEntry>& channel);
LogDispatcher(const LogDispatcher&) = delete;
LogDispatcher& operator=(const LogDispatcher&) = delete;
LogDispatcher(LogDispatcher&&) = delete;
LogDispatcher& operator=(LogDispatcher&&) = delete;
~LogDispatcher() = default;
/**
* @brief Drain the channel and forward entries to spdlog.
*
* Intended to be called once on a dedicated thread. The loop returns after
* the channel has been closed and all queued entries have been processed.
*/
void Run();
private:
BoundedChannel<LogEntry>& channel_;
static spdlog::level::level_enum ToSpdlogLevel(LogLevel level);
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOG_DISPATCHER_H_
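
The "close, then drain" contract described for `LogDispatcher::Run()` can be sketched as follows. The real `BoundedChannel` interface is not shown in this diff, so `SketchChannel`, `DrainAll`, and `RunDispatchDemo` are hypothetical stand-ins, and plain strings replace `LogEntry`. This is a minimal sketch under those assumptions, not the removed implementation.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for BoundedChannel<T>.
template <typename T>
class SketchChannel {
 public:
  explicit SketchChannel(std::size_t capacity) : capacity_(capacity) {}

  // Blocks while the queue is full (backpressure); false once closed.
  bool Push(T value) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [&] { return closed_ || queue_.size() < capacity_; });
    if (closed_) return false;
    queue_.push_back(std::move(value));
    not_empty_.notify_one();
    return true;
  }

  // Blocks while empty; std::nullopt only after Close() and a full drain.
  std::optional<T> Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [&] { return closed_ || !queue_.empty(); });
    if (queue_.empty()) return std::nullopt;
    T value = std::move(queue_.front());
    queue_.pop_front();
    not_full_.notify_one();
    return value;
  }

  void Close() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      closed_ = true;
    }
    not_empty_.notify_all();
    not_full_.notify_all();
  }

 private:
  std::size_t capacity_;
  std::mutex mutex_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
  std::deque<T> queue_;
  bool closed_ = false;
};

// Dispatcher loop: consume until the channel is closed and empty.
std::vector<std::string> DrainAll(SketchChannel<std::string>& channel) {
  std::vector<std::string> seen;
  while (auto entry = channel.Pop()) seen.push_back(std::move(*entry));
  return seen;
}

// Runs the loop on a dedicated thread, pushes `count` entries, then closes.
std::vector<std::string> RunDispatchDemo(std::size_t count) {
  SketchChannel<std::string> channel(2);  // small capacity forces blocking
  std::vector<std::string> seen;
  std::thread consumer([&] { seen = DrainAll(channel); });
  for (std::size_t i = 0; i < count; ++i) {
    channel.Push("entry " + std::to_string(i));
  }
  channel.Close();
  consumer.join();
  return seen;
}
```

Entries queued before `Close()` are still delivered, so closing the channel during teardown does not drop pending log lines.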

View File

@@ -1,88 +0,0 @@
/**
* @file services/logging/log_entry.h
* @brief Structured log record shared by the pipeline logging infra.
*
* LogEntry is a lightweight value type that can be passed safely between the
* logging producer and dispatcher through BoundedChannel<LogEntry>.
*/
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOG_ENTRY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOG_ENTRY_H_
#include <chrono>
#include <source_location>
#include <string>
#include <thread>
#include <vector>
/**
* @enum LogLevel
* @brief Severity levels supported by the logging infra.
*/
enum class LogLevel {
Debug, ///< Development/debugging information.
Info, ///< General informational messages.
Warn, ///< Warning conditions.
Error, ///< Error conditions.
};
/**
* @enum PipelinePhase
* @brief Pipeline execution phases used to tag log records.
*
* The phase tag makes it easier to correlate log output with the part of the
* pipeline that emitted it.
*/
enum class PipelinePhase {
Startup, ///< Initialization and validation.
UserGeneration, ///< User profile generation.
BreweryAndBeerGeneration, ///< Brewery and beer data generation.
CheckinGeneration, ///< Checkin (visit) record generation.
RatingGeneration, ///< Rating and review generation.
FollowGeneration, ///< Follow relationship generation.
Teardown, ///< Finalization and cleanup.
};
/**
* @struct LogDTO
* @brief User-provided subset of log fields; ILogger::Log() adds the call-site metadata.
*/
struct LogDTO {
LogLevel level;
PipelinePhase phase;
std::string message;
};
/**
* @struct LogEntry
* @brief Single structured log event.
*
* All fields are value types, which keeps transfer across the bounded channel
* simple and avoids shared ownership.
*
* NOTE: timestamp, thread_id, and origin must be populated by ILogger::Log()
* before the entry is dispatched.
*/
struct LogEntry {
/// @brief Timestamp when the entry was created.
std::chrono::system_clock::time_point timestamp{};
/// @brief Source location where the log call was made.
std::source_location origin{};
/// @brief Thread responsible for emitting the log.
std::thread::id thread_id{};
/// @brief Severity level of this entry.
LogLevel level;
/// @brief Pipeline phase associated with the entry.
PipelinePhase phase;
/// @brief Log message text.
std::string message;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOG_ENTRY_H_

View File

@@ -1,53 +0,0 @@
/**
* @file services/logging/log_producer.h
* @brief Channel-backed log producer for asynchronous pipeline logging.
*
* The producer captures log records from application code and forwards them to
* a bounded channel for later processing by the dispatcher.
*/
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_CHANNEL_LOGGER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_CHANNEL_LOGGER_H_
#include <string_view>
#include "concurrency/bounded_channel.h"
#include "services/logging/log_entry.h"
#include "services/logging/logger.h"
/**
* @class LogProducer
* @brief ILogger implementation that forwards entries to a bounded channel.
*
* Non-copyable and non-movable. The channel reference is non-owning and must
* remain valid for the lifetime of the producer.
*/
class LogProducer final : public ILogger {
public:
/**
* @brief Construct a channel-backed producer.
*
* @param channel Reference to the bounded channel used for log transfer.
*/
explicit LogProducer(BoundedChannel<LogEntry>& channel);
LogProducer(const LogProducer&) = delete;
LogProducer& operator=(const LogProducer&) = delete;
LogProducer(LogProducer&&) = delete;
LogProducer& operator=(LogProducer&&) = delete;
~LogProducer() override = default;
/**
* @brief Queue a log message for asynchronous processing.
*
* Blocks while the channel applies backpressure. This blocking behavior
* under heavy load is an accepted trade-off for simplicity.
*/
void DoLog(LogEntry log_entry) override;
private:
BoundedChannel<LogEntry>& channel_;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_CHANNEL_LOGGER_H_

View File

@@ -1,64 +0,0 @@
/**
* @file services/logging/logger.h
* @brief Abstract logging interface used by pipeline components.
*
* The interface keeps application code independent from the concrete logging
* transport, buffering, and formatting implementation.
*/
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOGGER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOGGER_H_
#include <source_location>
#include <string>
#include <utility>
#include "services/logging/log_entry.h"
/**
* @class ILogger
* @brief Minimal interface for submitting structured log messages.
*
* Implementations are non-copyable and non-movable. They are typically owned
* by the composition root and injected into services that emit diagnostics.
*/
class ILogger {
public:
ILogger() = default;
ILogger(const ILogger&) = delete;
ILogger& operator=(const ILogger&) = delete;
ILogger(ILogger&&) = delete;
ILogger& operator=(ILogger&&) = delete;
virtual ~ILogger() = default;
/**
* @brief Submit a log message to the logging subsystem.
*
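* @param payload User-provided log data (level, phase, message).
* @param origin Auto-captured source location of the call site.
* @param timestamp Auto-captured wall-clock time of the call.
* @param thread_id Auto-captured ID of the calling thread.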
*/
void Log(LogDTO payload,
std::source_location origin = std::source_location::current(),
std::chrono::system_clock::time_point timestamp = std::chrono::system_clock::now(),
std::thread::id thread_id = std::this_thread::get_id()) {
LogEntry entry;
entry.timestamp = timestamp;
entry.thread_id = thread_id;
entry.level = payload.level;
entry.phase = payload.phase;
entry.message = std::move(payload.message);
entry.origin = origin;
DoLog(std::move(entry));
}
protected:
/**
* @brief Underlying implementation to transport the log entry.
*
* Implementations must be thread-safe as DoLog can be called concurrently
* from multiple worker threads.
*/
virtual void DoLog(LogEntry log_entry) = 0;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOGGER_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPTING_PROMPT_DIRECTORY_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPT_DIRECTORY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPTING_PROMPT_DIRECTORY_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPT_DIRECTORY_H_
/** /**
* @file services/prompt_directory.h * @file services/prompt_directory.h
@@ -12,14 +12,11 @@
*/ */
#include <filesystem> #include <filesystem>
#include <memory>
#include <stdexcept> #include <stdexcept>
#include <string> #include <string>
#include <string_view> #include <string_view>
#include <unordered_map> #include <unordered_map>
#include "services/logging/logger.h"
/** /**
* @brief Interface for loading named prompt files. * @brief Interface for loading named prompt files.
*/ */
@@ -59,8 +56,6 @@ class PromptDirectory final : public IPromptDirectory {
* directory. * directory.
*/ */
explicit PromptDirectory(const std::filesystem::path& prompt_dir); explicit PromptDirectory(const std::filesystem::path& prompt_dir);
PromptDirectory(const std::filesystem::path& prompt_dir,
std::shared_ptr<ILogger> logger);
/** /**
* @brief Loads the prompt for @p key, caching the result. * @brief Loads the prompt for @p key, caching the result.
@@ -75,8 +70,7 @@ class PromptDirectory final : public IPromptDirectory {
private: private:
std::filesystem::path prompt_dir_; std::filesystem::path prompt_dir_;
std::shared_ptr<ILogger> logger_;
std::unordered_map<std::string, std::string> cache_; std::unordered_map<std::string, std::string> cache_;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPTING_PROMPT_DIRECTORY_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPT_DIRECTORY_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_CONNECTION_HELPERS_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_CONNECTION_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_CONNECTION_HELPERS_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_CONNECTION_HELPERS_H_
/** /**
* @file services/sqlite_connection_helpers.h * @file services/sqlite_connection_helpers.h
@@ -12,7 +12,7 @@
#include <string> #include <string>
#include <string_view> #include <string_view>
#include "sqlite_handle_types.h" #include "services/sqlite_handle_types.h"
namespace sqlite_export_service_internal { namespace sqlite_export_service_internal {
@@ -27,4 +27,4 @@ void RollbackTransactionNoThrow(const SqliteDatabaseHandle& db_handle) noexcept;
} // namespace sqlite_export_service_internal } // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_CONNECTION_HELPERS_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_CONNECTION_HELPERS_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_H_
/** /**
* @file services/sqlite_export_service.h * @file services/sqlite_export_service.h
@@ -11,10 +11,10 @@
#include <string> #include <string>
#include <unordered_map> #include <unordered_map>
#include "data_model/models.h" #include "data_model/application_options.h"
#include "../datetime/date_time_provider.h" #include "services/date_time_provider.h"
#include "export_service.h" #include "services/export_service.h"
#include "sqlite_export_service_helpers.h" #include "services/sqlite_export_service_helpers.h"
/** /**
* @brief Persists generated brewery records into a fresh SQLite database. * @brief Persists generated brewery records into a fresh SQLite database.
@@ -57,4 +57,4 @@ class SqliteExportService final : public IExportService {
std::unordered_map<std::string, sqlite3_int64> location_cache_; std::unordered_map<std::string, sqlite3_int64> location_cache_;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_H_

View File

@@ -0,0 +1,10 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_HELPERS_H_
/* Umbrella header for backward compatibility. */
#include "services/sqlite_connection_helpers.h"
#include "services/sqlite_handle_types.h"
#include "services/sqlite_statement_helpers.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_HELPERS_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_HANDLE_TYPES_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_HANDLE_TYPES_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_HANDLE_TYPES_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_HANDLE_TYPES_H_
/** /**
* Shared handle and parameter type declarations used by SQLite helper units. * Shared handle and parameter type declarations used by SQLite helper units.
@@ -33,4 +33,4 @@ struct BindParam {
} // namespace sqlite_export_service_internal } // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_HANDLE_TYPES_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_HANDLE_TYPES_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_STATEMENT_HELPERS_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_STATEMENT_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_STATEMENT_HELPERS_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_STATEMENT_HELPERS_H_
/** /**
* @file services/sqlite_statement_helpers.h * @file services/sqlite_statement_helpers.h
@@ -13,7 +13,7 @@
#include <string_view> #include <string_view>
#include <vector> #include <vector>
#include "sqlite_handle_types.h" #include "services/sqlite_handle_types.h"
namespace sqlite_export_service_internal { namespace sqlite_export_service_internal {
@@ -113,4 +113,4 @@ std::string SerializeVector(const std::vector<std::string>& str_vec);
} // namespace sqlite_export_service_internal } // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_STATEMENT_HELPERS_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_STATEMENT_HELPERS_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_WIKIPEDIA_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_WIKIPEDIA_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_WIKIPEDIA_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_WIKIPEDIA_SERVICE_H_
/** /**
* @file services/wikipedia_service.h * @file services/wikipedia_service.h
@@ -11,16 +11,14 @@
#include <string_view> #include <string_view>
#include <unordered_map> #include <unordered_map>
#include "enrichment_service.h" #include "services/enrichment_service.h"
#include "services/logging/logger.h"
#include "web_client/web_client.h" #include "web_client/web_client.h"
/// @brief Provides Wikipedia summary lookups backed by cached raw extracts. /// @brief Provides Wikipedia summary lookups backed by cached raw extracts.
class WikipediaEnrichmentService final : public IEnrichmentService { class WikipediaService final : public IEnrichmentService {
public: public:
/// @brief Creates a new Wikipedia service with the provided web client. /// @brief Creates a new Wikipedia service with the provided web client.
explicit WikipediaEnrichmentService(std::unique_ptr<WebClient> client, explicit WikipediaService(std::unique_ptr<WebClient> client);
std::shared_ptr<ILogger> logger);
/// @brief Returns the Wikipedia-derived context for a location. /// @brief Returns the Wikipedia-derived context for a location.
[[nodiscard]] std::string GetLocationContext(const Location& loc) override; [[nodiscard]] std::string GetLocationContext(const Location& loc) override;
@@ -28,9 +26,8 @@ class WikipediaEnrichmentService final : public IEnrichmentService {
private: private:
std::string FetchExtract(std::string_view query); std::string FetchExtract(std::string_view query);
std::unique_ptr<WebClient> client_; std::unique_ptr<WebClient> client_;
std::shared_ptr<ILogger> logger_;
/// @brief Canonical cache for raw Wikipedia query extracts. /// @brief Canonical cache for raw Wikipedia query extracts.
std::unordered_map<std::string, std::string> extract_cache_; std::unordered_map<std::string, std::string> extract_cache_;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_WIKIPEDIA_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_WIKIPEDIA_SERVICE_H_

View File

@@ -0,0 +1,54 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_CURL_WEB_CLIENT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_CURL_WEB_CLIENT_H_
/**
* @file web_client/curl_web_client.h
* @brief libcurl-based WebClient implementation.
*/
#include "web_client/web_client.h"
/**
* @brief RAII wrapper for curl_global_init and curl_global_cleanup.
*
* Create one instance in application startup before using libcurl and keep it
* alive for application lifetime.
*/
class CurlGlobalState {
public:
/// @brief Initializes global libcurl state.
CurlGlobalState();
/// @brief Cleans up global libcurl state.
~CurlGlobalState();
/// @brief Non-copyable type.
CurlGlobalState(const CurlGlobalState&) = delete;
/// @brief Non-copyable type.
CurlGlobalState& operator=(const CurlGlobalState&) = delete;
};
/**
* @brief WebClient implementation backed by libcurl.
*/
class CURLWebClient : public WebClient {
public:
/**
* @brief Executes an HTTP GET request.
*
* @param url Request URL.
* @return Response body.
*/
std::string Get(const std::string& url) override;
/**
* @brief URL-encodes a string value.
*
* @param value Raw value.
* @return URL-encoded string.
*/
std::string UrlEncode(const std::string& value) override;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_CURL_WEB_CLIENT_H_

View File

@@ -1,56 +0,0 @@
/**
* @file web_client/http_web_client.h
* @brief cpp-httplib implementation of the WebClient interface.
*/
#ifndef BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_HTTP_WEB_CLIENT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_HTTP_WEB_CLIENT_H_
#include "web_client/web_client.h"
#include "services/logging/logger.h"
#include <memory>
#include <string>
#include <utility>
/**
* @brief WebClient implementation backed by cpp-httplib.
*
* Supports HTTP and HTTPS (requires OpenSSL; see HTTPLIB_REQUIRE_OPENSSL
* in CMakeLists.txt).
*
* URL parsing splits a full URL into origin (scheme://host[:port]) and
* path + query so that httplib::Client can be constructed correctly.
* A new client instance is created per request because the client is
* bound to a single origin at construction time.
*/
class HttpWebClient final : public WebClient {
public:
explicit HttpWebClient(std::shared_ptr<ILogger> logger)
: logger_(std::move(logger)) {}
~HttpWebClient() override = default;
/**
* @brief Executes a blocking HTTP/HTTPS GET request against a full URL.
*
* @param url Fully-qualified URL, e.g. "https://en.wikipedia.org/api/rest_v1/page/summary/Berlin"
* @return Response body on HTTP 2xx; throws std::runtime_error otherwise.
*/
std::string Get(const std::string& url) override;
/**
* @brief Percent-encodes a single URI component (query parameter value or
* path segment). Delegates to httplib::encode_uri_component().
*
* @param value Raw string to encode.
* @return Percent-encoded string safe for use in a URL.
*/
std::string EncodeURL(const std::string& value) override;
private:
std::shared_ptr<ILogger> logger_;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_HTTP_WEB_CLIENT_H_
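
The class comment describes splitting a full URL into an origin and a path-plus-query before constructing a per-origin client. The removed implementation is not part of this diff, so the following is only a sketch of that split. `SplitUrl` and `SplitOriginAndTarget` are hypothetical names, and fragments and userinfo are deliberately not handled.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical result type: the two pieces a per-origin client would need.
struct SplitUrl {
  std::string origin;  // scheme://host[:port]
  std::string target;  // path plus optional query, always starting with '/'
};

// Splits an absolute URL at the first '/' or '?' after the authority.
// Assumption: fragments and userinfo are not handled in this sketch.
SplitUrl SplitOriginAndTarget(std::string_view url) {
  constexpr std::string_view kSchemeSeparator = "://";
  const auto scheme_end = url.find(kSchemeSeparator);
  if (scheme_end == std::string_view::npos || scheme_end == 0) {
    throw std::invalid_argument("URL has no scheme: " + std::string(url));
  }
  const auto authority_start = scheme_end + kSchemeSeparator.size();
  const auto target_start = url.find_first_of("/?", authority_start);
  if (target_start == std::string_view::npos) {
    // No path or query: request the root of the origin.
    return {std::string(url), "/"};
  }
  std::string target(url.substr(target_start));
  if (target.front() == '?') {
    // A query directly after the host implies an empty path.
    target.insert(target.begin(), '/');
  }
  return {std::string(url.substr(0, target_start)), std::move(target)};
}
```

Searching for the path only after the `://` separator keeps the two slashes of the scheme from being mistaken for the start of the path.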

View File

@@ -30,7 +30,7 @@ class WebClient {
* @param value Raw string value. * @param value Raw string value.
* @return Encoded value safe for URL usage. * @return Encoded value safe for URL usage.
*/ */
virtual std::string EncodeURL(const std::string& value) = 0; virtual std::string UrlEncode(const std::string& value) = 0;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_WEB_CLIENT_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_WEB_CLIENT_H_

View File

@@ -1,9 +0,0 @@
# Ignore model files!
*.gguf
*.bin
models/
weights/
# Ignore local build folders
build/
.git/

View File

@@ -1,72 +0,0 @@
# --- Stage 1: Build Environment (The "Heavy" Stage) ---
FROM nvidia/cuda:12.6.3-devel-ubuntu24.04 AS builder
ENV DEBIAN_FRONTEND=noninteractive \
CMAKE_GENERATOR=Ninja
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential ca-certificates curl git libboost-json-dev \
libboost-program-options-dev libssl-dev ninja-build pkg-config zlib1g-dev \
&& rm -rf /var/lib/apt/lists/*
# Install modern CMake
RUN curl -L https://github.com/Kitware/CMake/releases/download/v3.31.0/cmake-3.31.0-linux-x86_64.sh -o cmake.sh && \
sh cmake.sh --skip-license --prefix=/usr/local && rm cmake.sh
# Get headers for C++ build
RUN curl -L https://github.com/ggml-org/llama.cpp/archive/refs/tags/b9012.tar.gz -o /tmp/llama-src.tar.gz && \
tar -xzf /tmp/llama-src.tar.gz -C /tmp && \
cp -r /tmp/llama.cpp-b9012/include/* /usr/local/include/ && \
cp -r /tmp/llama.cpp-b9012/ggml/include/* /usr/local/include/
# Pull llama.cpp binaries to use during build if needed
COPY --from=ghcr.io/ggml-org/llama.cpp:full-cuda /app/lib*.so* /usr/local/lib/
WORKDIR /app
COPY . .
# Build the C++ pipeline
RUN cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && \
cmake --build build -j$(nproc)
# --- Stage 2: Runtime Environment (The "Slim" Stage) ---
FROM nvidia/cuda:12.6.3-runtime-ubuntu24.04 AS runtime
# Install only necessary runtime shared libraries
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
ca-certificates \
libboost-json1.83.0 \
libboost-program-options1.83.0 \
libgomp1 \
libssl3 \
zlib1g \
&& rm -rf /var/lib/apt/lists/*
ENV APP_ROOT=/app \
LD_LIBRARY_PATH="/usr/local/lib:${LD_LIBRARY_PATH}"
WORKDIR /app/build
# Copy only the compiled binaries from the builder
COPY --from=builder /app/build/biergarten-pipeline ./
# Copy required config files
COPY locations.json /app/build/
COPY beer-styles.json /app/build/
# Copy prompt templates
COPY prompts /app/prompts
# Copy the llama.cpp shared libraries from the upstream CUDA image
COPY --from=ghcr.io/ggml-org/llama.cpp:full-cuda /app/lib*.so* /usr/local/lib/
# Co-locate plugins
RUN cp /usr/local/lib/libggml-cuda.so . 2>/dev/null || true && \
cp /usr/local/lib/libggml-cpu*.so . 2>/dev/null || true
# Setup Start Script
COPY ./runpod/start.sh /usr/local/bin/biergarten-start
RUN chmod +x /usr/local/bin/biergarten-start
ENTRYPOINT ["/usr/local/bin/biergarten-start"]

View File

@@ -1,8 +0,0 @@
```bash
touch runpod/start.sh
docker build \
--progress=plain \
-t biergarten-pipeline:latest \
-f runpod/Dockerfile \
. 2>&1 | tee build.log
```

View File

@@ -1,22 +0,0 @@
name: biergarten-pipeline-live
imageName: biergarten-pipeline:latest
category: NVIDIA
containerDiskInGb: 50
volumeInGb: 50
volumeMountPath: /workspace
dockerEntrypoint:
- /usr/local/bin/biergarten-start
dockerStartCmd: []
isPublic: false
isServerless: false
env:
BIERGARTEN_MODE: live
BIERGARTEN_MODEL_PATH: /workspace/models/google_gemma-4-E4B-it-Q6_K.gguf
BIERGARTEN_PROMPT_DIR: /workspace/app/build/prompts
BIERGARTEN_OUTPUT_DIR: /workspace/output
BIERGARTEN_LOG_PATH: /workspace/logs/pipeline.log
BIERGARTEN_TEMPERATURE: "1.0"
BIERGARTEN_TOP_P: "0.95"
BIERGARTEN_TOP_K: "64"
BIERGARTEN_N_CTX: "8192"
BIERGARTEN_SEED: "-1"

View File

@@ -1,58 +0,0 @@
#!/bin/bash
set -e
MODEL_PATH="${BIERGARTEN_MODEL_PATH:-/workspace/models/google_gemma-4-E4B-it-Q6_K.gguf}"
OUTPUT_DIR="${BIERGARTEN_OUTPUT_DIR:-/workspace/output}"
LOG_PATH="${BIERGARTEN_LOG_PATH:-/workspace/logs/pipeline.log}"
EXECUTABLE="/app/build/biergarten-pipeline"
PROMPT_DIR="/app/prompts"
echo "--- Starting Biergarten Pipeline Environment Check ---"
# Ensure directories exist
mkdir -p "$OUTPUT_DIR"
mkdir -p "$(dirname "$LOG_PATH")"
mkdir -p "$(dirname "$MODEL_PATH")"
# Download model if missing
if [ ! -f "$MODEL_PATH" ]; then
echo "Model not found. Downloading (this may take a while)..."
curl -L -C - \
-o "$MODEL_PATH" \
"https://huggingface.co/bartowski/google_gemma-4-E4B-it-GGUF/resolve/main/google_gemma-4-E4B-it-Q6_K.gguf?download=true"
echo "Download complete."
fi
# Verify model exists
if [ ! -f "$MODEL_PATH" ]; then
echo "ERROR: Model still not found after download attempt."
exit 1
fi
# Default GPU layers
GL_LAYERS="${BIERGARTEN_GL_LAYERS:-40}"
# Build args
ARGS=(
"--model" "$MODEL_PATH"
"--prompt-dir" "$PROMPT_DIR"
"--output" "$OUTPUT_DIR"
"--log-path" "$LOG_PATH"
"--n-gpu-layers" "$GL_LAYERS"
)
# Optional params
[[ -n "$BIERGARTEN_TEMPERATURE" ]] && ARGS+=("--temperature" "$BIERGARTEN_TEMPERATURE")
[[ -n "$BIERGARTEN_TOP_P" ]] && ARGS+=("--top-p" "$BIERGARTEN_TOP_P")
[[ -n "$BIERGARTEN_TOP_K" ]] && ARGS+=("--top-k" "$BIERGARTEN_TOP_K")
[[ -n "$BIERGARTEN_N_CTX" ]] && ARGS+=("--n-ctx" "$BIERGARTEN_N_CTX")
[[ -n "$BIERGARTEN_SEED" ]] && ARGS+=("--seed" "$BIERGARTEN_SEED")
# Extra args
[[ -n "$BIERGARTEN_EXTRA_ARGS" ]] && ARGS+=($BIERGARTEN_EXTRA_ARGS)
echo "--- Executing: $EXECUTABLE ${ARGS[*]} ---"
exec "$EXECUTABLE" "${ARGS[@]}"

View File

@@ -1,214 +0,0 @@
#include <chrono>
#include <format>
#include <iostream>
#include <optional>
#include <sstream>
#include <string>
#include "data_model/models.h"
#include "services/logging/logger.h"
std::optional<ApplicationOptions> ParseArguments(
const int argc, char** argv, std::shared_ptr<ILogger> logger) {
prog_opts::options_description desc("Pipeline Options");
auto opt = desc.add_options();
opt("help,h", "Produce help message");
// Defaults sourced from SamplingOptions{} so the CLI and LlamaGenerator
// share a single source of truth — changing the struct updates both.
auto add_sampling_options = [&]() -> void {
const SamplingOptions sampling_defaults{};
opt("temperature",
prog_opts::value<float>()->default_value(sampling_defaults.temperature),
"Sampling temperature (higher = more random)");
opt("top-p",
prog_opts::value<float>()->default_value(sampling_defaults.top_p),
"Nucleus sampling top-p in (0,1] (higher = more random)");
opt("top-k",
prog_opts::value<uint32_t>()->default_value(sampling_defaults.top_k),
"Top-k sampling parameter (higher = more candidate tokens)");
opt("n-ctx",
prog_opts::value<uint32_t>()->default_value(sampling_defaults.n_ctx),
"Context window size in tokens");
opt("seed", prog_opts::value<int>()->default_value(sampling_defaults.seed),
"Sampler seed: -1 for random, otherwise non-negative integer");
opt("n-gpu-layers", prog_opts::value<int>()->default_value(0),
"Number of layers to offload to GPU");
};
// --mocked and --model are mutually exclusive; validation is enforced below
// rather than at registration to produce a clear diagnostic message.
auto add_generator_options = [&]() -> void {
opt("mocked", prog_opts::bool_switch(),
"Use mocked generator for brewery/user data");
opt("model,m", prog_opts::value<std::string>()->default_value(""),
"Path to LLM model (gguf)");
};
auto add_pipeline_options = [&]() -> void {
opt("output,o", prog_opts::value<std::string>()->default_value("output"),
"Directory for generated artifacts");
opt("log-path",
prog_opts::value<std::string>()->default_value("pipeline.log"),
"Path for application logs");
opt("prompt-dir", prog_opts::value<std::string>()->default_value(""),
"Directory containing named prompt files (e.g. BREWERY_GENERATION.md)."
" Required when not using --mocked.");
opt("location-count", prog_opts::value<uint32_t>()->default_value(10));
};
add_sampling_options();
add_generator_options();
add_pipeline_options();
// No flags provided — treat as a help request rather than an error.
if (argc == 1) {
const std::string title = "Biergarten Pipeline";
const std::string usage = ([&] {
std::stringstream usage_stream;
usage_stream << "\nUsage: biergarten-pipeline [options]\n\n" << desc;
return usage_stream.str();
})();
if (logger) {
logger->Log(LogDTO{.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = title});
logger->Log(LogDTO{.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = usage});
}
return std::nullopt;
}
try {
prog_opts::variables_map var_map;
prog_opts::store(prog_opts::parse_command_line(argc, argv, desc), var_map);
prog_opts::notify(var_map);
if (var_map.contains("help")) {
std::stringstream help_stream;
help_stream << "\n" << desc;
if (logger) {
logger->Log(LogDTO{.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = help_stream.str()});
}
return std::nullopt;
}
ApplicationOptions options;
options.pipeline.output_path = var_map["output"].as<std::string>();
options.pipeline.log_path = var_map["log-path"].as<std::string>();
options.pipeline.prompt_dir = var_map["prompt-dir"].as<std::string>();
options.pipeline.location_count = var_map["location-count"].as<uint32_t>();
const bool use_mocked = var_map["mocked"].as<bool>();
const std::string model_path = var_map["model"].as<std::string>();
const int n_gpu_layers = var_map["n-gpu-layers"].as<int>();
// Enforce mutual exclusivity before any further configuration is applied.
if (use_mocked && !model_path.empty()) {
const std::string msg =
"Invalid arguments: --mocked and --model are mutually exclusive";
if (logger) {
logger->Log(LogDTO{.level = LogLevel::Error,
.phase = PipelinePhase::Startup,
.message = msg});
} else {
std::cerr << msg << std::endl;
}
return std::nullopt;
}
if (!use_mocked && model_path.empty()) {
const std::string msg =
"Invalid arguments: either --mocked or --model must be specified";
if (logger) {
logger->Log(LogDTO{.level = LogLevel::Error,
.phase = PipelinePhase::Startup,
.message = msg});
} else {
std::cerr << msg << std::endl;
}
return std::nullopt;
}
// Prompt directory is only meaningful for live inference — the mock
// generator has no use for it and should not require it to be present.
if (!use_mocked && options.pipeline.prompt_dir.empty()) {
const std::string msg =
"Invalid arguments: --prompt-dir is required when not using --mocked";
if (logger) {
logger->Log({.level = LogLevel::Error,
.phase = PipelinePhase::Startup,
.message = msg});
} else {
std::cerr << msg << std::endl;
}
return std::nullopt;
}
options.generator.use_mocked = use_mocked;
options.generator.model_path = model_path;
// options.generator.n_gpu_layers = n_gpu_layers;
// Only populate sampling config when the user explicitly overrides at
// least one value. Leaving it as std::nullopt lets LlamaGenerator fall
// back to its own SamplingOptions{} defaults, keeping the two paths
// consistent without redundant copies.
const bool user_provided_sampling =
!var_map["temperature"].defaulted() || !var_map["top-p"].defaulted() ||
!var_map["top-k"].defaulted() || !var_map["n-ctx"].defaulted() ||
!var_map["seed"].defaulted() || !var_map["n-gpu-layers"].defaulted();
if (user_provided_sampling) {
// Warn but do not fail — the run is still valid, the flags are just
// silently irrelevant when no model is loaded.
if (use_mocked) {
const std::string msg =
"Sampling parameters are ignored when using --mocked";
if (logger) {
logger->Log(LogDTO{.level = LogLevel::Warn,
.phase = PipelinePhase::Startup,
.message = msg});
} else {
std::cerr << msg << std::endl;
}
} else {
SamplingOptions sampling;
sampling.temperature = var_map["temperature"].as<float>();
sampling.top_p = var_map["top-p"].as<float>();
sampling.top_k = var_map["top-k"].as<uint32_t>();
sampling.n_ctx = var_map["n-ctx"].as<uint32_t>();
sampling.seed = var_map["seed"].as<int>();
sampling.n_gpu_layers = var_map["n-gpu-layers"].as<int>();
options.generator.sampling = sampling;
}
}
return options;
} catch (const std::exception& exception) {
const std::string msg =
std::string("Failed to parse command-line arguments: ") +
exception.what();
if (logger) {
logger->Log(LogDTO{.level = LogLevel::Error,
.phase = PipelinePhase::Startup,
.message = msg});
}
return std::nullopt;
} catch (...) {
const std::string msg =
"Failed to parse command-line arguments: unknown error";
if (logger) {
logger->Log(LogDTO{.level = LogLevel::Error,
.phase = PipelinePhase::Startup,
.message = msg});
}
return std::nullopt;
}
}
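
Note on the parser above: the mode checks reduce to three rules. `--mocked` and `--model` are mutually exclusive, one of the two is required, and `--prompt-dir` is required for live inference. A minimal sketch of the same rules as a pure function, which may be easier to unit test than the inline branches. `ValidateMode` and its signature are hypothetical and not part of this change; the messages are copied from the code above.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical helper (not in the diff): returns an error message when the
// flag combination is invalid, or std::nullopt when it is acceptable.
std::optional<std::string> ValidateMode(bool use_mocked,
                                        const std::string& model_path,
                                        const std::string& prompt_dir) {
  if (use_mocked && !model_path.empty()) {
    return std::string(
        "Invalid arguments: --mocked and --model are mutually exclusive");
  }
  if (!use_mocked && model_path.empty()) {
    return std::string(
        "Invalid arguments: either --mocked or --model must be specified");
  }
  // The mock generator never reads prompt files, so only live runs need one.
  if (!use_mocked && prompt_dir.empty()) {
    return std::string(
        "Invalid arguments: --prompt-dir is required when not using --mocked");
  }
  return std::nullopt;
}
```

The caller would log the returned message once and return `std::nullopt`, instead of repeating the logger-or-`std::cerr` branch three times.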

@@ -0,0 +1,16 @@
/**
* @file biergarten_data_generator/biergarten_data_generator.cc
* @brief BiergartenDataGenerator constructor implementation.
*/
#include "biergarten_data_generator.h"
#include <utility>
BiergartenDataGenerator::BiergartenDataGenerator(
std::unique_ptr<IEnrichmentService> context_service,
std::unique_ptr<DataGenerator> generator,
std::unique_ptr<IExportService> exporter)
: context_service_(std::move(context_service)),
generator_(std::move(generator)),
exporter_(std::move(exporter)) {}

@@ -0,0 +1,58 @@
/**
* @file biergarten_data_generator/generate_breweries.cc
* @brief BiergartenDataGenerator::GenerateBreweries() implementation.
*/
#include <spdlog/spdlog.h>
#include "biergarten_data_generator.h"
void BiergartenDataGenerator::GenerateBreweries(
std::span<const EnrichedCity> cities) {
spdlog::info("\n=== SAMPLE BREWERY GENERATION ===");
generated_breweries_.clear();
size_t skipped_count = 0;
size_t export_failed_count = 0;
for (const auto& [location, region_context] : cities) {
try {
const BreweryResult brewery =
generator_->GenerateBrewery(location, region_context);
const GeneratedBrewery gen{.location = location, .brewery = brewery};
generated_breweries_.push_back(gen);
try {
exporter_->ProcessRecord(gen);
} catch (const std::exception& export_exception) {
++export_failed_count;
spdlog::warn(
"[Pipeline] Generated brewery for '{}' ({}) but SQLite export "
"failed: {}",
location.city, location.country, export_exception.what());
}
} catch (const std::exception& e) {
++skipped_count;
spdlog::warn(
"[Pipeline] Skipping city '{}' ({}): brewery generation failed: "
"{}",
location.city, location.country, e.what());
}
}
if (skipped_count > 0) {
spdlog::warn("[Pipeline] Skipped {} city/cities due to generation errors",
skipped_count);
}
if (export_failed_count > 0) {
spdlog::warn(
"[Pipeline] Failed to export {} generated brewery/breweries to "
"SQLite",
export_failed_count);
}
}
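
Note on `GenerateBreweries` above: the nested `try` blocks separate two failure modes. A generation failure skips the city, while an export failure keeps the in-memory record and only bumps a counter. A minimal, self-contained sketch of that control flow, using hypothetical stand-ins (`Counts`, `ProcessAll`, and the `generate` / `export_record` callbacks) rather than the project's `DataGenerator` and `IExportService` types:

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical tally of the three outcomes tracked by the loop.
struct Counts {
  std::size_t generated = 0;
  std::size_t skipped = 0;
  std::size_t export_failed = 0;
};

Counts ProcessAll(
    const std::vector<std::string>& cities,
    const std::function<std::string(const std::string&)>& generate,
    const std::function<void(const std::string&)>& export_record) {
  Counts counts;
  for (const std::string& city : cities) {
    try {
      const std::string record = generate(city);  // may throw: city is skipped
      ++counts.generated;
      try {
        export_record(record);  // may throw: record is kept, failure counted
      } catch (const std::exception&) {
        ++counts.export_failed;
      }
    } catch (const std::exception&) {
      ++counts.skipped;
    }
  }
  return counts;
}
```

Keeping the inner `try` narrow matters here: if the export call sat directly in the outer block, an export error would be reported as a generation failure.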

@@ -0,0 +1,26 @@
/**
* @file biergarten_data_generator/log_results.cc
* @brief BiergartenDataGenerator::LogResults() implementation.
*/
#include <spdlog/spdlog.h>
#include "biergarten_data_generator.h"
void BiergartenDataGenerator::LogResults() const {
spdlog::info("\n=== GENERATED DATA DUMP ===");
size_t index = 1;
for (const auto& [location, brewery] : generated_breweries_) {
spdlog::info(
"{}. city=\"{}\" country=\"{}\" state=\"{}\" "
"iso3166_2={} lat={} lon={}",
index, location.city, location.country, location.state_province,
location.iso3166_2, location.latitude, location.longitude);
spdlog::info(" brewery_name_en=\"{}\"", brewery.name_en);
spdlog::info(" brewery_description_en=\"{}\"", brewery.description_en);
spdlog::info(" brewery_name_local=\"{}\"", brewery.name_local);
spdlog::info(" brewery_description_local=\"{}\"",
brewery.description_local);
++index;
}
}

@@ -0,0 +1,41 @@
/**
* @file biergarten_data_generator/query_cities_with_countries.cc
* @brief BiergartenDataGenerator::QueryCitiesWithCountries() implementation.
*/
#include <spdlog/spdlog.h>
#include <algorithm>
#include <filesystem>
#include <iterator>
#include <random>
#include "biergarten_data_generator.h"
#include "json_handling/json_loader.h"
static constexpr size_t kBreweryAmount = 50;
std::vector<Location> BiergartenDataGenerator::QueryCitiesWithCountries() {
spdlog::info("\n=== GEOGRAPHIC DATA OVERVIEW ===");
const std::filesystem::path locations_path = "locations.json";
auto all_locations = JsonLoader::LoadLocations(locations_path);
spdlog::info(" Locations available: {}", all_locations.size());
const size_t sample_count = std::min(kBreweryAmount, all_locations.size());
const auto sample_count_signed =
static_cast<std::iter_difference_t<decltype(all_locations.cbegin())>>(
sample_count);
std::vector<Location> sampled_locations;
sampled_locations.reserve(sample_count);
std::random_device random_generator;
std::ranges::sample(all_locations, std::back_inserter(sampled_locations),
sample_count_signed, random_generator);
spdlog::info(" Sampled locations: {}", sampled_locations.size());
return sampled_locations;
}

@@ -0,0 +1,52 @@
/**
* @file biergarten_data_generator/run.cc
* @brief BiergartenDataGenerator::Run() implementation.
*/
#include <spdlog/spdlog.h>
#include <utility>
#include "biergarten_data_generator.h"
bool BiergartenDataGenerator::Run() {
try {
exporter_->Initialize();
std::vector<Location> cities = QueryCitiesWithCountries();
std::vector<EnrichedCity> enriched;
enriched.reserve(cities.size());
size_t skipped_count = 0;
for (auto& city : cities) {
try {
std::string region_context = context_service_->GetLocationContext(city);
spdlog::debug("[Pipeline] Context for '{}' ({}) gathered:\n{}",
city.city, city.country, region_context);
enriched.push_back(
EnrichedCity{.location = std::move(city),
.region_context = std::move(region_context)});
} catch (const std::exception& exception) {
++skipped_count;
spdlog::warn(
"[Pipeline] Skipping city '{}' ({}): context lookup failed: {}",
city.city, city.country, exception.what());
}
}
if (skipped_count > 0) {
spdlog::warn(
"[Pipeline] Skipped {} city/cities due to context lookup errors",
skipped_count);
}
this->GenerateBreweries(enriched);
exporter_->Finalize();
this->LogResults();
return true;
} catch (const std::exception& e) {
spdlog::error("Pipeline execution failed with error: {}", e.what());
return false;
}
}

@@ -1,20 +0,0 @@
/**
* @file biergarten_pipeline_orchestrator/biergarten_pipeline_orchestrator.cc
* @brief BiergartenPipelineOrchestrator constructor implementation.
*/
#include "biergarten_pipeline_orchestrator.h"
#include <utility>
BiergartenPipelineOrchestrator::BiergartenPipelineOrchestrator(
std::shared_ptr<ILogger> logger,
std::unique_ptr<IEnrichmentService> context_service,
std::unique_ptr<DataGenerator> generator,
std::unique_ptr<IExportService> exporter,
const ApplicationOptions &app_options)
: logger_(std::move(logger)),
context_service_(std::move(context_service)),
generator_(std::move(generator)),
exporter_(std::move(exporter)),
application_options_(app_options) {}

@@ -1,68 +0,0 @@
/**
* @file biergarten_pipeline_orchestrator/generate_breweries.cc
* @brief BiergartenPipelineOrchestrator::GenerateBreweries() implementation.
*/
#include <chrono>
#include <format>
#include "biergarten_pipeline_orchestrator.h"
#include "services/logging/logger.h"
void BiergartenPipelineOrchestrator::GenerateBreweries(
std::span<const EnrichedCity> cities) {
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::BreweryAndBeerGeneration,
.message = "=== SAMPLE BREWERY GENERATION ==="});
generated_breweries_.clear();
size_t skipped_count = 0;
size_t export_failed_count = 0;
for (const auto& [location, region_context] : cities) {
try {
const BreweryResult brewery =
generator_->GenerateBrewery(location, region_context);
const GeneratedBrewery gen{.location = location, .brewery = brewery};
generated_breweries_.push_back(gen);
try {
exporter_->ProcessRecord(gen);
} catch (const std::exception& export_exception) {
++export_failed_count;
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::BreweryAndBeerGeneration,
.message =
std::format("[Pipeline] Generated brewery for '{}' ({}) but SQLite export failed: {}",
location.city, location.country, export_exception.what())});
}
} catch (const std::exception& e) {
++skipped_count;
logger_->Log({.level = LogLevel::Warn,
.phase = PipelinePhase::BreweryAndBeerGeneration,
.message = std::format("[Pipeline] Skipping city '{}' ({}): brewery generation failed: {}",
location.city, location.country, e.what())});
}
}
if (skipped_count > 0) {
logger_->Log({.level = LogLevel::Warn,
.phase = PipelinePhase::BreweryAndBeerGeneration,
.message = std::format(
"[Pipeline] Skipped {} city/cities due to generation errors",
skipped_count)});
}
if (export_failed_count > 0) {
logger_->Log({.level = LogLevel::Warn,
.phase = PipelinePhase::Teardown,
.message = std::format(
"[Pipeline] Failed to export {} generated brewery/breweries to SQLite",
export_failed_count)});
}
}

@@ -1,37 +0,0 @@
/**
* @file biergarten_pipeline_orchestrator/log_results.cc
* @brief BiergartenPipelineOrchestrator::LogResults() implementation.
*/
#include <boost/json/array.hpp>
#include <chrono>
#include <format>
#include "../../includes/json_handling/pretty_print.h"
#include "biergarten_pipeline_orchestrator.h"
#include "services/logging/logger.h"
void BiergartenPipelineOrchestrator::LogResults() const {
boost::json::array output;
for (const auto& [location, brewery] : generated_breweries_) {
output.push_back(boost::json::object{
{"name_en", brewery.name_en},
{"description_en", brewery.description_en},
{"name_local", brewery.name_local},
{"description_local", brewery.description_local},
{"location", boost::json::object{
{"city", location.city},
{"country", location.country},
{"state_province", location.state_province},
{"iso3166_2", location.iso3166_2},
{"latitude", location.latitude},
{"longitude", location.longitude},
}}});
}
std::ostringstream oss;
PrettyPrint(oss, output);
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Teardown,
.message = oss.str()});
}

@@ -1,51 +0,0 @@
/**
* @file biergarten_pipeline_orchestrator/query_cities_with_countries.cc
* @brief BiergartenPipelineOrchestrator::QueryCitiesWithCountries() implementation.
*/
#include <algorithm>
#include <chrono>
#include <filesystem>
#include <format>
#include <iterator>
#include <random>
#include "biergarten_pipeline_orchestrator.h"
#include "json_handling/json_loader.h"
#include "services/logging/logger.h"
std::vector<Location>
BiergartenPipelineOrchestrator::QueryCitiesWithCountries() {
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "=== GEOGRAPHIC DATA OVERVIEW ==="});
const std::filesystem::path locations_path = "locations.json";
auto all_locations = JsonLoader::LoadLocations(locations_path, logger_);
const size_t sample_count = std::min(
static_cast<size_t>(application_options_.pipeline.location_count),
all_locations.size());
const auto sample_count_signed =
static_cast<std::iter_difference_t<decltype(all_locations.cbegin())>>(
sample_count);
std::vector<Location> sampled_locations;
sampled_locations.reserve(sample_count);
std::random_device random_generator;
std::ranges::sample(all_locations, std::back_inserter(sampled_locations),
sample_count_signed, random_generator);
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = std::format(" Locations available: {}",
all_locations.size())});
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = std::format(" Sampled locations: {}",
sampled_locations.size())});
return sampled_locations;
}

@@ -1,63 +0,0 @@
/**
* @file biergarten_pipeline_orchestrator/run.cc
* @brief BiergartenPipelineOrchestrator::Run() implementation.
*/
#include <chrono>
#include <format>
#include <utility>
#include "biergarten_pipeline_orchestrator.h"
#include "services/logging/logger.h"
bool BiergartenPipelineOrchestrator::Run() {
try {
exporter_->Initialize();
std::vector<Location> cities = QueryCitiesWithCountries();
std::vector<EnrichedCity> enriched;
enriched.reserve(cities.size());
size_t skipped_count = 0;
for (auto& city : cities) {
try {
std::string region_context = context_service_->GetLocationContext(city);
// logger_->Log(LogLevel::Debug, PipelinePhase::UserGeneration,
// "[Pipeline] Context for '" + city.city + "' (" +
// city.iso3166_2 + ") gathered:\n" + region_context);
enriched.push_back(
EnrichedCity{.location = std::move(city),
.region_context = std::move(region_context)});
} catch (const std::exception& exception) {
++skipped_count;
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message = std::format(
"[Pipeline] Skipping city '{}' ({}): context lookup failed: {}",
city.city, city.country, exception.what())});
}
}
if (skipped_count > 0) {
logger_->Log({.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message = std::format(
"[Pipeline] Skipped {} city/cities due to context lookup errors",
skipped_count)});
}
this->GenerateBreweries(enriched);
exporter_->Finalize();
this->LogResults();
return true;
} catch (const std::exception& e) {
logger_->Log(
{.level = LogLevel::Error,
.phase = PipelinePhase::Teardown,
.message =
std::format("Pipeline execution failed with error: {}", e.what())});
return false;
}
}

@@ -4,7 +4,8 @@
  * inference, and validates structured JSON output for brewery records.
  */
-#include <chrono>
+#include <spdlog/spdlog.h>
 #include <format>
 #include <optional>
 #include <stdexcept>
@@ -99,13 +100,8 @@ BreweryResult LlamaGenerator::GenerateBrewery(
 // Generate brewery data from LLM
 raw = this->Infer(system_prompt, user_prompt, max_tokens,
 kBreweryJsonGrammar);
-if (logger_) {
-logger_->Log(
-{.level = LogLevel::Debug,
-.phase = PipelinePhase::BreweryAndBeerGeneration,
-.message = std::format("LlamaGenerator: raw output (attempt {}): {}",
-attempt + 1, raw)});
-}
+spdlog::debug("LlamaGenerator: raw output (attempt {}): {}", attempt + 1,
+raw);
 // Validate output: parse JSON and check required fields
@@ -116,13 +112,9 @@ BreweryResult LlamaGenerator::GenerateBrewery(
 if (!validation_error.has_value()) {
 // Success: return parsed brewery data
-if (logger_) {
-logger_->Log(
-{.level = LogLevel::Info,
-.phase = PipelinePhase::BreweryAndBeerGeneration,
-.message = std::format("LlamaGenerator: successfully generated brewery data on attempt {}",
-attempt + 1)});
-}
+spdlog::info(
+"LlamaGenerator: successfully generated brewery data on attempt {}",
+attempt + 1);
 return brewery;
 }
@@ -130,14 +122,8 @@ BreweryResult LlamaGenerator::GenerateBrewery(
 // Validation failed: log error and prepare corrective feedback
 last_error = *validation_error;
-if (logger_) {
-logger_->Log(
-{.level = LogLevel::Warn,
-.phase = PipelinePhase::BreweryAndBeerGeneration,
-.message =
-std::format("LlamaGenerator: malformed brewery JSON (attempt {}): {}",
-attempt + 1, *validation_error)});
-}
+spdlog::warn("LlamaGenerator: malformed brewery JSON (attempt {}): {}",
+attempt + 1, *validation_error);
 // Update prompt with error details to guide LLM toward correct output.
 user_prompt = std::format(
@@ -154,13 +140,9 @@ BreweryResult LlamaGenerator::GenerateBrewery(
 }
 // All retry attempts exhausted: log failure and throw exception
-if (logger_) {
-logger_->Log(
-{.level = LogLevel::Error,
-.phase = PipelinePhase::BreweryAndBeerGeneration,
-.message = std::format(
-"LlamaGenerator: malformed brewery response after {} attempts: {}",
-max_attempts, last_error.empty() ? raw : last_error)});
-}
+spdlog::error(
+"LlamaGenerator: malformed brewery response after {} attempts: "
+"{}",
+max_attempts, last_error.empty() ? raw : last_error);
 throw std::runtime_error("LlamaGenerator: malformed brewery response");
 }
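
Note on the hunks above: they only swap the logging calls, but they sit inside a validate-and-retry loop in which each failed validation is fed back into the next prompt until `max_attempts` is exhausted. A minimal sketch of that loop shape, assuming inference and validation can be passed in as callbacks. `GenerateWithRetry`, `infer`, and `validate` are hypothetical names, and the feedback text is illustrative rather than the project's actual prompt:

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical sketch: `validate` returns an error message, or std::nullopt
// when the raw output is acceptable.
std::string GenerateWithRetry(
    std::string prompt, int max_attempts,
    const std::function<std::string(const std::string&)>& infer,
    const std::function<std::optional<std::string>(const std::string&)>&
        validate) {
  assert(max_attempts > 0);
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    const std::string raw = infer(prompt);
    const std::optional<std::string> error = validate(raw);
    if (!error.has_value()) {
      return raw;  // validation passed
    }
    // Append the validation error so the next attempt can correct it.
    prompt += "\nPrevious output was rejected: " + *error;
  }
  throw std::runtime_error("malformed response after all attempts");
}
```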

@@ -4,8 +4,9 @@
  * retry handling, and output sanitization for downstream parsing.
  */
+#include <spdlog/spdlog.h>
-#include <format>
+#include <stdexcept>
 #include <string>
 #include "data_generation/llama_generator.h"
@@ -20,5 +21,5 @@
 // 4. Return locale-aware username and biography
 UserResult LlamaGenerator::GenerateUser(const std::string& locale) {
 return {.username = "test_user",
-.bio = std::format("This is a test user profile from {}.", locale)};
+.bio = "This is a test user profile from " + locale + "."};
 }

@@ -16,11 +16,11 @@
 #include "data_generation/llama_generator_helpers.h"
 #include "llama.h"
-namespace {
 /**
  * String trimming: removes leading and trailing whitespace
  */
-std::string Trim(std::string_view value) {
+static std::string Trim(std::string_view value) {
 constexpr std::string_view whitespace = " \t\n\r\f\v";
 const size_t first_index = value.find_first_not_of(whitespace);
 if (first_index == std::string_view::npos) {
@@ -35,7 +35,7 @@ std::string Trim(std::string_view value) {
  * Normalize whitespace: collapses multiple spaces/tabs/newlines into single
  * spaces
  */
-std::string CondenseWhitespace(std::string_view text) {
+static std::string CondenseWhitespace(std::string_view text) {
 std::string out;
 out.reserve(text.size());
@@ -61,37 +61,7 @@ std::string CondenseWhitespace(std::string_view text) {
 // Guard against truncating in the first half of the string.
 // This preserves the critical opening content and avoids cutting critical
 // context words early in the region description.
-constexpr size_t kTruncationGuardDivisor = 2;
+static constexpr size_t kTruncationGuardDivisor = 2;
-bool ReadRequiredTrimmedStringField(const boost::json::object& obj,
-std::string_view key, std::string& out,
-std::string* error_out) {
-const boost::json::value* field = obj.if_contains(key);
-if (field == nullptr || !field->is_string()) {
-return false;
-}
-const auto& string_value = field->as_string();
-out = Trim(std::string_view(string_value.data(), string_value.size()));
-return !out.empty();
-}
-bool HasSchemaPlaceholder(const std::array<std::string*, 4>& values) {
-for (const std::string* value : values) {
-std::string lowered = *value;
-std::ranges::transform(lowered, lowered.begin(),
-[](const unsigned char character) {
-return static_cast<char>(std::tolower(character));
-});
-if (lowered == "string") {
-return true;
-}
-}
-return false;
-}
-} // namespace
 /**
  * Truncate region context to fit within max length while preserving word
@@ -151,6 +121,47 @@ void AppendTokenPiece(const llama_vocab* vocab, llama_token token,
 "LlamaGenerator: failed to decode sampled token piece");
 }
+static bool ReadRequiredTrimmedStringField(const boost::json::object& obj,
+std::string_view key,
+std::string& out,
+std::string* error_out) {
+const boost::json::value* field = obj.if_contains(key);
+if (field == nullptr || !field->is_string()) {
+if (error_out != nullptr) {
+*error_out =
+"JSON field '" + std::string(key) + "' is missing or not a string";
+}
+return false;
+}
+const auto& string_value = field->as_string();
+out = Trim(std::string_view(string_value.data(), string_value.size()));
+if (out.empty()) {
+if (error_out != nullptr) {
+*error_out = "JSON field '" + std::string(key) + "' must not be empty";
+}
+return false;
+}
+return true;
+}
+static bool HasSchemaPlaceholder(const std::array<std::string*, 4>& values) {
+for (const std::string* value : values) {
+std::string lowered = *value;
+std::ranges::transform(lowered, lowered.begin(),
+[](unsigned char character) {
+return static_cast<char>(std::tolower(character));
+});
+if (lowered == "string") {
+return true;
+}
+}
+return false;
+}
 std::optional<std::string> ValidateBreweryJson(const std::string& raw,
 BreweryResult& brewery_out) {
 boost::system::error_code error_code;
@@ -198,7 +209,7 @@ std::optional<std::string> ValidateBreweryJson(const std::string& raw,
 return validation_error;
 }
-const std::array schema_placeholders = {
+const std::array<std::string*, 4> schema_placeholders = {
 &brewery_out.name_en, &brewery_out.description_en,
 &brewery_out.name_local, &brewery_out.description_local};
 if (HasSchemaPlaceholder(schema_placeholders)) {
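
Note on `HasSchemaPlaceholder` above: it guards against a model echoing the schema's literal `string` type name back as a field value. A minimal sketch of the trim and case-insensitive comparison on a single value. `TrimCopy` and `IsSchemaPlaceholder` are hypothetical single-value variants written for illustration, not the helpers in this file:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: returns `value` without leading/trailing whitespace.
std::string TrimCopy(std::string_view value) {
  constexpr std::string_view whitespace = " \t\n\r\f\v";
  const std::size_t first = value.find_first_not_of(whitespace);
  if (first == std::string_view::npos) {
    return {};  // empty or whitespace-only input
  }
  const std::size_t last = value.find_last_not_of(whitespace);
  return std::string(value.substr(first, last - first + 1));
}

// Hypothetical helper: true when the trimmed value is "string" in any case.
bool IsSchemaPlaceholder(std::string_view value) {
  std::string lowered = TrimCopy(value);
  // Go through unsigned char: std::tolower is undefined for negative values.
  std::transform(lowered.begin(), lowered.end(), lowered.begin(),
                 [](unsigned char character) {
                   return static_cast<char>(std::tolower(character));
                 });
  return lowered == "string";
}
```

The comparison is an exact match after lowering, so a legitimate name that merely contains the word is not rejected.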

@@ -5,9 +5,9 @@
  * output tokens back to text for system+user chat prompts.
  */
+#include <spdlog/spdlog.h>
 #include <algorithm>
-#include <chrono>
-#include <format>
 #include <memory>
 #include <stdexcept>
 #include <string>
@@ -107,7 +107,7 @@ std::string LlamaGenerator::InferFormatted(const std::string& formatted_prompt,
 .top_p = sampling_top_p_,
 .seed = static_cast<uint32_t>(rng_()),
 };
-const auto sampler = MakeSamplerChain(vocab, sampler_config, grammar);
+auto sampler = MakeSamplerChain(vocab, sampler_config, grammar);
 /**
  * Clear KV cache to ensure clean inference state (no residual context)
@@ -171,14 +171,10 @@ std::string LlamaGenerator::InferFormatted(const std::string& formatted_prompt,
  */
 prompt_tokens.resize(static_cast<size_t>(token_count));
 if (token_count > prompt_budget) {
-if (logger_) {
-logger_->Log({.level = LogLevel::Warn,
-.phase = PipelinePhase::BreweryAndBeerGeneration,
-.message = std::format(
-"LlamaGenerator: prompt too long ({} tokens), "
-"truncating to {} tokens to fit n_batch/n_ctx limits",
-token_count, prompt_budget)});
-}
+spdlog::warn(
+"LlamaGenerator: prompt too long ({} tokens), truncating to {} "
+"tokens to fit n_batch/n_ctx limits",
+token_count, prompt_budget);
 prompt_tokens.resize(static_cast<size_t>(prompt_budget));
 token_count = prompt_budget;
 }
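
Note on the last hunk above: the warning fires when the tokenized prompt exceeds `prompt_budget`, after which the token vector is cut to fit. A minimal sketch of that clamp on a plain `int` vector. `ClampToBudget` is a hypothetical name, and how the real code derives `prompt_budget` from `n_batch` and `n_ctx` is not visible in this hunk, so it is simply a parameter here:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true when tokens were dropped, so the caller
// can decide whether to log a warning.
bool ClampToBudget(std::vector<int>& tokens, std::size_t budget) {
  if (tokens.size() <= budget) {
    return false;
  }
  tokens.resize(budget);  // keeps the first `budget` tokens, drops the tail
  return true;
}
```

Truncating from the tail keeps the start of the prompt intact and drops whatever comes last.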

@@ -11,7 +11,7 @@
 #include <stdexcept>
 #include <string>
-#include "data_model/models.h"
+#include "data_model/application_options.h"
 #include "llama.h"
 static constexpr uint32_t kMaxContextSize = 32768U;
@@ -32,11 +32,9 @@ void LlamaGenerator::ContextDeleter::operator()(
 LlamaGenerator::LlamaGenerator(
 const ApplicationOptions& options, const std::string& model_path,
-std::shared_ptr<ILogger> logger,
 std::unique_ptr<IPromptFormatter> prompt_formatter,
 std::unique_ptr<IPromptDirectory> prompt_directory)
 : rng_(std::random_device{}()),
-logger_(std::move(logger)),
 prompt_formatter_(std::move(prompt_formatter)),
 prompt_directory_(std::move(prompt_directory)) {
 if (model_path.empty()) {
@@ -91,7 +89,6 @@ LlamaGenerator::LlamaGenerator(
 }
 n_ctx_ = sampling.n_ctx;
-n_gpu_layers_ = sampling.n_gpu_layers;
 this->Load(model_path);
 }

@@ -4,14 +4,14 @@
  * context, and resets prior resources during model initialization.
  */
+#include <spdlog/spdlog.h>
 #include <algorithm>
-#include <chrono>
 #include <stdexcept>
 #include <string>
 #include <utility>
 #include "data_generation/llama_generator.h"
-#include "ggml-backend.h"
 #include "llama.h"
 // Maximum batch size for decode operations. Capping the batch prevents
@@ -22,16 +22,9 @@ void LlamaGenerator::Load(const std::string& model_path) {
 context_.reset();
 model_.reset();
-// Specifically load dynamic ggml backends (like CUDA) that are provided
-// externally before attempting to load a model.
-ggml_backend_load_all();
-llama_model_params model_params = llama_model_default_params();
-model_params.n_gpu_layers = n_gpu_layers_;
-ModelHandle loaded_model(
+const llama_model_params model_params = llama_model_default_params();
+LlamaGenerator::ModelHandle loaded_model(
 llama_model_load_from_file(model_path.c_str(), model_params));
 if (!loaded_model) {
 throw std::runtime_error(
 "LlamaGenerator: failed to load model from path: " + model_path);
@@ -41,9 +34,8 @@ void LlamaGenerator::Load(const std::string& model_path) {
 context_params.n_ctx = n_ctx_;
 context_params.n_batch = std::min(n_ctx_, kMaxBatchSize);
-ContextHandle loaded_context(
+LlamaGenerator::ContextHandle loaded_context(
 llama_init_from_model(loaded_model.get(), context_params));
 if (!loaded_context) {
 throw std::runtime_error("LlamaGenerator: failed to create context");
 }
@@ -51,10 +43,5 @@ void LlamaGenerator::Load(const std::string& model_path) {
 model_ = std::move(loaded_model);
 context_ = std::move(loaded_context);
-if (logger_) {
-logger_->Log({.level = LogLevel::Info,
-.phase = PipelinePhase::Startup,
-.message = std::format("[LlamaGenerator] Loaded model: {} ",
-model_path)});
-}
+spdlog::info("[LlamaGenerator] Loaded model: {}", model_path);
 }
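
Note on `Load` above: it builds the model and context into local handles and assigns them to the members only once both have succeeded, so a failure partway through leaves no half-initialized state behind. A minimal sketch of that pattern with `std::unique_ptr` and a custom deleter. `FakeResource`, `Holder`, `Acquire`, and the `live_resources` counter are hypothetical stand-ins for the llama.cpp handles, which this sketch does not use:

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

inline int live_resources = 0;  // resources acquired and not yet freed

struct FakeResource {};

struct FakeDeleter {
  void operator()(FakeResource* resource) const {
    delete resource;
    --live_resources;
  }
};
using Handle = std::unique_ptr<FakeResource, FakeDeleter>;

// Hypothetical acquisition that can fail by returning an empty handle.
Handle Acquire(bool succeed) {
  if (!succeed) {
    return Handle(nullptr);
  }
  ++live_resources;
  return Handle(new FakeResource());
}

class Holder {
 public:
  void Load(bool model_ok, bool context_ok) {
    // Drop any previous state first, context before model.
    context_.reset();
    model_.reset();
    Handle model = Acquire(model_ok);
    if (!model) throw std::runtime_error("failed to load model");
    Handle context = Acquire(context_ok);
    if (!context) throw std::runtime_error("failed to create context");
    // Commit only after both acquisitions succeeded.
    model_ = std::move(model);
    context_ = std::move(context);
  }
  bool loaded() const { return model_ != nullptr && context_ != nullptr; }

 private:
  Handle model_;    // declared first, so it is destroyed after context_
  Handle context_;
};
```

If the context step throws, the local `model` handle goes out of scope and releases its resource, which is the property the local-then-commit ordering buys.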

@@ -6,9 +6,7 @@
#include "json_handling/json_loader.h" #include "json_handling/json_loader.h"
#include <format> #include <spdlog/spdlog.h>
#include "services/logging/logger.h"
#include <iostream>
#include <boost/json.hpp> #include <boost/json.hpp>
#include <fstream> #include <fstream>
@@ -21,8 +19,8 @@ static std::string ReadRequiredString(const boost::json::object& object,
                                       const char* key) {
   const boost::json::value* value = object.if_contains(key);
   if (value == nullptr || !value->is_string()) {
-    throw std::runtime_error(
-        std::format("Missing or invalid string field: {}", key));
+    throw std::runtime_error(std::string("Missing or invalid string field: ") +
+                             key);
   }
   const std::string_view text = value->as_string();
   return std::string(text);
@@ -32,8 +30,8 @@ static double ReadRequiredNumber(const boost::json::object& object,
                                  const char* key) {
   const boost::json::value* value = object.if_contains(key);
   if (value == nullptr || !value->is_number()) {
-    throw std::runtime_error(
-        std::format("Missing or invalid numeric field: {}", key));
+    throw std::runtime_error(std::string("Missing or invalid numeric field: ") +
+                             key);
   }
   return value->to_number<double>();
 }
@@ -43,7 +41,7 @@ static std::vector<std::string> ReadRequiredStringArray(
   const boost::json::value* value = object.if_contains(key);
   if (value == nullptr || !value->is_array()) {
     throw std::runtime_error(
-        std::format("Missing or invalid string array field: {}", key));
+        std::string("Missing or invalid string array field: ") + key);
   }
   const auto& array = value->as_array();
@@ -52,7 +50,7 @@ static std::vector<std::string> ReadRequiredStringArray(
   for (const auto& item : array) {
     if (!item.is_string()) {
       throw std::runtime_error(
-          std::format("Missing or invalid string array field: {}", key));
+          std::string("Missing or invalid string array field: ") + key);
     }
     items.emplace_back(item.as_string());
   }
@@ -60,7 +58,7 @@ static std::vector<std::string> ReadRequiredStringArray(
 }
 
 std::vector<Location> JsonLoader::LoadLocations(
-    const std::filesystem::path& filepath, std::shared_ptr<ILogger> logger) {
+    const std::filesystem::path& filepath) {
   std::ifstream input(filepath);
   if (!input.is_open()) {
     throw std::runtime_error("Failed to open locations file: " +
@@ -106,5 +104,7 @@ std::vector<Location> JsonLoader::LoadLocations(
     });
   }
 
+  spdlog::info("[JsonLoader] Loaded {} locations from {}", locations.size(),
+               filepath.string());
   return locations;
 }

View File

@@ -4,77 +4,184 @@
* initializes shared infrastructure, and executes the pipeline entry flow. * initializes shared infrastructure, and executes the pipeline entry flow.
*/ */
#include <spdlog/fmt/fmt.h>
#include <spdlog/spdlog.h> #include <spdlog/spdlog.h>
#include <boost/di.hpp> #include <boost/di.hpp>
#include <boost/program_options.hpp> #include <boost/program_options.hpp>
#include <chrono> #include <chrono>
#include <cstdint>
#include <exception> #include <exception>
#include <format>
#include <iostream>
#include <memory> #include <memory>
#include <optional> #include <optional>
#include <sstream>
#include <string> #include <string>
#include <thread>
#include "biergarten_pipeline_orchestrator.h" #include "biergarten_data_generator.h"
#include "concurrency/bounded_channel.h"
#include "data_generation/llama_generator.h" #include "data_generation/llama_generator.h"
#include "data_generation/mock_generator.h" #include "data_generation/mock_generator.h"
#include "data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.h" #include "data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.h"
#include "data_model/models.h" #include "data_model/application_options.h"
#include "llama_backend_state.h" #include "llama_backend_state.h"
#include "services/database/export_service.h" #include "services/enrichment_service.h"
#include "services/database/sqlite_export_service.h" #include "services/export_service.h"
#include "services/datetime/timer.h" #include "services/prompt_directory.h"
#include "services/enrichment/enrichment_service.h" #include "services/sqlite_export_service.h"
#include "services/enrichment/mock_enrichment.h" #include "services/wikipedia_service.h"
#include "services/enrichment/wikipedia_service.h" #include "web_client/curl_web_client.h"
#include "services/logging/log_dispatcher.h"
#include "services/logging/log_entry.h"
#include "services/logging/log_producer.h"
#include "services/logging/logger.h"
#include "services/prompting/prompt_directory.h"
#include "web_client/http_web_client.h"
namespace prog_opts = boost::program_options;
namespace di = boost::di; namespace di = boost::di;
static constexpr size_t kLogMaxCount = 512; /**
* @brief Parse command-line arguments into ApplicationOptions.
*
* @param argc Command-line argument count.
* @param argv Command-line arguments.
* @return Parsed ApplicationOptions if parsing succeeded, std::nullopt
* otherwise.
*/
std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv) {
prog_opts::options_description desc("Pipeline Options");
int main(const int argc, char** argv) { auto opt = desc.add_options();
spdlog::set_level(spdlog::level::debug);
spdlog::set_pattern("│ %Y-%m-%d %H:%M:%S.%e │ %^%-7l%$ │ %v");
BoundedChannel<LogEntry> log_channel(kLogMaxCount);
auto log_dispatcher = // opt("help,h", "Produce help message");
std::make_unique<LogDispatcher>(log_channel);
std::shared_ptr<ILogger> log_producer =
std::make_shared<LogProducer>(log_channel);
std::thread log_thread([&log_dispatcher] { log_dispatcher->Run(); }); // Generator Options
auto shutdown = [&](const int exit_code) { opt("mocked", prog_opts::bool_switch(),
log_channel.Close(); "Use mocked generator for brewery/user data");
log_thread.join(); opt("model,m", prog_opts::value<std::string>()->default_value(""),
return exit_code; "Path to LLM model (gguf)");
};
// Sampling Options - defaults driven from SamplingOptions struct
const SamplingOptions kSamplingDefaults{};
opt("temperature",
prog_opts::value<float>()->default_value(kSamplingDefaults.temperature),
"Sampling temperature (higher = more random)");
opt("top-p",
prog_opts::value<float>()->default_value(kSamplingDefaults.top_p),
"Nucleus sampling top-p in (0,1] (higher = more random)");
opt("top-k",
prog_opts::value<uint32_t>()->default_value(kSamplingDefaults.top_k),
"Top-k sampling parameter (higher = more candidate tokens)");
opt("n-ctx",
prog_opts::value<uint32_t>()->default_value(kSamplingDefaults.n_ctx),
"Context window size in tokens");
opt("seed", prog_opts::value<int>()->default_value(kSamplingDefaults.seed),
"Sampler seed: -1 for random, otherwise non-negative integer");
// Pipeline Options
opt("output,o", prog_opts::value<std::string>()->default_value("output"),
"Directory for generated artifacts");
opt("log-path",
prog_opts::value<std::string>()->default_value("pipeline.log"),
"Path for application logs");
opt("prompt-dir", prog_opts::value<std::string>()->default_value(""),
"Directory containing named prompt files (e.g. BREWERY_GENERATION.md)."
" Required when not using --mocked.");
if (argc == 1) {
spdlog::info("Biergarten Pipeline");
std::stringstream usage_stream;
usage_stream << "\nUsage: biergarten-pipeline [options]\n\n" << desc;
spdlog::info(usage_stream.str());
return std::nullopt;
}
try {
prog_opts::variables_map var_map;
prog_opts::store(prog_opts::parse_command_line(argc, argv, desc), var_map);
prog_opts::notify(var_map);
if (var_map.contains("help")) {
std::stringstream help_stream;
help_stream << "\n" << desc;
spdlog::info(help_stream.str());
return std::nullopt;
}
ApplicationOptions options;
options.pipeline.output_path = var_map["output"].as<std::string>();
options.pipeline.log_path = var_map["log-path"].as<std::string>();
options.pipeline.prompt_dir = var_map["prompt-dir"].as<std::string>();
const bool use_mocked = var_map["mocked"].as<bool>();
const std::string model_path = var_map["model"].as<std::string>();
if (use_mocked && !model_path.empty()) {
spdlog::error(
"Invalid arguments: --mocked and --model are mutually exclusive");
return std::nullopt;
}
if (!use_mocked && model_path.empty()) {
spdlog::error(
"Invalid arguments: Either --mocked or --model must be specified");
return std::nullopt;
}
if (!use_mocked && options.pipeline.prompt_dir.empty()) {
spdlog::error(
"Invalid arguments: --prompt-dir is required when not using "
"--mocked");
return std::nullopt;
}
options.generator.use_mocked = use_mocked;
options.generator.model_path = model_path;
const bool user_provided_sampling =
!var_map["temperature"].defaulted() || !var_map["top-p"].defaulted() ||
!var_map["top-k"].defaulted() || !var_map["n-ctx"].defaulted() ||
!var_map["seed"].defaulted();
if (use_mocked) {
if (user_provided_sampling) {
spdlog::warn("Sampling parameters are ignored when using --mocked");
}
} else if (user_provided_sampling) {
SamplingOptions sampling;
sampling.temperature = var_map["temperature"].as<float>();
sampling.top_p = var_map["top-p"].as<float>();
sampling.top_k = var_map["top-k"].as<uint32_t>();
sampling.n_ctx = var_map["n-ctx"].as<uint32_t>();
sampling.seed = var_map["seed"].as<int>();
options.generator.sampling = sampling;
}
return options;
} catch (const std::exception& exception) {
spdlog::error("Failed to parse command-line arguments: {}",
exception.what());
return std::nullopt;
} catch (...) {
spdlog::error("Failed to parse command-line arguments: unknown error");
return std::nullopt;
}
}
struct Timer {
std::chrono::steady_clock::time_point start_time =
std::chrono::steady_clock::now();
[[nodiscard]] int64_t Elapsed() const {
return std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::steady_clock::now() - start_time)
.count();
}
};
int main(const int argc, char** argv) {
try { try {
Timer timer; Timer timer;
const CurlGlobalState curl_state;
#ifndef BIERGARTEN_MOCK_ONLY
const LlamaBackendState llama_backend_state; const LlamaBackendState llama_backend_state;
#endif spdlog::set_pattern("[%Y-%m-%d %H:%M:%S.%e] [%^%l%$] %v");
log_producer->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "STARTING PIPELINE"});
const std::optional<ApplicationOptions> parsed_options =
ParseArguments(argc, argv, log_producer);
const auto parsed_options = ParseArguments(argc, argv);
if (!parsed_options.has_value()) { if (!parsed_options.has_value()) {
return shutdown(EXIT_FAILURE); return 0;
} }
const auto options = *parsed_options; const auto options = *parsed_options;
@@ -83,136 +190,55 @@ int main(const int argc, char** argv) {
options.generator.sampling.value_or(SamplingOptions{}); options.generator.sampling.value_or(SamplingOptions{});
std::unique_ptr<IPromptDirectory> prompt_directory; std::unique_ptr<IPromptDirectory> prompt_directory;
if (!options.generator.use_mocked) { if (!options.generator.use_mocked) {
try { try {
prompt_directory = std::make_unique<PromptDirectory>( prompt_directory =
options.pipeline.prompt_dir, log_producer); std::make_unique<PromptDirectory>(options.pipeline.prompt_dir);
} catch (const std::exception& dir_error) { } catch (const std::exception& dir_error) {
log_producer->Log({.level = LogLevel::Error, spdlog::error("[Startup] Invalid --prompt-dir: {}", dir_error.what());
.phase = PipelinePhase::Startup, return 1;
.message = std::format("Invalid --prompt-dir: {}",
dir_error.what())});
return shutdown(EXIT_FAILURE);
} }
} }
const auto injector = di::make_injector( const auto injector = di::make_injector(
di::bind<ILogger>().to(log_producer), di::bind<WebClient>().to<CURLWebClient>(),
di::bind<ApplicationOptions>().to(options), di::bind<ApplicationOptions>().to(options),
di::bind<std::string>().to(model_path), di::bind<IEnrichmentService>().to<WikipediaService>(),
di::bind<IExportService>().to<SqliteExportService>(), di::bind<IExportService>().to<SqliteExportService>(),
di::bind<IPromptFormatter>().to([options, log_producer] { di::bind<IPromptFormatter>().to<Gemma4JinjaPromptFormatter>(),
if (options.generator.use_mocked) { di::bind<std::string>().to(model_path),
{
log_producer->Log(
{.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "Prompt formatter: none (mock mode)"});
}
return std::unique_ptr<IPromptFormatter>(nullptr);
}
{
log_producer->Log(
{.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "Prompt formatter: Gemma4JinjaPromptFormatter"});
}
return std::unique_ptr<IPromptFormatter>(
std::make_unique<Gemma4JinjaPromptFormatter>());
}),
di::bind<WebClient>().to([options, log_producer] {
if (options.generator.use_mocked) {
{
log_producer->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "Web client: none (mock mode)"});
}
return std::unique_ptr<WebClient>(nullptr);
}
{
log_producer->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "Web client: HttpWebClient"});
}
return std::unique_ptr<WebClient>(
std::make_unique<HttpWebClient>(log_producer));
}),
di::bind<IEnrichmentService>().to(
[options, &log_producer](
const auto& inj) -> std::unique_ptr<IEnrichmentService> {
if (options.generator.use_mocked) {
{
log_producer->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "Enrichment: mock"});
}
return std::make_unique<MockEnrichmentService>();
}
{
log_producer->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "Enrichment: Wikipedia"});
}
return std::make_unique<WikipediaEnrichmentService>(
inj.template create<std::unique_ptr<WebClient>>(),
log_producer);
}),
di::bind<DataGenerator>().to( di::bind<DataGenerator>().to(
[&options, &model_path, &sampling, &prompt_directory, [options, model_path, sampling, &prompt_directory](
&log_producer](const auto& inj) -> std::unique_ptr<DataGenerator> { const auto& inj) -> std::unique_ptr<DataGenerator> {
if (options.generator.use_mocked) { if (options.generator.use_mocked) {
{ spdlog::info(
log_producer->Log({.level = LogLevel::Info, "[Generator] Using MockGenerator (no model path provided)");
.phase = PipelinePhase::Startup,
.message = "Generator: mock"});
}
return std::make_unique<MockGenerator>(); return std::make_unique<MockGenerator>();
} }
{
log_producer->Log( spdlog::info(
{.level = LogLevel::Info, "[Generator] Using LlamaGenerator: {} (temperature={}, "
.phase = PipelinePhase::Startup, "top-p={}, top-k={}, n_ctx={}, seed={})",
.message = std::format(
"Generator: LlamaGenerator | model={} | temp={:.2f} "
"top_p={:.2f} top_k={} n_ctx={} seed={}",
model_path, sampling.temperature, sampling.top_p, model_path, sampling.temperature, sampling.top_p,
sampling.top_k, sampling.n_ctx, sampling.seed)}); sampling.top_k, sampling.n_ctx, sampling.seed);
}
return std::make_unique<LlamaGenerator>( return std::make_unique<LlamaGenerator>(
options, model_path, log_producer, options, model_path,
inj.template create<std::unique_ptr<IPromptFormatter>>(), inj.template create<std::unique_ptr<IPromptFormatter>>(),
std::move(prompt_directory)); std::move(prompt_directory));
})); }));
const auto orchestrator = auto generator =
injector.create<std::unique_ptr<BiergartenPipelineOrchestrator>>(); injector.create<std::unique_ptr<BiergartenDataGenerator>>();
if (!orchestrator->Run()) { if (!generator->Run()) {
log_producer->Log({.level = LogLevel::Error, spdlog::error("Pipeline execution failed");
.phase = PipelinePhase::Teardown, return 1;
.message = "Pipeline execution failed"});
return shutdown(EXIT_FAILURE);
} }
log_producer->Log({.level = LogLevel::Info, spdlog::info("Pipeline executed successfully in {} ms", timer.Elapsed());
.phase = PipelinePhase::Teardown, return 0;
.message = std::format("Pipeline complete in {} ms",
timer.Elapsed())});
return shutdown(EXIT_SUCCESS);
} catch (const std::exception& exception) { } catch (const std::exception& exception) {
const LogDTO log_entry{.level = LogLevel::Error, spdlog::critical("Unhandled fatal error in main: {}", exception.what());
.phase = PipelinePhase::Teardown, return 1;
.message = exception.what()};
if (log_producer) {
log_producer->Log(log_entry);
} else {
std::cerr << log_entry.message << std::endl;
}
return shutdown(EXIT_FAILURE);
} }
} }
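The added `ParseArguments` in this file enforces three rules: `--mocked` and `--model` are mutually exclusive, one of them must be given, and `--prompt-dir` is required outside mock mode. The sketch below isolates those checks as a pure function so they can be exercised without Boost.ProgramOptions. `ValidateGeneratorArgsSketch` is a hypothetical name, not part of the repository, and the messages are shortened versions of the ones in the diff.

```cpp
#include <optional>
#include <string>

// Returns an error message for an invalid flag combination, or std::nullopt
// when the combination is acceptable. Mirrors the order of checks in the diff.
std::optional<std::string> ValidateGeneratorArgsSketch(
    bool use_mocked, const std::string& model_path,
    const std::string& prompt_dir) {
  if (use_mocked && !model_path.empty()) {
    return std::string("--mocked and --model are mutually exclusive");
  }
  if (!use_mocked && model_path.empty()) {
    return std::string("Either --mocked or --model must be specified");
  }
  if (!use_mocked && prompt_dir.empty()) {
    return std::string("--prompt-dir is required when not using --mocked");
  }
  return std::nullopt;
}
```

Returning a message instead of logging directly would also let the caller tell a validation failure apart from a help request. In the version shown in the diff, both cases yield `std::nullopt`.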

View File

@@ -1,160 +0,0 @@
/**
* @file wikipedia/fetch_extract.cc
*/
#include <boost/json.hpp>
#include <chrono>
#include <format>
#include <string>
#include <string_view>
#include <thread>
#include "services/enrichment/wikipedia_service.h"
using namespace boost;
std::string WikipediaEnrichmentService::FetchExtract(std::string_view query) {
const std::string cache_key(query);
// 1. Cache Lookup
if (const auto cache_it = this->extract_cache_.find(cache_key);
cache_it != this->extract_cache_.end()) {
if (logger_) {
logger_->Log({.level = LogLevel::Debug,
.phase = PipelinePhase::UserGeneration,
.message = std::format("Wikipedia: Cache hit for {}!", cache_key)});
}
return cache_it->second;
}
const std::string encoded = this->client_->EncodeURL(cache_key);
const std::string url = std::format(
"https://en.wikipedia.org/w/"
"api.php?action=query&titles={}&prop=extracts&explaintext=1&format=json",
encoded);
const std::string body = this->client_->Get(url);
{
using namespace std::literals::chrono_literals;
std::this_thread::sleep_for(1s);
}
// 2. Parse JSON
system::error_code ec;
json::value doc = json::parse(body, ec);
if (ec) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message = std::format("WikipediaService: JSON parse error for '{}': {}",
std::string(query), ec.message())});
}
return {};
}
// 3. Safe Extraction
const json::object* obj = doc.if_object();
if (obj == nullptr) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message =
std::format("WikipediaService: Expected root object for '{}'",
std::string(query))});
}
return {};
}
const json::value* query_ptr = obj->if_contains("query");
const json::value* pages_ptr =
((query_ptr != nullptr) && query_ptr->is_object())
? query_ptr->get_object().if_contains("pages")
: nullptr;
if ((pages_ptr == nullptr) || !pages_ptr->is_object()) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message =
std::format("WikipediaService: Missing query.pages for '{}'",
std::string(query))});
}
return {};
}
const json::object& pages = pages_ptr->get_object();
if (pages.empty()) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message = std::format("WikipediaService: No pages returned for '{}'",
std::string(query))});
}
this->extract_cache_.emplace(cache_key, "");
return {};
}
// Wikipedia returns the page under a dynamic ID key; we just want the first
// one
const json::value& page_val = pages.begin()->value();
if (!page_val.is_object()) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message =
std::format("WikipediaService: Unexpected page format for '{}'",
std::string(query))});
}
return {};
}
const json::object& page = page_val.get_object();
// Handle 404/Missing status
if (page.contains("missing")) {
if (logger_) {
logger_->Log({.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message = std::format("WikipediaService: Page '{}' does not exist",
std::string(query))});
}
this->extract_cache_.emplace(cache_key, "");
return {};
}
const json::value* extract_ptr = page.if_contains("extract");
if ((extract_ptr == nullptr) || !extract_ptr->is_string()) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message =
std::format("WikipediaService: No extract string found for '{}'",
std::string(query))});
}
this->extract_cache_.emplace(cache_key, "");
return {};
}
// 4. Success
std::string extract(extract_ptr->as_string());
if (logger_) {
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::UserGeneration,
.message = std::format("WikipediaService: Fetched {} chars for '{}'",
extract.size(), std::string(query))});
}
this->extract_cache_.insert_or_assign(cache_key, extract);
return extract;
}

View File

@@ -1,70 +0,0 @@
/**
* @file wikipedia/get_summary.cc
* @brief WikipediaService::GetLocationContext() implementation.
*/
#include <chrono>
#include <format>
#include <string>
#include <thread>
#include "services/enrichment/wikipedia_service.h"
std::string WikipediaEnrichmentService::GetLocationContext(
const Location& loc) {
using namespace std::literals::chrono_literals;
if (!this->client_) {
if (logger_) {
logger_->Log({.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message = "Wikipedia client is nullptr."});
}
return {};
}
std::string result;
// std::string region_query(loc.city);
// if (!loc.country.empty()) {
// region_query += loc.state_province,
// region_query += ", ";
// region_query += loc.country;
// }
constexpr std::string_view brewing_query = "brewing";
const std::string location_query =
std::format("{}, {}", loc.city, loc.iso3166_2);
const std::string beer_query = std::format("beer in {}", loc.country);
auto append_extract = [&result](const std::string& extract) -> void {
if (extract.empty()) {
return;
}
if (!result.empty()) {
result += "\n\n";
}
result += extract;
};
try {
append_extract(FetchExtract(brewing_query));
append_extract(FetchExtract(beer_query));
if (logger_) {
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::UserGeneration,
.message = std::format("Done fetching for {}. Sleeping for 10 seconds.",
location_query)});
}
std::this_thread::sleep_for(10s);
} catch (const std::runtime_error& e) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Debug,
.phase = PipelinePhase::UserGeneration,
.message = std::format("WikipediaService lookup failed for '{}': {}",
location_query, e.what())});
}
}
return result;
}

View File

@@ -1,12 +0,0 @@
/**
* @file services/wikipedia/wikipedia_service.cc
* @brief WikipediaService constructor implementation.
*/
#include "services/enrichment/wikipedia_service.h"
#include <utility>
WikipediaEnrichmentService::WikipediaEnrichmentService(
std::unique_ptr<WebClient> client, std::shared_ptr<ILogger> logger)
: client_(std::move(client)), logger_(std::move(logger)) {}

View File

@@ -1,74 +0,0 @@
/**
* @brief LogDispatcher implementation for asynchronous pipeline logging.
*
* LogDispatcher drains LogEntry items from a BoundedChannel and forwards them
* to spdlog for final output.
*/
#include "services/logging/log_dispatcher.h"
#include <spdlog/spdlog.h>
#include <string>
#include "concurrency/bounded_channel.h"
#include "services/logging/log_entry.h"
namespace {
[[nodiscard]] constexpr std::string_view PipelinePhaseToString(
PipelinePhase phase) {
switch (phase) {
case PipelinePhase::Startup:
return "Startup";
case PipelinePhase::UserGeneration:
return "User Generation";
case PipelinePhase::BreweryAndBeerGeneration:
return "Brewery & Beer Gen";
case PipelinePhase::CheckinGeneration:
return "Checkin Gen";
case PipelinePhase::RatingGeneration:
return "Rating Gen";
case PipelinePhase::FollowGeneration:
return "Follow Gen";
case PipelinePhase::Teardown:
return "Teardown";
}
return "Unknown";
}
} // namespace
LogDispatcher::LogDispatcher(BoundedChannel<LogEntry>& channel)
: channel_(channel) {}
void LogDispatcher::Run() {
auto logger = spdlog::default_logger();
while (true) {
auto entry = channel_.Receive();
if (!entry.has_value()) {
// Channel is closed and drained.
break;
}
const auto& log = entry.value();
logger->log(ToSpdlogLevel(log.level),
"{:<20} │ thread: {:016x} │ [{}:{}] │ {}",
PipelinePhaseToString(log.phase),
std::hash<std::thread::id>{}(log.thread_id),
log.origin.file_name(), log.origin.line(), log.message);
}
}
spdlog::level::level_enum LogDispatcher::ToSpdlogLevel(LogLevel level) {
switch (level) {
case LogLevel::Debug:
return spdlog::level::debug;
case LogLevel::Info:
return spdlog::level::info;
case LogLevel::Warn:
return spdlog::level::warn;
case LogLevel::Error:
return spdlog::level::err;
}
return spdlog::level::info;
}

View File

@@ -1,19 +0,0 @@
/**
* @file src/services/logging/log_producer.cc
* @brief LogProducer implementation for asynchronous pipeline logging.
*/
#include "services/logging/log_producer.h"
#include <chrono>
#include <optional>
#include <string>
#include <string_view>
#include "concurrency/bounded_channel.h"
#include "services/logging/log_entry.h"
LogProducer::LogProducer(BoundedChannel<LogEntry>& channel)
: channel_(channel) {}
void LogProducer::DoLog(LogEntry entry) { channel_.Send(std::move(entry)); }

View File

@@ -4,27 +4,22 @@
  * construction and loads named prompt files on demand with in-process caching.
  */
 
-#include "services/prompting/prompt_directory.h"
+#include "services/prompt_directory.h"
+
+#include <spdlog/spdlog.h>
 
-#include <chrono>
 #include <filesystem>
-#include <format>
 #include <fstream>
 #include <stdexcept>
 #include <string>
 #include <string_view>
-#include <utility>
 
 // ---------------------------------------------------------------------------
 // PromptDirectory
 // ---------------------------------------------------------------------------
 
 PromptDirectory::PromptDirectory(const std::filesystem::path& prompt_dir)
-    : PromptDirectory(prompt_dir, nullptr) {}
-
-PromptDirectory::PromptDirectory(const std::filesystem::path& prompt_dir,
-                                 std::shared_ptr<ILogger> logger)
-    : prompt_dir_(prompt_dir), logger_(std::move(logger)) {
+    : prompt_dir_(prompt_dir) {
   std::error_code ec;
 
   // Scenario 4: directory must exist.
@@ -45,18 +40,12 @@ PromptDirectory::PromptDirectory(const std::filesystem::path& prompt_dir,
   std::filesystem::directory_iterator probe(prompt_dir_, ec);
   if (ec) {
     throw std::runtime_error(
-        std::format("PromptDirectory: prompt directory is not readable: {} ({})",
-                    prompt_dir_.string(), ec.message()));
+        "PromptDirectory: prompt directory is not readable: " +
+        prompt_dir_.string() + " (" + ec.message() + ")");
   }
 
-  if (logger_) {
-    logger_->Log(
-        {.level = LogLevel::Info,
-         .phase = PipelinePhase::Startup,
-         .message =
-             std::string("[PromptDirectory] Resolved prompt directory: ") +
-             prompt_dir_.string()});
-  }
+  spdlog::info("[PromptDirectory] Resolved prompt directory: {}",
+               prompt_dir_.string());
 }
 
 std::string PromptDirectory::Load(std::string_view key) {
@@ -70,13 +59,13 @@ std::string PromptDirectory::Load(std::string_view key) {
 
   // Scenario 3: resolve <prompt_dir>/<key>.md and require it to exist.
   const std::filesystem::path file_path =
-      prompt_dir_ / std::filesystem::path(std::format("{}.md", key_str));
+      prompt_dir_ / std::filesystem::path(key_str + ".md");
 
   std::ifstream file(file_path);
   if (!file.is_open()) {
     throw std::runtime_error(
-        std::format("PromptDirectory: prompt file not found for key '{}': {}",
-                    key_str, file_path.string()));
+        "PromptDirectory: prompt file not found for key '" + key_str +
+        "': " + file_path.string());
   }
 
   std::string content((std::istreambuf_iterator<char>(file)),
@@ -84,16 +73,12 @@ std::string PromptDirectory::Load(std::string_view key) {
   file.close();
 
   if (content.empty()) {
-    throw std::runtime_error(std::format("PromptDirectory: prompt file for key '{}' is empty: {}",
-                                         key_str, file_path.string()));
+    throw std::runtime_error("PromptDirectory: prompt file for key '" +
+                             key_str + "' is empty: " + file_path.string());
   }
 
-  if (logger_) {
-    logger_->Log({.level = LogLevel::Info,
-                  .phase = PipelinePhase::Startup,
-                  .message = std::format("[PromptDirectory] Loaded prompt '{}' from '{}' ({} chars)",
-                                         key_str, file_path.string(), content.size())});
-  }
+  spdlog::info("[PromptDirectory] Loaded prompt '{}' from '{}' ({} chars)",
+               key_str, file_path.string(), content.size());
 
   cache_.emplace(key_str, content);
   return content;

View File

@@ -0,0 +1,23 @@
/**
* @file services/sqlite/build_database_path.cc
* @brief SqliteExportService::BuildDatabasePath() implementation.
*/
#include <filesystem>
#include <string>
#include "services/sqlite_export_service.h"
std::filesystem::path SqliteExportService::BuildDatabasePath() const {
std::filesystem::path base_filename("biergarten_seed_" + run_timestamp_utc_ +
".sqlite");
std::filesystem::path candidate = output_path_ / base_filename;
for (int suffix = 1; std::filesystem::exists(candidate); ++suffix) {
candidate = output_path_ /
std::filesystem::path("biergarten_seed_" + run_timestamp_utc_ +
"-" + std::to_string(suffix) + ".sqlite");
}
return candidate;
}

View File

@@ -5,8 +5,8 @@
 
 #include <stdexcept>
 
-#include "services/database/sqlite_export_service.h"
-#include "services/database/sqlite_export_service_helpers.h"
+#include "services/sqlite_export_service.h"
+#include "services/sqlite_export_service_helpers.h"
 
 void SqliteExportService::Finalize() {
   if (db_handle_ == nullptr) {

View File

@@ -1,6 +1,5 @@
-#include "services/database/sqlite_connection_helpers.h"
+#include "services/sqlite_connection_helpers.h"
 
-#include <format>
 #include <stdexcept>
 
 namespace sqlite_export_service_internal {
@@ -21,7 +20,7 @@ void SqliteStatementDeleter::operator()(
 void ThrowSqliteError(sqlite3* db_handle, std::string_view action) {
   const std::string message =
       db_handle != nullptr ? sqlite3_errmsg(db_handle) : "unknown SQLite error";
-  throw std::runtime_error(std::format("{}: {}", action, message));
+  throw std::runtime_error(std::string(action) + ": " + message);
 }
 
 SqliteDatabaseHandle OpenDatabase(const std::filesystem::path& path) {
@@ -51,7 +50,7 @@ void ExecSql(const SqliteDatabaseHandle& db_handle, std::string_view sql,
                             ? error_message
                             : sqlite3_errmsg(db_handle.get());
     sqlite3_free(error_message);
-    throw std::runtime_error(std::format("{}: {}", action, message));
+    throw std::runtime_error(std::string(action) + ": " + message);
   }
 }

View File

@@ -1,4 +1,4 @@
-#include "services/database/sqlite_statement_helpers.h"
+#include "services/sqlite_statement_helpers.h"
 
 #include <boost/json.hpp>
 #include <cstring>
@@ -6,7 +6,7 @@
 #include <memory>
 #include <stdexcept>
 
-#include "services/database/sqlite_connection_helpers.h"
+#include "services/sqlite_connection_helpers.h"
 
 namespace sqlite_export_service_internal {

View File

@@ -4,27 +4,12 @@
  */
 
 #include <filesystem>
-#include <format>
 #include <memory>
 #include <stdexcept>
 #include <string>
 
-#include "services/database/sqlite_export_service.h"
-#include "services/database/sqlite_export_service_helpers.h"
+#include "services/sqlite_export_service.h"
+#include "services/sqlite_export_service_helpers.h"
 
-std::filesystem::path SqliteExportService::BuildDatabasePath() const {
-  std::filesystem::path base_filename("biergarten_seed_" + run_timestamp_utc_ +
-                                      ".sqlite");
-  std::filesystem::path candidate = output_path_ / base_filename;
-
-  for (int suffix = 1; std::filesystem::exists(candidate); ++suffix) {
-    candidate = output_path_ /
-                std::filesystem::path(std::format("biergarten_seed_{}-{}.sqlite",
-                                                  run_timestamp_utc_, suffix));
-  }
-
-  return candidate;
-}
-
 void SqliteExportService::InitializeSchema() const {
   sqlite_export_service_internal::ExecSql(

View File

@@ -8,8 +8,8 @@
 #include <stdexcept>
 #include <string>
 
-#include "services/database/sqlite_export_service.h"
-#include "services/database/sqlite_export_service_helpers.h"
+#include "services/sqlite_export_service.h"
+#include "services/sqlite_export_service_helpers.h"
 
 constexpr int kLocationPrecision = 17;

View File

@@ -3,7 +3,7 @@
  * @brief SqliteExportService constructor and destructor implementation.
  */
 
-#include "services/database/sqlite_export_service.h"
+#include "services/sqlite_export_service.h"
 
 #include <memory>

View File

@@ -0,0 +1,61 @@
/**
 * @file wikipedia/fetch_extract.cc
 * @brief WikipediaService::FetchExtract() implementation.
 */

#include <spdlog/spdlog.h>

#include <boost/json.hpp>
#include <string>
#include <string_view>

#include "services/wikipedia_service.h"

std::string WikipediaService::FetchExtract(std::string_view query) {
  const std::string cache_key(query);
  const auto cache_it = this->extract_cache_.find(cache_key);
  if (cache_it != this->extract_cache_.end()) {
    return cache_it->second;
  }

  const std::string encoded = this->client_->UrlEncode(cache_key);
  const std::string url =
      "https://en.wikipedia.org/w/api.php?action=query&titles=" + encoded +
      "&prop=extracts&explaintext=1&format=json";

  const std::string body = this->client_->Get(url);

  boost::system::error_code parse_error;
  boost::json::value doc = boost::json::parse(body, parse_error);

  if (!parse_error && doc.is_object()) {
    try {
      auto& pages = doc.at("query").at("pages").get_object();
      if (!pages.empty()) {
        auto& page = pages.begin()->value().get_object();
        if (page.contains("extract") && page.at("extract").is_string()) {
          const std::string_view extract_view = page.at("extract").as_string();
          std::string extract(extract_view);
          spdlog::debug("WikipediaService fetched {} chars for '{}'",
                        extract.size(), query);
          this->extract_cache_.emplace(cache_key, extract);
          return extract;
        }
      }

      this->extract_cache_.emplace(cache_key, std::string{});
    } catch (const std::exception& e) {
      spdlog::warn(
          "WikipediaService: failed to parse response structure for '{}': "
          "{}",
          query, e.what());
      return {};
    }
  } else if (parse_error) {
    spdlog::warn("WikipediaService: JSON parse error for '{}': {}", query,
                 parse_error.message());
  }

  return {};
}
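
The cache-then-fetch flow in `FetchExtract()` can be illustrated in isolation. The following is a minimal sketch, not the project's code: `CachedFetcher` and its injected `fetch` callback are hypothetical names standing in for `WikipediaService` and its `WebClient`, and unlike the diff above the sketch caches every result, including empty ones.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for WikipediaService: remembers the result for each
// query so the injected fetcher runs at most once per distinct key.
class CachedFetcher {
 public:
  explicit CachedFetcher(std::function<std::string(const std::string&)> fetch)
      : fetch_(std::move(fetch)) {}

  std::string Fetch(const std::string& query) {
    if (const auto it = cache_.find(query); it != cache_.end()) {
      return it->second;  // Cache hit: the fetcher is not called again.
    }
    std::string result = fetch_(query);
    cache_.emplace(query, result);  // Assumption: empty results are cached too.
    return result;
  }

 private:
  std::function<std::string(const std::string&)> fetch_;
  std::unordered_map<std::string, std::string> cache_;
};

// Usage example: the second lookup for the same key is served from the cache.
inline void CachedFetcherExample() {
  int calls = 0;
  CachedFetcher fetcher([&calls](const std::string& query) {
    ++calls;
    return "extract for " + query;
  });
  fetcher.Fetch("Berlin, Germany");
  fetcher.Fetch("Berlin, Germany");
  assert(calls == 1);
}
```

Injecting the fetcher as a callback keeps the sketch free of network code, which is roughly the role the `WebClient` interface appears to play in the diff.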


@@ -0,0 +1,47 @@
/**
 * @file wikipedia/get_summary.cc
 * @brief WikipediaService::GetLocationContext() implementation.
 */

#include <spdlog/spdlog.h>

#include <string>

#include "services/wikipedia_service.h"

std::string WikipediaService::GetLocationContext(const Location& loc) {
  if (!client_) {
    return {};
  }

  std::string result;

  std::string region_query(loc.city);
  if (!loc.country.empty()) {
    region_query += ", ";
    region_query += loc.country;
  }

  const std::string beer_query = "beer in " + loc.country;
  const std::string city_beer_query = "beer in " + loc.city;

  auto append_extract = [&result](const std::string& extract) -> void {
    if (extract.empty()) {
      return;
    }
    if (!result.empty()) {
      result += "\n\n";
    }
    result += extract;
  };

  try {
    append_extract(FetchExtract(region_query));
    append_extract(FetchExtract(beer_query));
    append_extract(FetchExtract(city_beer_query));
  } catch (const std::runtime_error& e) {
    spdlog::debug("WikipediaService lookup failed for '{}': {}", region_query,
                  e.what());
  }

  return result;
}
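
The `append_extract` lambda above skips empty extracts and separates the rest with a blank line. A minimal standalone sketch of that joining rule follows; `JoinExtracts` is a hypothetical helper introduced only for illustration and does not exist in the diff.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: concatenates the non-empty extracts, separated by a
// blank line, in the same way the append_extract lambda builds its result.
std::string JoinExtracts(const std::vector<std::string>& extracts) {
  std::string result;
  for (const std::string& extract : extracts) {
    if (extract.empty()) {
      continue;  // Nothing to add for a missing or empty extract.
    }
    if (!result.empty()) {
      result += "\n\n";  // Separator only between two non-empty pieces.
    }
    result += extract;
  }
  return result;
}

// Usage example: the empty middle entry leaves no stray separator behind.
inline void JoinExtractsExample() {
  assert(JoinExtracts({"city", "", "beer"}) == "city\n\nbeer");
}
```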


@@ -0,0 +1,11 @@
/**
 * @file services/wikipedia/wikipedia_service.cc
 * @brief WikipediaService constructor implementation.
 */

#include "services/wikipedia_service.h"

#include <utility>

WikipediaService::WikipediaService(std::unique_ptr<WebClient> client)
    : client_(std::move(client)) {}


@@ -0,0 +1,19 @@
/**
 * @file web_client/curl_global_state.cc
 * @brief CurlGlobalState constructor and destructor implementation.
 */

#include <curl/curl.h>

#include <stdexcept>

#include "web_client/curl_web_client.h"

CurlGlobalState::CurlGlobalState() {
  if (curl_global_init(CURL_GLOBAL_DEFAULT) != CURLE_OK) {
    throw std::runtime_error(
        "[CURLWebClient] Failed to initialize libcurl globally");
  }
}

CurlGlobalState::~CurlGlobalState() { curl_global_cleanup(); }


@@ -0,0 +1,87 @@
/**
 * @file web_client/curl_web_client_get.cc
 * @brief CURLWebClient::Get() implementation.
 */

#include <curl/curl.h>

#include <cstdint>
#include <limits>
#include <memory>
#include <stdexcept>
#include <string>

#include "web_client/curl_web_client.h"

using CurlHandle = std::unique_ptr<CURL, decltype(&curl_easy_cleanup)>;

static constexpr long kConnectionTimeout = 10;
static constexpr long kRequestTimeout = 30;
static constexpr long kMaxRedirects = 5;
static constexpr int32_t kOkHttpStatus = 200;

static CurlHandle CreateHandle() {
  CURL* handle = curl_easy_init();
  if (handle == nullptr) {
    throw std::runtime_error(
        "[CURLWebClient] Failed to initialize libcurl handle");
  }
  return {handle, &curl_easy_cleanup};
}

static void SetCommonGetOptions(CURL* curl, const std::string& url) {
  curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
  curl_easy_setopt(curl, CURLOPT_USERAGENT, "biergarten-pipeline/0.1.0");
  curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
  curl_easy_setopt(curl, CURLOPT_MAXREDIRS, kMaxRedirects);
  curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT, kConnectionTimeout);
  curl_easy_setopt(curl, CURLOPT_TIMEOUT, kRequestTimeout);
  curl_easy_setopt(curl, CURLOPT_ACCEPT_ENCODING, "gzip");
}

// curl write callback that appends response data into a std::string
static size_t WriteCallbackString(void* contents, const size_t size,
                                  const size_t nmemb, void* userp) {
  const size_t real_size = size * nmemb;
  auto* str = static_cast<std::string*>(userp);
  str->append(static_cast<char*>(contents), real_size);
  return real_size;
}

std::string CURLWebClient::Get(const std::string& url) {
  const CurlHandle curl = CreateHandle();

  std::string response_string;
  SetCommonGetOptions(curl.get(), url);
  curl_easy_setopt(curl.get(), CURLOPT_WRITEFUNCTION, WriteCallbackString);
  curl_easy_setopt(curl.get(), CURLOPT_WRITEDATA, &response_string);

  CURLcode curl_result = curl_easy_perform(curl.get());
  if (curl_result != CURLE_OK) {
    const auto error = std::string("[CURLWebClient] GET failed: ") +
                       curl_easy_strerror(curl_result);
    throw std::runtime_error(error);
  }

  long curl_http_code = 0;
  curl_easy_getinfo(curl.get(), CURLINFO_RESPONSE_CODE, &curl_http_code);

  if (curl_http_code < std::numeric_limits<int32_t>::min() ||
      curl_http_code > std::numeric_limits<int32_t>::max()) {
    throw std::runtime_error("[CURLWebClient] Invalid HTTP status code: " +
                             std::to_string(curl_http_code));
  }

  const int32_t http_code = static_cast<int32_t>(curl_http_code);
  if (http_code != kOkHttpStatus) {
    const std::string error = "[CURLWebClient] HTTP error " +
                              std::to_string(http_code) + " for URL " + url;
    throw std::runtime_error(error);
  }

  return response_string;
}
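
`WriteCallbackString` relies on libcurl calling the write callback once per received chunk and treating a return value that differs from the chunk size as an error. The sketch below exercises a callback of the same shape without libcurl; `AppendToString` and `SimulateTwoChunkResponse` are hypothetical names, and the driver only imitates a body arriving in two pieces.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Same shape as a libcurl write callback: appends size * nmemb bytes to the
// std::string passed through the user pointer and reports how many it took.
std::size_t AppendToString(void* contents, std::size_t size, std::size_t nmemb,
                           void* userp) {
  const std::size_t real_size = size * nmemb;
  auto* out = static_cast<std::string*>(userp);
  out->append(static_cast<const char*>(contents), real_size);
  return real_size;  // Returning the full count signals success.
}

// Hypothetical driver that mimics a response body delivered in two chunks.
std::string SimulateTwoChunkResponse() {
  std::string body;
  char first[] = "{\"query\":";
  char second[] = "{}}";
  // sizeof(...) - 1 drops the terminating NUL of each character array.
  AppendToString(first, 1, sizeof(first) - 1, &body);
  AppendToString(second, 1, sizeof(second) - 1, &body);
  return body;
}

// Usage example: both chunks end up in one contiguous string.
inline void WriteCallbackExample() {
  assert(SimulateTwoChunkResponse() == "{\"query\":{}}");
}
```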


@@ -0,0 +1,24 @@
/**
 * @file web_client/curl_web_client_url_encode.cc
 * @brief CURLWebClient::UrlEncode() implementation.
 */

#include <curl/curl.h>

#include <stdexcept>
#include <string>

#include "web_client/curl_web_client.h"

std::string CURLWebClient::UrlEncode(const std::string& value) {
  // libcurl >= 7.82.0 ignores the handle; a length of 0 means strlen() is used.
  char* output = curl_easy_escape(nullptr, value.c_str(), 0);
  if (!output) {
    throw std::runtime_error("[CURLWebClient] curl_easy_escape failed");
  }

  std::string result(output);
  curl_free(output);
  return result;
}
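
For readers without libcurl at hand, the effect of `UrlEncode()` can be approximated with a standard-library sketch of percent-encoding. `PercentEncode` is a hypothetical function, not part of the diff; it assumes the rule documented for `curl_easy_escape`, namely that only `a-z`, `A-Z`, `0-9`, `-`, `.`, `_` and `~` are left unencoded, and it is not guaranteed to match libcurl byte for byte.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Returns true for the characters that stay unencoded under the assumed rule.
bool IsUnreserved(unsigned char byte) {
  return (byte >= 'a' && byte <= 'z') || (byte >= 'A' && byte <= 'Z') ||
         (byte >= '0' && byte <= '9') || byte == '-' || byte == '.' ||
         byte == '_' || byte == '~';
}

// Hypothetical stand-in for UrlEncode(): every other byte becomes %XX with
// uppercase hexadecimal digits.
std::string PercentEncode(std::string_view value) {
  static constexpr char kHex[] = "0123456789ABCDEF";
  std::string out;
  out.reserve(value.size());
  for (const char ch : value) {
    const auto byte = static_cast<unsigned char>(ch);
    if (IsUnreserved(byte)) {
      out.push_back(ch);
    } else {
      out.push_back('%');
      out.push_back(kHex[byte >> 4]);
      out.push_back(kHex[byte & 0x0F]);
    }
  }
  return out;
}

// Usage example: the comma and space of a "city, country" query are escaped.
inline void PercentEncodeExample() {
  assert(PercentEncode("Berlin, Germany") == "Berlin%2C%20Germany");
}
```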


@@ -1,73 +0,0 @@
/**
* @file web_client/http_web_client.cc
* @brief cpp-httplib implementation of WebClient.
*/
#include "web_client/http_web_client.h"
#include <httplib.h>
#include <chrono>
#include <format>
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include "services/logging/logger.h"
namespace {
constexpr time_t kConnectionTimeoutSeconds = 5;
constexpr time_t kReadTimeoutSeconds = 10;
constexpr int kSuccessMin = 200;
constexpr int kSuccessMax = 300;
const std::regex kUrlRegex(
R"(^(https?://[^/?#]+)(/[^?#]*(?:\?[^#]*)?(?:#.*)?)?)");
std::pair<std::string, std::string> SplitUrl(const std::string& url) {
std::smatch match;
if (!std::regex_match(url, match, kUrlRegex)) {
throw std::invalid_argument("[HttpWebClient] Malformed URL: " + url);
}
return {match[1].str(), match[2].matched ? match[2].str() : "/"};
}
} // namespace
std::string HttpWebClient::Get(const std::string& url) {
const auto [origin, path] = SplitUrl(url);
httplib::Client client(origin);
client.set_follow_location(true);
client.set_connection_timeout(kConnectionTimeoutSeconds);
client.set_read_timeout(kReadTimeoutSeconds);
client.set_default_headers({{"Accept", "application/json"},
{"User-Agent", "biergarten-pipeline/1.0"}});
const httplib::Result result = client.Get(path);
if (!result) {
throw std::runtime_error(std::format(
"[HttpWebClient] Request failed for URL: {} — {}", url,
httplib::to_string(result.error())));
}
if (result->status < kSuccessMin || result->status >= kSuccessMax) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Error,
.phase = PipelinePhase::UserGeneration,
.message =
std::format("[HttpWebClient] Request failed for URL: {}", url)});
}
throw std::runtime_error(std::format("[HttpWebClient] HTTP {} for URL: {}",
result->status, url));
}
return result->body;
}
std::string HttpWebClient::EncodeURL(const std::string& value) {
return httplib::encode_uri_component(value);
}