8 Commits

Author SHA1 Message Date
6a66619c70 Add multithreaded logging infrastructure for preparation for future designs (#225)
* Update class diagrams

* Implement BoundedChannel and multithreaded logging infra

* Integrate logging channel system

* Update string concatenations to use std::format

* Add pretty print log
2026-05-22 22:00:38 -04:00
2ee7b3d2a2 Add timeout to wikipedia enrichment to avoid breaking rate limits, add mock enrichment (#224)
* Add timeout for enrichment, refactor json deserialization

* Add location count to application options and as a cli arg

* Add mock enrichment process
2026-05-14 19:15:51 -04:00
b7c0b1c8d4 Fix mistake in .gitattributes
archive/* is incorrect as it will ignore sub-dirs
2026-05-12 01:05:07 -04:00
b8ebe03921 Pipeline: Add Runpod docker configuration (#222)
* Begin work on Runpod docker config

* Reduce docker image size

* Create .dockerignore
2026-05-12 00:44:09 -04:00
26635ace84 Organize and consolidate header files (#220) 2026-05-03 21:44:37 -04:00
031be8ad5d Pipeline: Remove CURL as a dependency, add new HTTP module (#219)
Rationale: 

HTTP is a supporting concern in the pipeline, used only for Wikipedia enrichment calls. libcurl's C API required significant boilerplate to wrap safely. cpp-httplib is a header-only library that covers the same functionality with far less overhead and no manual resource management.
2026-05-03 13:35:58 -04:00
f316fabcb0 Update CMakeLists.txt (#218) 2026-05-02 19:27:44 -04:00
b1dc8e0b5d refactor(pipeline): restructure config, add PromptDirectory, consolidate SQLite layer (#217)
* Refactor ApplicationOptions to separate config concerns

* add prompt dir app option

* readability updates: remove magic numbers, update comments

* codebase formatting

* Update docs

* Extract argument parsing, timer out of
2026-05-02 18:27:14 -04:00
100 changed files with 3080 additions and 1450 deletions

2
.gitattributes vendored
View File

@@ -1 +1 @@
-archive/* linguist-vendored
+archive/** linguist-vendored

View File

@@ -18,6 +18,7 @@ descriptions via a local GGUF model or a deterministic mock.
- [Build](#build) - [Build](#build)
- [Model](#model) - [Model](#model)
- [Run](#run) - [Run](#run)
- [Docker / RunPod](#docker--runpod)
- [Architecture](#architecture) - [Architecture](#architecture)
- [Pipeline Stages](#pipeline-stages) - [Pipeline Stages](#pipeline-stages)
- [Key Components](#key-components) - [Key Components](#key-components)
@@ -51,7 +52,7 @@ step.
### Build ### Build
-Requirements: C++20 compiler, CMake 3.24+, libcurl, Boost (JSON and
+Requirements: C++20 compiler, CMake 3.31+, OpenSSL, Boost (JSON and
 ProgramOptions). SQLite is fetched from the upstream amalgamation, so no system
 SQLite package is required.
@@ -60,6 +61,16 @@ cmake -S . -B build
cmake --build build cmake --build build
``` ```
CMake automatically detects whether a compatible llama.cpp installation is
present on the system (`libllama`, `libggml`, `libggml-base`, and `llama.h`
visible on the default search paths). If found, it links against those
libraries and skips the FetchContent build. If not found, it fetches and builds
llama.cpp from source at tag `b9012`. No additional flags are required in
either case.
Metal is enabled automatically on Apple Silicon. CUDA or HIP/ROCm is detected
automatically on Linux when the relevant toolkit is present.
### Model ### Model
> Skip this step if you only need `--mocked`. > Skip this step if you only need `--mocked`.
@@ -74,20 +85,27 @@ curl -L \
### Run ### Run
Run from `build/` so the copied `locations.json` and `prompts/` are available. Run from `build/` so the copied `locations.json` and `prompts/` are available.
-Each run also writes a fresh dated SQLite file such as
+Each run writes a fresh dated SQLite file such as
 `biergarten_seed_2026-04-19T15-30-45.123456Z.sqlite` into the working directory.
```bash ```bash
./biergarten-pipeline --mocked ./biergarten-pipeline --mocked
./biergarten-pipeline --model models/google_gemma-4-E4B-it-Q6_K.gguf --temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1
./biergarten-pipeline \
--model ../models/google_gemma-4-E4B-it-Q6_K.gguf \
--prompt-dir prompts \
--temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1
``` ```
#### CLI Flags #### CLI Flags
| Flag | Purpose | | Flag | Purpose |
| --------------- | ------------------------------------------------------- | | --------------- | ---------------------------------------------------------------------------------------------------- |
| `--mocked` | Deterministic mock generator, no model required. | | `--mocked` | Deterministic mock generator, no model required. |
| `--model, -m` | Path to a GGUF file. Required unless `--mocked` is set. | | `--model, -m` | Path to a GGUF file. Required unless `--mocked` is set. |
| `--prompt-dir` | Directory containing prompt files (e.g. `BREWERY_GENERATION.md`). Required unless `--mocked` is set. |
| `--output, -o` | Directory for generated SQLite artifacts. Default: `output`. |
| `--log-path` | Path for application logs. Default: `pipeline.log`. |
| `--temperature` | Sampling temperature. Default: `1.0`. | | `--temperature` | Sampling temperature. Default: `1.0`. |
| `--top-p` | Nucleus sampling. Default: `0.95`. | | `--top-p` | Nucleus sampling. Default: `0.95`. |
| `--top-k` | Top-k sampling. Default: `64`. | | `--top-k` | Top-k sampling. Default: `64`. |
@@ -100,7 +118,91 @@ error before the pipeline starts. Sampling flags are ignored when `--mocked` is
set. set.
The post-build step copies `prompts/` into `build/prompts/`. Rebuild after The post-build step copies `prompts/` into `build/prompts/`. Rebuild after
editing `prompts/system.md`. editing any prompt file.
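
The flag rules described in this README hunk (`--model` and `--prompt-dir` are required unless `--mocked` is set, and sampling flags are ignored in mocked mode) can be sketched as a small validation step. This is a minimal illustration, not the project's actual parser: the helper name `ValidateGeneratorOptions`, the `SamplingOptions` field names, and passing the prompt directory as a separate argument are all assumptions. Only the `GeneratorOptions` field names and the default values come from the diff.

```cpp
#include <filesystem>
#include <optional>
#include <string>
#include <vector>

// Field names here are assumed; the defaults mirror the README's CLI table.
struct SamplingOptions {
  float temperature = 1.0f;
  float top_p = 0.95f;
  int top_k = 64;
};

// Mirrors the GeneratorOptions shape shown in the class diagram diff.
struct GeneratorOptions {
  std::filesystem::path model_path;
  bool use_mocked = false;
  std::optional<SamplingOptions> sampling;  // unset or ignored when mocked
};

// Hypothetical helper: returns one message per violated rule so the caller
// can report every problem before the pipeline starts.
std::vector<std::string> ValidateGeneratorOptions(
    const GeneratorOptions& options,
    const std::filesystem::path& prompt_dir) {
  std::vector<std::string> errors;
  if (options.use_mocked) {
    return errors;  // mocked mode needs neither a model nor prompt files
  }
  if (options.model_path.empty()) {
    errors.push_back("--model is required unless --mocked is set");
  }
  if (prompt_dir.empty()) {
    errors.push_back("--prompt-dir is required unless --mocked is set");
  }
  return errors;
}
```

Collecting messages rather than throwing on the first failure lets the CLI print all missing flags in a single run.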
---
## Docker / RunPod
The `tooling/pipeline/runpod/` directory contains a GPU-ready container
configuration for running the pipeline on RunPod or any Docker host with an
NVIDIA GPU.
### How it works
The container uses a two-stage build. The first stage pulls prebuilt
`libllama`, `libggml`, and backend plugin libraries (including `libggml-cuda.so`
and the CPU variant plugins) from `ghcr.io/ggml-org/llama.cpp:full-cuda`. The
second stage copies those libraries into `/usr/local/lib` and runs `ldconfig` so
the dynamic linker and `dlopen` calls from `ggml_backend_load_all()` can resolve
the CUDA backend plugin at runtime. llama.cpp headers are cloned at the matching
tag and installed into `/usr/local/include`. CMake auto-detects both and skips
the FetchContent source build entirely, keeping image build times short.
`GGML_BACKEND_PATH` is set to `/usr/local/lib` so llama.cpp knows where to scan
for backend plugins.
### Build the image
Run from the `tooling/pipeline/` directory (the CMake project root), not from
inside `runpod/`, so the `COPY . .` step picks up the full project context.
```bash
docker build -t biergarten-pipeline:latest -f runpod/Dockerfile .
```
To monitor the full build output and confirm CMake selects the system llama.cpp:
```bash
docker build \
--progress=plain \
--no-cache \
-t biergarten-pipeline:latest \
-f runpod/Dockerfile \
. 2>&1 | tee build.log
```
Look for `[biergarten] Found system llama.cpp — skipping FetchContent` in the
output to confirm the fast path was taken.
### Run in mocked mode
No model or GPU required. Useful for validating the pipeline logic and SQLite
export path.
```bash
docker run --rm \
-e BIERGARTEN_MODE=mocked \
-v "$PWD/output:/workspace/output" \
-v "$PWD/logs:/workspace/logs" \
biergarten-pipeline:latest
```
### Run in live mode
Mount your GGUF model before starting. The container validates the model path
before launching the binary.
```bash
docker run --rm \
--runtime=nvidia \
-e BIERGARTEN_MODE=live \
-e GGML_BACKEND_PATH="/usr/local/lib/libggml-cuda.so" \
-v "$PWD/models:/workspace/models" \
-v "$PWD/output:/workspace/output" \
-v "$PWD/logs:/workspace/logs" \
biergarten-pipeline:latest
```
The model must be present at `./models/google_gemma-4-E4B-it-Q6_K.gguf` on the
host. See [Model](#model) above for the download command.
### RunPod deployment
Use a GPU pod template. Mount persistent storage for `/workspace/models`,
`/workspace/output`, and `/workspace/logs`. Set `BIERGARTEN_MODE=live` in the
template environment. See `tooling/pipeline/runpod/pod-template.yaml` for a
starter template.
--- ---
@@ -197,16 +299,18 @@ code, latitude, and longitude for each entry.
 ## Tech Stack
 - C++20
-- CMake 3.24+
+- CMake 3.31+
 - Boost.JSON, Boost.ProgramOptions, Boost.DI
 - spdlog
-- libcurl
+- cpp-httplib (with OpenSSL)
 - SQLite amalgamation fetched and compiled via CMake FetchContent
-- llama.cpp
+- llama.cpp (auto-detected from system install or fetched via FetchContent)
+- Docker with NVIDIA CUDA 12.6 base image for GPU container builds
+- RunPod for cloud GPU inference
-The build fetches Boost.DI, spdlog, llama.cpp, and SQLite via CMake. Metal is
-enabled on Apple Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit
-is present.
+The build fetches Boost.DI, spdlog, and SQLite via CMake. llama.cpp is fetched
+only when a system installation is not detected. Metal is enabled on Apple
+Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
> **Code Style:** Modern C++20 throughout — RAII for ownership, > **Code Style:** Modern C++20 throughout — RAII for ownership,
> `std::unique_ptr` for injected dependencies, `std::optional` for parse > `std::unique_ptr` for injected dependencies, `std::optional` for parse
@@ -218,7 +322,7 @@ is present.
## Tested Hardware ## Tested Hardware
### ARM macOS - M1 Pro ### ARM macOS M1 Pro
| | | | | |
| --------- | --------------------------------- | | --------- | --------------------------------- |
@@ -229,7 +333,7 @@ is present.
| Model | Gemma 4 E4B | | Model | Gemma 4 E4B |
| Inference | llama.cpp with Metal | | Inference | llama.cpp with Metal |
### x86_64 Linux - NVIDIA RTX 2000 ### x86_64 Linux NVIDIA RTX 2000
| | | | | |
| --------- | ------------------------------ | | --------- | ------------------------------ |
@@ -240,6 +344,15 @@ is present.
| Model | Gemma 4 E4B | | Model | Gemma 4 E4B |
| Inference | llama.cpp with CUDA 12.x | | Inference | llama.cpp with CUDA 12.x |
### x86_64 Linux — Docker / RunPod (NVIDIA CUDA)
| | |
| --------- | ------------------------------------------- |
| Host | RunPod GPU pod |
| Base | nvidia/cuda:12.6.3-devel-ubuntu24.04 |
| Model | Gemma 4 E4B Q6_K |
| Inference | llama.cpp prebuilt CUDA backends via dlopen |
--- ---
## Fixture Strategy ## Fixture Strategy
@@ -260,8 +373,9 @@ is present.
| `includes/` | Public headers and shared models. | | `includes/` | Public headers and shared models. |
| `src/` | Implementation files. | | `src/` | Implementation files. |
| `locations.json` | Curated city input copied into the build tree. | | `locations.json` | Curated city input copied into the build tree. |
| `prompts/` | System prompt used by the model-backed path. | | `prompts/` | System prompts used by the model-backed path. |
| `diagrams/` | Architecture and pipeline diagrams. | | `diagrams/` | Architecture and pipeline diagrams. |
| `tooling/pipeline/runpod/` | Dockerfile, launcher, and RunPod pod template. |
| `ETHICS-AND-KNOWN-ISSUES.md` | Ethics, bias, hallucination analysis, mitigations. | | `ETHICS-AND-KNOWN-ISSUES.md` | Ethics, bias, hallucination analysis, mitigations. |
--- ---
@@ -276,6 +390,7 @@ is present.
- `src/data_generation/llama/` — local inference, prompt loading, output - `src/data_generation/llama/` — local inference, prompt loading, output
validation. validation.
- `src/data_generation/mock/` — deterministic fallback. - `src/data_generation/mock/` — deterministic fallback.
- `tooling/pipeline/runpod/` — container build and runtime launcher.
--- ---

View File

@@ -29,7 +29,7 @@ if (Are arguments valid?) then (no)
else (yes) else (yes)
endif endif
:Init CurlGlobalState & LlamaBackendState; :Init OpenSSL global state & LlamaBackendState;
:di::make_injector(...); :di::make_injector(...);
:injector.create<std::unique_ptr<BiergartenDataGenerator>>(); :injector.create<std::unique_ptr<BiergartenDataGenerator>>();
:BiergartenDataGenerator::Run(); :BiergartenDataGenerator::Run();

View File

@@ -26,6 +26,7 @@ skinparam note {
title The Biergarten Data Pipeline - Class Diagram title The Biergarten Data Pipeline - Class Diagram
class BiergartenDataGenerator { class BiergartenDataGenerator {
- logger_ : std::shared_ptr<ILogger>
- context_service_ : std::unique_ptr<IEnrichmentService> - context_service_ : std::unique_ptr<IEnrichmentService>
- generator_ : std::unique_ptr<DataGenerator> - generator_ : std::unique_ptr<DataGenerator>
- exporter_ : std::unique_ptr<IExportService> - exporter_ : std::unique_ptr<IExportService>
@@ -36,6 +37,46 @@ class BiergartenDataGenerator {
- LogResults() : void - LogResults() : void
} }
class LogLevel <<enumeration>> {
Debug
Info
Warn
Error
}
class PipelinePhase <<enumeration>> {
Startup
UserGeneration
BreweryAndBeerGeneration
CheckinGeneration
RatingGeneration
FollowGeneration
Teardown
}
struct LogEntry {
+ timestamp : std::chrono::system_clock::time_point
+ level : LogLevel
+ phase : PipelinePhase
+ message : std::string
+ worker : std::optional<std::string>
}
interface ILogger <<interface>> {
+ Log(entry : const LogEntry&) : void
}
class LogProducer {
- channel_ : BoundedChannel<LogEntry>&
+ Log(entry : const LogEntry&) : void
}
class LogDispatcher {
- channel_ : BoundedChannel<LogEntry>&
+ Run() : void
- ToSpdlogLevel(level) : spdlog::level::level_enum
}
interface IEnrichmentService <<interface>> { interface IEnrichmentService <<interface>> {
+ GetLocationContext(loc : const Location&) : std::string + GetLocationContext(loc : const Location&) : std::string
} }
@@ -52,7 +93,7 @@ interface WebClient <<interface>> {
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
class CURLWebClient { class HttpWebClient {
+ Get(url : const std::string&) : std::string + Get(url : const std::string&) : std::string
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
@@ -123,14 +164,21 @@ class SystemDateTimeProvider {
} }
' Structural Relationships / Dependency Injection ' Structural Relationships / Dependency Injection
BiergartenDataGenerator *-- ILogger : owns
BiergartenDataGenerator *-- IEnrichmentService : owns BiergartenDataGenerator *-- IEnrichmentService : owns
BiergartenDataGenerator *-- DataGenerator : owns BiergartenDataGenerator *-- DataGenerator : owns
BiergartenDataGenerator *-- IExportService : owns BiergartenDataGenerator *-- IExportService : owns
LogEntry *-- LogLevel
LogEntry *-- PipelinePhase
ILogger <|.. LogProducer : implements
LogProducer ..> LogEntry : emits
LogDispatcher ..> LogEntry : consumes
IEnrichmentService <|.. WikipediaService : implements IEnrichmentService <|.. WikipediaService : implements
WikipediaService *-- WebClient : owns WikipediaService *-- WebClient : owns
WebClient <|.. CURLWebClient : implements WebClient <|.. HttpWebClient : implements
DataGenerator <|.. MockGenerator : implements DataGenerator <|.. MockGenerator : implements
DataGenerator <|.. LlamaGenerator : implements DataGenerator <|.. LlamaGenerator : implements

View File

@@ -13,7 +13,7 @@ if (Invalid args?) then (yes)
stop stop
else (no) else (no)
endif endif
:Init CurlGlobalState & LlamaBackendState; :Init OpenSSL global state & LlamaBackendState;
:Build DI injector; :Build DI injector;
:Initialize SqliteExportService; :Initialize SqliteExportService;

View File

@@ -1,4 +1,4 @@
@startuml @startuml class_diagram
' ========================================== ' ==========================================
' CONFIGURATION & STYLING ' CONFIGURATION & STYLING
@@ -8,6 +8,8 @@ skinparam classAttributeFontSize 9
skinparam defaultFontSize 25 skinparam defaultFontSize 25
skinparam titleFontSize 30 skinparam titleFontSize 30
title Biergarten Data Pipeline — Class Diagram
package "Domain: Models" { package "Domain: Models" {
class Location { class Location {
@@ -154,7 +156,7 @@ package "Domain: Application Configuration"{
class GeneratorOptions { class GeneratorOptions {
+ model_path: std::filesystem::path + model_path: std::filesystem::path
+ use_mocked: bool = false + use_mocked: bool = false
+ sampling : SamplingOptions + sampling: std::optional<SamplingOptions>
} }
class PipelineOptions { class PipelineOptions {
@@ -167,10 +169,9 @@ package "Domain: Application Configuration"{
+ pipeline: PipelineOptions + pipeline: PipelineOptions
} }
' --- Domain Model Relationships ---
ApplicationOptions *-- GeneratorOptions ApplicationOptions *-- GeneratorOptions
ApplicationOptions *-- PipelineOptions ApplicationOptions *-- PipelineOptions
GeneratorOptions *-- SamplingOptions GeneratorOptions o-- SamplingOptions
} }
package "Domain: Policy" { package "Domain: Policy" {
@@ -274,33 +275,29 @@ package "Infrastructure: Logging" {
+ level : LogLevel + level : LogLevel
+ phase : PipelinePhase + phase : PipelinePhase
+ message : std::string + message : std::string
+ city : std::optional<std::string>
+ entity_id : std::optional<std::string>
+ worker : std::optional<std::string> + worker : std::optional<std::string>
} }
interface Logger <<interface>> { interface ILogger <<interface>> {
+ Log(level, phase, message,\n city, entity_id, worker) : void + Log(entry : const LogEntry&) : void
} }
class PipelineLogger { class LogProducer {
- log_ch_ : BoundedChannel<LogEntry>& - channel_ : BoundedChannel<LogEntry>&
+ Log(level, phase, message,\n city, entity_id, worker) : void + Log(entry : const LogEntry&) : void
} }
class LogWorker { class LogDispatcher {
- log_ch_ : BoundedChannel<LogEntry>& - channel_ : BoundedChannel<LogEntry>&
+ Run() : void + Run() : void
- FormatTimestamp(tp) : std::string
- ToSpdlogLevel(level) : spdlog::level::level_enum - ToSpdlogLevel(level) : spdlog::level::level_enum
- ToString(phase) : std::string
} }
' --- Logging Relationships ---
LogEntry *-- LogLevel LogEntry *-- LogLevel
LogEntry *-- PipelinePhase LogEntry *-- PipelinePhase
PipelineLogger ..> LogEntry : emits ILogger <|.. LogProducer
LogWorker ..> LogEntry : consumes LogProducer ..> LogEntry : emits
LogDispatcher ..> LogEntry : consumes
} }
package "Infrastructure: Pipeline Channel" { package "Infrastructure: Pipeline Channel" {
@@ -355,13 +352,29 @@ package "Infrastructure: Enrichment" {
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
class CURLWebClient { class HttpWebClient {
+ Get(url : const std::string&) : std::string + Get(url : const std::string&) : std::string
+ UrlEncode(value : const std::string&) : std::string + UrlEncode(value : const std::string&) : std::string
} }
} }
package "Infrastructure: Prompting" {
interface IPromptDirectory <<interface>> {
+ Load(key : std::string_view) : std::string
}
class PromptDirectory {
- prompt_dir_ : std::filesystem::path
- cache_ : std::unordered_map<std::string, std::string>
+ PromptDirectory(prompt_dir : const std::filesystem::path&)
+ Load(key : std::string_view) : std::string
}
IPromptDirectory <|.. PromptDirectory
}
package "Infrastructure: Data Generation" { package "Infrastructure: Data Generation" {
interface DataGenerator <<interface>> { interface DataGenerator <<interface>> {
@@ -385,6 +398,7 @@ package "Infrastructure: Data Generation" {
- model_ : ModelHandle - model_ : ModelHandle
- context_ : ContextHandle - context_ : ContextHandle
- prompt_formatter_ : std::unique_ptr<PromptFormatter> - prompt_formatter_ : std::unique_ptr<PromptFormatter>
- prompt_directory_ : std::unique_ptr<IPromptDirectory>
- rng_ : std::mt19937 - rng_ : std::mt19937
+ GenerateBrewery(...) : BreweryResult + GenerateBrewery(...) : BreweryResult
+ GenerateBeer(...) : BeerResult + GenerateBeer(...) : BeerResult
@@ -458,8 +472,6 @@ package "Infrastructure: Data Export" {
} }
class BiergartenPipelineOrchestrator { class BiergartenPipelineOrchestrator {
- preloader_ : std::unique_ptr<DataPreloader> - preloader_ : std::unique_ptr<DataPreloader>
- enrichment_service_ : std::unique_ptr<EnrichmentService> - enrichment_service_ : std::unique_ptr<EnrichmentService>
@@ -519,7 +531,7 @@ CheckinDistributionStrategy <|.. RandomCheckinStrategy
FollowGenerationStrategy <|.. RandomFollowStrategy FollowGenerationStrategy <|.. RandomFollowStrategy
FollowGenerationStrategy <|.. ActivityWeightedFollowStrategy FollowGenerationStrategy <|.. ActivityWeightedFollowStrategy
EnrichmentService <|.. WikipediaService EnrichmentService <|.. WikipediaService
WebClient <|.. CURLWebClient WebClient <|.. HttpWebClient
DataGenerator <|.. MockGenerator DataGenerator <|.. MockGenerator
DataGenerator <|.. LlamaGenerator DataGenerator <|.. LlamaGenerator
PromptFormatter <|.. Gemma4JinjaPromptFormatter PromptFormatter <|.. Gemma4JinjaPromptFormatter
@@ -530,6 +542,7 @@ DateTimeProvider <|.. SystemDateTimeProvider
WikipediaService *-- WebClient WikipediaService *-- WebClient
WikipediaService ..> ContextStrategy WikipediaService ..> ContextStrategy
LlamaGenerator *-- PromptFormatter LlamaGenerator *-- PromptFormatter
LlamaGenerator *-- IPromptDirectory
LlamaGenerator ..> GeneratorOptions LlamaGenerator ..> GeneratorOptions
SqliteExportService *-- DateTimeProvider SqliteExportService *-- DateTimeProvider
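
The `PromptDirectory` class introduced in this diagram pairs a directory path with an in-memory cache, which suggests a load-once-per-key design. The following is a rough sketch of that idea, not the project's implementation. The mapping from a key to `<prompt_dir>/<key>.md` is an assumption based on the README's `BREWERY_GENERATION.md` example, and throwing `std::runtime_error` for a missing file is also assumed.

```cpp
#include <filesystem>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_map>

class IPromptDirectory {
 public:
  virtual ~IPromptDirectory() = default;
  virtual std::string Load(std::string_view key) = 0;
};

class PromptDirectory : public IPromptDirectory {
 public:
  explicit PromptDirectory(const std::filesystem::path& prompt_dir)
      : prompt_dir_(prompt_dir) {}

  // Returns the prompt text for `key`, reading the file only on first use.
  std::string Load(std::string_view key) override {
    const std::string name(key);
    if (auto it = cache_.find(name); it != cache_.end()) {
      return it->second;
    }
    // Assumption: each key corresponds to a Markdown file named "<key>.md".
    const std::filesystem::path path = prompt_dir_ / (name + ".md");
    std::ifstream file(path);
    if (!file) {
      throw std::runtime_error("prompt file not found: " + path.string());
    }
    std::ostringstream buffer;
    buffer << file.rdbuf();
    return cache_.emplace(name, buffer.str()).first->second;
  }

 private:
  std::filesystem::path prompt_dir_;
  std::unordered_map<std::string, std::string> cache_;
};
```

Because results are cached for the lifetime of the object, edits to a prompt file would not be picked up until the process restarts. Hiding the class behind `IPromptDirectory` lets tests inject an in-memory stand-in for `LlamaGenerator`.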

View File

@@ -0,0 +1,9 @@
build/
cmake-build-debug/
.git/
.idea/
**/*.sqlite
**/*.log
**/*.sqlite3
**/*.db

View File

@@ -6,3 +6,4 @@ data
models models
*.gguf *.gguf
BiergartenPipeline.png BiergartenPipeline.png
output

View File

@@ -1,13 +1,20 @@
-cmake_minimum_required(VERSION 3.24)
+cmake_minimum_required(VERSION 3.31)
 project(biergarten-pipeline)
-set(CMAKE_POLICY_VERSION_MINIMUM 3.5 CACHE STRING "" FORCE)
+# Set policy to allow FetchContent_Populate for header-only libraries
+# that have outdated CMakeLists.txt files
+cmake_policy(SET CMP0169 OLD)
-# =============================================================================
-# 1. Platform & GPU Detection
-# =============================================================================
-if(WIN32)
-    message(FATAL_ERROR "[biergarten] Windows is currently not supported. Please use Linux (Fedora 43) or macOS (M1 Pro).")
+# 1. Build Options
+option(BIERGARTEN_MOCK_ONLY "Build with mock data generators only — skips llama.cpp" OFF)
+if(BIERGARTEN_MOCK_ONLY)
+    message(STATUS "[biergarten] MOCK_ONLY build — llama.cpp will not be compiled.")
+endif()
+# 2. Platform & GPU Detection
+if(NOT UNIX)
+    message(FATAL_ERROR "[biergarten] Windows is not supported. Please use Linux (Fedora 43) or macOS (M1 Pro).")
 endif()
if(APPLE) if(APPLE)
@@ -18,15 +25,15 @@ if(APPLE)
message(STATUS "[biergarten] Intel Mac detected — using CPU / Accelerate framework.") message(STATUS "[biergarten] Intel Mac detected — using CPU / Accelerate framework.")
set(GGML_METAL OFF CACHE BOOL "Disable Metal for Intel Macs" FORCE) set(GGML_METAL OFF CACHE BOOL "Disable Metal for Intel Macs" FORCE)
endif() endif()
elseif(UNIX AND NOT APPLE) else()
find_package(CUDAToolkit QUIET) find_package(CUDAToolkit QUIET)
find_package(HIP QUIET) find_package(hip CONFIG QUIET)
if(CUDAToolkit_FOUND) if(CUDAToolkit_FOUND)
message(STATUS "[biergarten] NVIDIA GPU detected — enabling CUDA acceleration.") message(STATUS "[biergarten] NVIDIA GPU detected — enabling CUDA acceleration.")
set(GGML_CUDA ON CACHE BOOL "Enable CUDA for NVIDIA GPUs" FORCE) set(GGML_CUDA ON CACHE BOOL "Enable CUDA for NVIDIA GPUs" FORCE)
set(CMAKE_CUDA_ARCHITECTURES native) set(CMAKE_CUDA_ARCHITECTURES native)
elseif(HIP_FOUND OR EXISTS "/opt/rocm") elseif(hip_FOUND OR DEFINED ENV{ROCM_PATH} OR EXISTS "/opt/rocm")
message(STATUS "[biergarten] AMD GPU detected — enabling HIP/ROCm acceleration.") message(STATUS "[biergarten] AMD GPU detected — enabling HIP/ROCm acceleration.")
set(GGML_HIPBLAS ON CACHE BOOL "Enable HIP for AMD GPUs" FORCE) set(GGML_HIPBLAS ON CACHE BOOL "Enable HIP for AMD GPUs" FORCE)
else() else()
@@ -34,71 +41,79 @@ elseif(UNIX AND NOT APPLE)
endif() endif()
endif() endif()
# ============================================================================= # 3. Project-wide Settings
# 2. Project-wide Settings (Standard & Optimization)
# =============================================================================
set(CMAKE_CXX_STANDARD 20) set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON) set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON) set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
# Release Build Optimization: Aggressive (-O3), Arch-specific, and LTO
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3 -march=native -flto") set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3 -march=native -flto")
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -Og -g") set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -Og -g")
# ============================================================================= # 4. Dependencies
# 3. Dependencies
# =============================================================================
include(FetchContent) include(FetchContent)
find_package(CURL QUIET) # Boost (system install — via dnf/brew)
if(NOT CURL_FOUND)
message(FATAL_ERROR "[biergarten] libcurl not found. Install it (e.g. 'sudo dnf install libcurl-devel').")
endif()
# Require system Boost for JSON and Program Options to speed up build times
find_package(Boost REQUIRED COMPONENTS json program_options) find_package(Boost REQUIRED COMPONENTS json program_options)
FetchContent_Declare( # Boost.DI (unofficial Boost extension, must declare separately from main Boost dependency)
sqlite_amalgamation # Header-only library, so we only fetch without invoking its CMakeLists.txt
URL https://www.sqlite.org/2026/sqlite-amalgamation-3530000.zip
URL_HASH SHA3_256=c2325c53b3b41761469f91cfb078e96882ac5d85bac10c11b0bd8f253b031e5b
)
FetchContent_GetProperties(sqlite_amalgamation)
if(NOT sqlite_amalgamation_POPULATED)
FetchContent_Populate(sqlite_amalgamation)
endif()
if(NOT TARGET sqlite3)
add_library(sqlite3 STATIC
${sqlite_amalgamation_SOURCE_DIR}/sqlite3.c
)
target_include_directories(sqlite3 PUBLIC
${sqlite_amalgamation_SOURCE_DIR}
)
target_compile_definitions(sqlite3 PUBLIC
SQLITE_THREADSAFE=1
)
endif()
FetchContent_Declare(
llama-cpp
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp.git
GIT_TAG b8742
)
FetchContent_MakeAvailable(llama-cpp)
FetchContent_Declare( FetchContent_Declare(
boost-di boost-di
GIT_REPOSITORY https://github.com/boost-ext/di.git GIT_REPOSITORY https://github.com/boost-ext/di.git
GIT_TAG v1.3.0 GIT_TAG v1.3.0
GIT_SHALLOW TRUE
) )
FetchContent_MakeAvailable(boost-di) FetchContent_GetProperties(boost-di)
if(TARGET Boost.DI AND NOT TARGET boost::di) if(NOT boost-di_POPULATED)
add_library(boost::di ALIAS Boost.DI) FetchContent_Populate(boost-di)
endif() endif()
add_library(boost_di INTERFACE)
add_library(boost::di ALIAS boost_di)
target_include_directories(boost_di INTERFACE
$<BUILD_INTERFACE:${boost-di_SOURCE_DIR}/include>
)
# SQLite amalgamation
FetchContent_Declare(
sqlite_amalgamation
URL https://www.sqlite.org/2026/sqlite-amalgamation-3530000.zip
URL_HASH SHA3_256=c2325c53b3b41761469f91cfb078e96882ac5d85bac10c11b0bd8f253b031e5b
EXCLUDE_FROM_ALL
)
FetchContent_MakeAvailable(sqlite_amalgamation)
if(NOT TARGET sqlite3)
add_library(sqlite3 STATIC ${sqlite_amalgamation_SOURCE_DIR}/sqlite3.c)
target_include_directories(sqlite3 PUBLIC ${sqlite_amalgamation_SOURCE_DIR})
target_compile_definitions(sqlite3 PUBLIC SQLITE_THREADSAFE=1)
endif()
# llama.cpp — skipped for mock-only builds
if(NOT BIERGARTEN_MOCK_ONLY)
find_library(LLAMA_LIB NAMES llama)
find_library(GGML_LIB NAMES ggml)
find_library(GGML_BASE_LIB NAMES ggml-base)
find_path(LLAMA_INC_DIR NAMES llama.h PATH_SUFFIXES include)
if(LLAMA_LIB AND GGML_LIB AND GGML_BASE_LIB AND LLAMA_INC_DIR)
message(STATUS "[biergarten] Found system llama.cpp — skipping FetchContent")
add_library(llama SHARED IMPORTED)
set_target_properties(llama PROPERTIES
IMPORTED_LOCATION "${LLAMA_LIB}"
INTERFACE_INCLUDE_DIRECTORIES "${LLAMA_INC_DIR}"
INTERFACE_LINK_LIBRARIES "${GGML_LIB};${GGML_BASE_LIB}"
)
else()
message(STATUS "[biergarten] System llama.cpp not found — fetching via FetchContent")
FetchContent_Declare(
llama-cpp
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp.git
GIT_TAG b9012
)
FetchContent_MakeAvailable(llama-cpp)
endif()
endif()
# spdlog
FetchContent_Declare( FetchContent_Declare(
spdlog spdlog
GIT_REPOSITORY https://github.com/gabime/spdlog.git GIT_REPOSITORY https://github.com/gabime/spdlog.git
@@ -106,73 +121,148 @@ FetchContent_Declare(
) )
FetchContent_MakeAvailable(spdlog) FetchContent_MakeAvailable(spdlog)
# ============================================================================= # cpp-httplib — header-only HTTP/HTTPS client replacing libcurl.
# 4. Sources # OpenSSL is required for HTTPS (Wikipedia API). find_package locates
# ============================================================================= # libssl/libcrypto; HTTPLIB_REQUIRE_OPENSSL causes a hard build failure
set(SOURCES # if OpenSSL is absent rather than silently producing an HTTP-only binary.
find_package(OpenSSL REQUIRED)
FetchContent_Declare(
cpp-httplib
GIT_REPOSITORY https://github.com/yhirose/cpp-httplib.git
GIT_TAG v0.43.2
GIT_SHALLOW TRUE
SYSTEM
)
set(HTTPLIB_REQUIRE_OPENSSL ON CACHE BOOL "Require OpenSSL for cpp-httplib" FORCE)
FetchContent_MakeAvailable(cpp-httplib)
# 5. Executable & Sources
add_executable(${PROJECT_NAME}
includes/services/enrichment/mock_enrichment.h
includes/json_handling/pretty_print.h)
# --- Entry point ---
target_sources(${PROJECT_NAME} PRIVATE
src/main.cc src/main.cc
src/biergarten_data_generator/biergarten_data_generator.cc
src/biergarten_data_generator/run.cc
src/biergarten_data_generator/query_cities_with_countries.cc
src/biergarten_data_generator/generate_breweries.cc
src/biergarten_data_generator/log_results.cc
src/services/wikipedia/wikipedia_service.cc
src/services/wikipedia/get_summary.cc
src/services/wikipedia/fetch_extract.cc
src/services/sqlite/sqlite_export_service.cc
src/services/sqlite/build_database_path.cc
src/services/sqlite/process_record.cc
src/services/sqlite/initialize.cc
src/services/sqlite/finalize.cc
src/web_client/curl_global_state.cc
src/web_client/curl_web_client_get.cc
src/web_client/curl_web_client_url_encode.cc
src/data_generation/llama/llama_generator.cc
src/data_generation/llama/generate_brewery.cc
src/data_generation/llama/generate_user.cc
src/data_generation/llama/helpers.cc
src/data_generation/llama/infer.cc
src/data_generation/llama/load.cc
src/data_generation/llama/load_brewery_prompt.cc
src/data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.cc
src/data_generation/mock/deterministic_hash.cc
src/data_generation/mock/generate_brewery.cc
src/data_generation/mock/generate_user.cc
src/json_handling/json_loader.cc
src/services/sqlite/helpers/sqlite_connection_helpers.cpp
src/services/sqlite/helpers/sqlite_statement_helpers.cpp
) )
# ============================================================================= # --- json_handling ---
# 5. Target target_sources(${PROJECT_NAME} PRIVATE
# ============================================================================= src/json_handling/json_loader.cc
add_executable(${PROJECT_NAME} ${SOURCES}) )
# --- application_options ---
target_sources(${PROJECT_NAME} PRIVATE
src/application_options/parse_arguments.cc
)
# --- biergarten_pipeline_orchestrator ---
target_sources(${PROJECT_NAME} PRIVATE
src/biergarten_pipeline_orchestrator/log_results.cc
src/biergarten_pipeline_orchestrator/biergarten_pipeline_orchestrator.cc
src/biergarten_pipeline_orchestrator/generate_breweries.cc
src/biergarten_pipeline_orchestrator/run.cc
src/biergarten_pipeline_orchestrator/query_cities_with_countries.cc
)
# --- web_client ---
target_sources(${PROJECT_NAME} PRIVATE
src/web_client/http_web_client.cc
)
# --- data_generation: prompt_formatting ---
target_sources(${PROJECT_NAME} PRIVATE
src/data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.cc
)
# --- data_generation: mock ---
target_sources(${PROJECT_NAME} PRIVATE
src/data_generation/mock/generate_brewery.cc
src/data_generation/mock/generate_user.cc
src/data_generation/mock/deterministic_hash.cc
)
# --- data_generation: llama (skipped for mock-only builds) ---
if(NOT BIERGARTEN_MOCK_ONLY)
target_sources(${PROJECT_NAME} PRIVATE
src/data_generation/llama/load.cc
src/data_generation/llama/helpers.cc
src/data_generation/llama/generate_brewery.cc
src/data_generation/llama/infer.cc
src/data_generation/llama/llama_generator.cc
src/data_generation/llama/generate_user.cc
)
endif()
# --- services: wikipedia ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/enrichment/wikipedia/wikipedia_service.cc
src/services/enrichment/wikipedia/fetch_extract.cc
src/services/enrichment/wikipedia/get_summary.cc
)
# --- services: sqlite ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/sqlite/process_record.cc
src/services/sqlite/sqlite_export_service.cc
src/services/sqlite/finalize.cc
src/services/sqlite/initialize.cc
src/services/sqlite/helpers/sqlite_connection_helpers.cc
src/services/sqlite/helpers/sqlite_statement_helpers.cc
)
# --- services: logging ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/logging/log_producer.cc
src/services/logging/log_dispatcher.cc
)
# --- services (top-level) ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/prompt_directory.cc
)
# 6. Include Directories, Link Libraries & Compile Definitions
target_include_directories(${PROJECT_NAME} PRIVATE target_include_directories(${PROJECT_NAME} PRIVATE
includes includes
${llama-cpp_SOURCE_DIR}/include
${llama-cpp_SOURCE_DIR}/common
) )
target_link_libraries(${PROJECT_NAME} PRIVATE target_link_libraries(${PROJECT_NAME} PRIVATE
llama $<$<NOT:$<BOOL:${BIERGARTEN_MOCK_ONLY}>>:llama>
boost::di boost::di
Boost::json Boost::json
Boost::program_options Boost::program_options
spdlog::spdlog spdlog::spdlog
sqlite3 sqlite3
CURL::libcurl httplib::httplib
OpenSSL::SSL
OpenSSL::Crypto
) )
# ============================================================================= target_compile_definitions(${PROJECT_NAME} PRIVATE
# 6. Runtime Assets # Defined when -DBIERGARTEN_MOCK_ONLY=ON — skips llama.cpp entirely.
# ============================================================================= # Use #ifdef BIERGARTEN_MOCK_ONLY in source to guard llama-specific code.
$<$<BOOL:${BIERGARTEN_MOCK_ONLY}>:BIERGARTEN_MOCK_ONLY>
# Defined for Debug configuration builds.
# Use #ifdef DEBUG in source to enable debug-only behaviour (e.g. verbose logging).
$<$<CONFIG:Debug>:DEBUG>
)
target_compile_options(biergarten-pipeline PRIVATE
-fmacro-prefix-map=${CMAKE_SOURCE_DIR}/tooling/pipeline/src/=
)
# 7. Runtime Assets
configure_file( configure_file(
${CMAKE_SOURCE_DIR}/locations.json ${CMAKE_SOURCE_DIR}/locations.json
${CMAKE_BINARY_DIR}/locations.json ${CMAKE_BINARY_DIR}/locations.json
COPYONLY COPYONLY
) )
add_custom_command(TARGET ${PROJECT_NAME} POST_BUILD add_custom_command(TARGET ${PROJECT_NAME} POST_BUILD
COMMAND ${CMAKE_COMMAND} -E copy_directory COMMAND ${CMAKE_COMMAND} -E copy_directory
${CMAKE_SOURCE_DIR}/prompts ${CMAKE_SOURCE_DIR}/prompts
${CMAKE_BINARY_DIR}/prompts ${CMAKE_BINARY_DIR}/prompts
) )
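The `BIERGARTEN_MOCK_ONLY` and `DEBUG` compile definitions above only matter if source files branch on them. A minimal sketch of such a guard follows; the helper names `ActiveGeneratorBackend` and `IsDebugBuild` are hypothetical and do not appear in this change.

```cpp
#include <string_view>

// Hypothetical helper (not part of this change): reports which generator
// backend this translation unit was compiled for.
constexpr std::string_view ActiveGeneratorBackend() {
#ifdef BIERGARTEN_MOCK_ONLY
  // Configured with -DBIERGARTEN_MOCK_ONLY=ON: llama sources are not built,
  // so no llama headers may be referenced in this branch.
  return "mock";
#else
  return "llama";
#endif
}

// Hypothetical helper: true only when the DEBUG definition is present,
// which the CMake above adds for the Debug configuration.
constexpr bool IsDebugBuild() {
#ifdef DEBUG
  return true;
#else
  return false;
#endif
}
```

Because both helpers are `constexpr`, callers could select behaviour with `if constexpr` instead of scattering more `#ifdef` blocks through the code.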

@@ -1,83 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_BIERGARTEN_DATA_GENERATOR_H_
#define BIERGARTEN_PIPELINE_INCLUDES_BIERGARTEN_DATA_GENERATOR_H_
/**
* @file biergarten_data_generator.h
* @brief Core orchestration class for pipeline data generation.
*/
#include <memory>
#include <span>
#include <vector>
#include "data_generation/data_generator.h"
#include "data_model/enriched_city.h"
#include "data_model/generated_brewery.h"
#include "data_model/location.h"
#include "services/enrichment_service.h"
#include "services/export_service.h"
/**
* @brief Main data generator class for the Biergarten pipeline.
*
* This class encapsulates the core logic for generating brewery data.
* It handles location loading, city enrichment, and brewery generation.
*/
class BiergartenDataGenerator {
public:
/**
* @brief Construct a BiergartenDataGenerator with injected dependencies.
*
* @param context_service Context provider for sampled locations.
* @param generator Brewery and user data generator.
* @param exporter Storage backend for generated brewery data.
*/
BiergartenDataGenerator(std::unique_ptr<IEnrichmentService> context_service,
std::unique_ptr<DataGenerator> generator,
std::unique_ptr<IExportService> exporter);
/**
* @brief Run the data generation pipeline.
*
* Performs the following steps:
* 1. Load curated locations from JSON
* 2. Resolve context for each city using the injected context service
* 3. Generate brewery data for sampled cities
*
* @return true if successful, false if not
*/
bool Run();
private:
/// @brief Owning context provider dependency.
std::unique_ptr<IEnrichmentService> context_service_;
/// @brief Generator dependency selected in the composition root.
std::unique_ptr<DataGenerator> generator_;
/// @brief Storage backend for generated brewery records.
std::unique_ptr<IExportService> exporter_;
/**
* @brief Load locations from JSON and sample cities.
*
* @return Vector of sampled locations capped at 50 entries.
*/
static std::vector<Location> QueryCitiesWithCountries();
/**
* @brief Generate breweries for enriched cities.
*
* @param cities Span of enriched city data.
*/
void GenerateBreweries(std::span<const EnrichedCity> cities);
/**
* @brief Log the generated brewery results.
*/
void LogResults() const;
/// @brief Stores generated brewery data.
std::vector<GeneratedBrewery> generated_breweries_;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_BIERGARTEN_DATA_GENERATOR_H_

@@ -0,0 +1,102 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_BIERGARTEN_DATA_GENERATOR_H_
#define BIERGARTEN_PIPELINE_INCLUDES_BIERGARTEN_DATA_GENERATOR_H_
/**
* @file biergarten_data_generator.h
* @brief Orchestration for end-to-end brewery data generation pipeline.
*
* Intent: Coordinates location loading, enrichment, and generation phases
* to produce a complete dataset. Dependencies are injected from the composition root.
*/
#include <memory>
#include <span>
#include <vector>
#include "data_generation/data_generator.h"
#include "data_model/generated_models.h"
#include "services/database/export_service.h"
#include "services/enrichment/enrichment_service.h"
#include "services/logging/logger.h"
/**
* @brief Main data generator class for the Biergarten pipeline.
*
* This class encapsulates the core logic for generating brewery data.
* It handles location loading, city enrichment, and brewery generation.
*/
class BiergartenPipelineOrchestrator {
public:
/**
* @brief Constructs the orchestrator with injected pipeline dependencies.
*
* @param logger Shared logger used to emit pipeline messages.
* @param context_service Provides regional context for locations.
* @param generator Implementation (Llama or Mock) for brewery/user generation.
* @param exporter Database backend for persisting generated records.
* @param application_options CLI configuration and paths.
*/
BiergartenPipelineOrchestrator(
std::shared_ptr<ILogger> logger,
std::unique_ptr<IEnrichmentService> context_service,
std::unique_ptr<DataGenerator> generator,
std::unique_ptr<IExportService> exporter,
const ApplicationOptions& application_options);
/**
* @brief Run the data generation pipeline.
*
* Performs the following steps:
* 1. Load curated locations from JSON
* 2. Resolve context for each city using the injected context service
* 3. Generate brewery data for sampled cities
*
* @note STRUCTURAL CONCURRENCY REQUIREMENT:
* When transitioned to a multithreaded design, this method MUST structurally
* enforce that all deployed worker threads are joined before returning (e.g.
* by using std::jthread or a structured concurrency primitive). This ensures
* workers do not attempt to log to a closed channel during application teardown.
*
* @return true if successful, false if not
*/
bool Run();
private:
/// @brief Logger instance for emitting pipeline messages.
std::shared_ptr<ILogger> logger_;
/// @brief Owning context provider dependency.
std::unique_ptr<IEnrichmentService> context_service_;
/// @brief Generator dependency selected in the composition root.
std::unique_ptr<DataGenerator> generator_;
/// @brief Storage backend for generated brewery records.
std::unique_ptr<IExportService> exporter_;
/// @brief CLI configuration: paths, model settings, generation parameters.
ApplicationOptions application_options_;
/**
* @brief Load locations from JSON and sample cities.
*
* @return Vector of sampled locations capped at 50 entries.
*/
std::vector<Location> QueryCitiesWithCountries();
/**
* @brief Generate breweries for enriched cities.
*
* @param cities Span of enriched city data.
*/
void GenerateBreweries(std::span<const EnrichedCity> cities);
/**
* @brief Log the generated brewery results.
*/
void LogResults() const;
/// @brief Stores generated brewery data.
std::vector<GeneratedBrewery> generated_breweries_;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_BIERGARTEN_DATA_GENERATOR_H_
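The structural concurrency note on `Run()` can be sketched as follows. This is an illustration under assumptions, not the planned implementation: `ScopedWorkers` and `RunWithJoinedWorkers` are hypothetical names, and plain `std::thread` with an RAII join is used in place of `std::jthread` so the sketch does not depend on C++20 library support.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical RAII group: every thread it spawns is joined in the
// destructor, so leaving the enclosing scope implies all workers are done.
class ScopedWorkers {
 public:
  ScopedWorkers() = default;
  ScopedWorkers(const ScopedWorkers&) = delete;
  ScopedWorkers& operator=(const ScopedWorkers&) = delete;

  ~ScopedWorkers() {
    for (auto& worker : workers_) {
      if (worker.joinable()) worker.join();
    }
  }

  template <typename Fn>
  void Spawn(Fn&& task) {
    workers_.emplace_back(std::forward<Fn>(task));
  }

 private:
  std::vector<std::thread> workers_;
};

// Stand-in for Run(): returns only after every worker has finished.
std::size_t RunWithJoinedWorkers(std::size_t worker_count) {
  std::atomic<std::size_t> completed{0};
  {
    ScopedWorkers workers;
    for (std::size_t i = 0; i < worker_count; ++i) {
      // A real worker would generate records and log through the channel.
      workers.Spawn([&completed] { completed.fetch_add(1); });
    }
  }  // All workers are joined here, before any log channel is closed.
  assert(completed.load() == worker_count);
  return completed.load();
}
```

The point of the inner scope is that the join is enforced by the language rather than by convention: there is no code path out of `RunWithJoinedWorkers` that leaves a worker running.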

@@ -0,0 +1,73 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_CONCURRENCY_BOUNDED_CHANNEL_H_
#define BIERGARTEN_PIPELINE_INCLUDES_CONCURRENCY_BOUNDED_CHANNEL_H_
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
/**
* @file bounded_channel.h
* @brief Thread-safe, bounded multi-producer/multi-consumer synchronous channel.
*
* Intent: Enables asynchronous inter-thread communication with backpressure.
* Models a synchronous channel where producers/consumers block on capacity limits.
*/
/**
* @class BoundedChannel
* @brief MPMC channel with fixed capacity and blocking semantics.
*
* Producers block when buffer is full; consumers block when empty.
* Close() unblocks all waiters and signals channel exhaustion.
*/
template <typename T>
class BoundedChannel {
// -------------------------------------------------------------------------
// Internal state — all access must be guarded by mutex_.
// -------------------------------------------------------------------------
std::queue<T> queue_;
std::mutex mutex_;
std::condition_variable not_full_;
std::condition_variable not_empty_;
std::size_t capacity_;
bool closed_ = false;
public:
/**
* @brief Construct a bounded channel with the given capacity.
* @param capacity Maximum number of items the channel may hold.
*/
explicit BoundedChannel(std::size_t capacity) : capacity_(capacity) {}
/**
* @brief Send an item into the channel. Blocks while the channel is full.
* If the channel is closed, the item is discarded.
* @param item Item to enqueue; taken by value and moved into the queue.
*/
void Send(T item);
/**
* @brief Receive an item from the channel. Blocks when the channel is
* empty.
* @return std::optional<T> containing the item, or std::nullopt when the
* channel is closed and drained.
*/
std::optional<T> Receive();
/**
* @brief Close the channel and unblock all waiting threads. Idempotent.
*/
void Close();
};
// Include the template implementation
#include "bounded_channel.tcc"
#endif // BIERGARTEN_PIPELINE_INCLUDES_CONCURRENCY_BOUNDED_CHANNEL_H_

@@ -0,0 +1,57 @@
#include <utility>  // std::move
#include "bounded_channel.h"
template <typename T>
void BoundedChannel<T>::Send(T item) {
// Acquire exclusive ownership of the mutex; released automatically on scope exit.
std::unique_lock lock(mutex_);
// Block until there is space in the queue or the channel has been closed.
// The predicate guards against spurious wakeups.
not_full_.wait(lock, [&] { return queue_.size() < capacity_ || closed_; });
// If the channel was closed while waiting, discard the item and return.
if (closed_) return;
// Move the item into the queue to avoid an unnecessary copy.
queue_.push(std::move(item));
// Wake one blocked Receive() call to signal that data is now available.
not_empty_.notify_one();
}
template <typename T>
std::optional<T> BoundedChannel<T>::Receive() {
// Acquire exclusive ownership of the mutex.
std::unique_lock lock(mutex_);
// Block until the queue is non-empty or the channel has been closed.
// The predicate guards against spurious wakeups.
not_empty_.wait(lock, [&] { return !queue_.empty() || closed_; });
// If woken due to closure and no items remain, signal exhaustion via nullopt.
if (queue_.empty()) return std::nullopt;
// Move the front item out of the queue to avoid an unnecessary copy.
T item = std::move(queue_.front());
queue_.pop();
// Wake one blocked Send() call to signal that a slot has opened.
not_full_.notify_one();
return item;
}
template <typename T>
void BoundedChannel<T>::Close() {
// Acquire exclusive ownership of the mutex to ensure visibility of the flag.
std::unique_lock lock(mutex_);
// Mark the channel as closed; subsequent Send() calls will be dropped.
closed_ = true;
// Wake all blocked Send() callers so they can observe the closed flag and exit.
not_full_.notify_all();
// Wake all blocked Receive() callers so they can drain remaining items or return nullopt.
not_empty_.notify_all();
}
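A usage sketch for the channel may help reviewers check the close-and-drain behaviour. The class below is a condensed copy of the `BoundedChannel` in this change, inlined so the example compiles on its own; `SumThroughChannel` is a hypothetical driver and not part of the PR.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>

// Condensed copy of the BoundedChannel from this change.
template <typename T>
class BoundedChannel {
 public:
  explicit BoundedChannel(std::size_t capacity) : capacity_(capacity) {}

  void Send(T item) {
    std::unique_lock lock(mutex_);
    not_full_.wait(lock, [&] { return queue_.size() < capacity_ || closed_; });
    if (closed_) return;  // Items sent after Close() are dropped.
    queue_.push(std::move(item));
    not_empty_.notify_one();
  }

  std::optional<T> Receive() {
    std::unique_lock lock(mutex_);
    not_empty_.wait(lock, [&] { return !queue_.empty() || closed_; });
    if (queue_.empty()) return std::nullopt;  // Closed and fully drained.
    T item = std::move(queue_.front());
    queue_.pop();
    not_full_.notify_one();
    return item;
  }

  void Close() {
    std::unique_lock lock(mutex_);
    closed_ = true;
    not_full_.notify_all();
    not_empty_.notify_all();
  }

 private:
  std::queue<T> queue_;
  std::mutex mutex_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
  std::size_t capacity_;
  bool closed_ = false;
};

// Hypothetical driver: one producer sends 1..item_count and then closes the
// channel; the calling thread consumes until Receive() reports exhaustion.
int SumThroughChannel(int item_count, std::size_t capacity) {
  assert(capacity > 0);  // A zero-capacity channel would block Send() forever.
  BoundedChannel<int> channel(capacity);
  std::thread producer([&] {
    for (int value = 1; value <= item_count; ++value) channel.Send(value);
    channel.Close();
  });
  int sum = 0;
  while (auto item = channel.Receive()) sum += *item;
  producer.join();
  return sum;
}
```

With a small capacity such as 2, the producer is repeatedly blocked in `Send()` until the consumer catches up, which is the backpressure the header comment describes. Items already queued when `Close()` is called are still delivered, because `Receive()` returns `std::nullopt` only once the queue is empty.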

@@ -8,9 +8,7 @@
#include <string> #include <string>
#include "data_model/brewery_result.h" #include "data_model/generated_models.h"
#include "data_model/location.h"
#include "data_model/user_result.h"
/** /**
* @brief Interface for data generator implementations. * @brief Interface for data generator implementations.

@@ -14,9 +14,11 @@
#include <string> #include <string>
#include <string_view> #include <string_view>
#include "../services/prompting/prompt_directory.h"
#include "data_generation/data_generator.h" #include "data_generation/data_generator.h"
#include "data_generation/prompt_formatting/prompt_formatter.h" #include "data_generation/prompt_formatting/prompt_formatter.h"
#include "data_model/application_options.h" #include "data_model/models.h"
#include "services/logging/logger.h"
struct llama_model; struct llama_model;
struct llama_context; struct llama_context;
@@ -33,10 +35,12 @@ class LlamaGenerator final : public DataGenerator {
* @param options Parsed application options. * @param options Parsed application options.
* @param model_path Filesystem path to GGUF model assets. * @param model_path Filesystem path to GGUF model assets.
* @param prompt_formatter Formatter that produces model-specific prompts. * @param prompt_formatter Formatter that produces model-specific prompts.
* @param prompt_directory Directory service for loading named prompt files.
*/ */
LlamaGenerator(const ApplicationOptions& options, LlamaGenerator(const ApplicationOptions& options,
const std::string& model_path, const std::string& model_path, std::shared_ptr<ILogger> logger,
std::unique_ptr<IPromptFormatter> prompt_formatter); std::unique_ptr<IPromptFormatter> prompt_formatter,
std::unique_ptr<IPromptDirectory> prompt_directory);
~LlamaGenerator() override; ~LlamaGenerator() override;
@@ -119,15 +123,6 @@ class LlamaGenerator final : public DataGenerator {
int max_tokens = kDefaultMaxTokens, int max_tokens = kDefaultMaxTokens,
std::string_view grammar = {}); std::string_view grammar = {});
/**
* @brief Loads the brewery system prompt from disk.
*
* @param prompt_file_path Prompt file path to try first.
* @return Loaded prompt text.
*/
std::string LoadBrewerySystemPrompt(
const std::filesystem::path& prompt_file_path);
ModelHandle model_; ModelHandle model_;
ContextHandle context_; ContextHandle context_;
float sampling_temperature_ = 1.0F; float sampling_temperature_ = 1.0F;
@@ -135,8 +130,10 @@ class LlamaGenerator final : public DataGenerator {
uint32_t sampling_top_k_ = kDefaultSamplingTopK; uint32_t sampling_top_k_ = kDefaultSamplingTopK;
std::mt19937 rng_; std::mt19937 rng_;
uint32_t n_ctx_ = kDefaultContextSize; uint32_t n_ctx_ = kDefaultContextSize;
std::string brewery_system_prompt_; int n_gpu_layers_ = 0;
std::shared_ptr<ILogger> logger_;
std::unique_ptr<IPromptFormatter> prompt_formatter_; std::unique_ptr<IPromptFormatter> prompt_formatter_;
std::unique_ptr<IPromptDirectory> prompt_directory_;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_LLAMA_GENERATOR_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_LLAMA_GENERATOR_H_

@@ -12,7 +12,7 @@
#include <string> #include <string>
#include <string_view> #include <string_view>
#include "data_model/brewery_result.h" #include "data_model/generated_models.h"
struct llama_vocab; struct llama_vocab;
using llama_token = int32_t; using llama_token = int32_t;

@@ -44,6 +44,13 @@ class MockGenerator final : public DataGenerator {
*/ */
static size_t DeterministicHash(const Location& location); static size_t DeterministicHash(const Location& location);
// Hash stride constants for deterministic distribution across fixed-size
// arrays. These coprime strides spread hash values uniformly without
// clustering, ensuring diverse output across different hash inputs.
static constexpr size_t kNounHashStride = 7;
static constexpr size_t kDescriptionHashStride = 13;
static constexpr size_t kBioHashStride = 11;
static constexpr std::array<std::string_view, 18> kBreweryAdjectives = { static constexpr std::array<std::string_view, 18> kBreweryAdjectives = {
"Craft", "Heritage", "Local", "Artisan", "Pioneer", "Golden", "Craft", "Heritage", "Local", "Artisan", "Pioneer", "Golden",
"Modern", "Classic", "Summit", "Northern", "Riverstone", "Barrel", "Modern", "Classic", "Summit", "Northern", "Riverstone", "Barrel",
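The comment on the stride constants claims that coprime strides avoid clustering. The diff does not show how the strides are applied, so the sketch below assumes a multiply-then-modulo scheme; `StridedIndex` and `DistinctIndices` are hypothetical helpers written only to illustrate why coprimality with the table size matters.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumed indexing scheme (not confirmed by the diff): scale the hash by the
// stride, then wrap it into the table.
constexpr std::size_t StridedIndex(std::size_t hash, std::size_t stride,
                                   std::size_t table_size) {
  return (hash * stride) % table_size;
}

// Counts how many distinct slots are reachable from consecutive hash values.
std::size_t DistinctIndices(std::size_t stride, std::size_t table_size) {
  assert(table_size > 0);
  std::set<std::size_t> seen;
  for (std::size_t hash = 0; hash < table_size; ++hash) {
    seen.insert(StridedIndex(hash, stride, table_size));
  }
  return seen.size();
}
```

Under this assumption, a stride of 7 against the 18-entry adjective table reaches all 18 slots, whereas a stride that shares a factor with 18, such as 6, reaches only 3.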

@@ -1,4 +1,5 @@
#pragma once #ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_GEMMA4_JINJA_PROMPT_FORMATTER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_GEMMA4_JINJA_PROMPT_FORMATTER_H_
#include <string> #include <string>
#include <string_view> #include <string_view>
@@ -13,3 +14,5 @@ class Gemma4JinjaPromptFormatter final : public IPromptFormatter {
[[nodiscard]] std::string Format(std::string_view system_prompt, [[nodiscard]] std::string Format(std::string_view system_prompt,
std::string_view user_prompt) const override; std::string_view user_prompt) const override;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_GEMMA4_JINJA_PROMPT_FORMATTER_H_

@@ -1,4 +1,5 @@
#pragma once #ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_PROMPT_FORMATTER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_PROMPT_FORMATTER_H_
#include <string> #include <string>
#include <string_view> #include <string_view>
@@ -15,3 +16,5 @@ class IPromptFormatter {
[[nodiscard]] virtual std::string Format( [[nodiscard]] virtual std::string Format(
std::string_view system_prompt, std::string_view user_prompt) const = 0; std::string_view system_prompt, std::string_view user_prompt) const = 0;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_PROMPT_FORMATTER_H_

@@ -1,42 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_APPLICATION_OPTIONS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_APPLICATION_OPTIONS_H_
/**
* @file data_model/application_options.h
* @brief Program options for the Biergarten pipeline application.
*/
#include <cstdint>
#include <string>
/**
* @brief Program options for the Biergarten pipeline application.
*/
struct ApplicationOptions {
/// @brief Path to the LLM model file (gguf format); mutually exclusive with
/// use_mocked.
std::string model_path;
/// @brief Use mocked generator instead of LLM; mutually exclusive with
/// model_path.
bool use_mocked = false;
/// @brief LLM sampling temperature (0.0 to 1.0, higher = more random).
float temperature = 1.0F;
/// @brief LLM nucleus sampling top-p parameter (0.0 to 1.0, higher = more
/// random).
float top_p = 0.95F;
/// @brief LLM top-k sampling parameter.
uint32_t top_k = 64;
/// @brief Context window size (tokens) for LLM inference. Higher values
/// support longer prompts but use more memory.
uint32_t n_ctx = 8192;
/// @brief Random seed for sampling (-1 for random, otherwise non-negative).
int seed = -1;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_APPLICATION_OPTIONS_H_

@@ -1,22 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_LOCATION_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_LOCATION_H_
/**
* @file data_model/brewery_location.h
* @brief Non-owning brewery location input.
*/
#include <string_view>
/**
* @brief Non-owning brewery location input.
*/
struct BreweryLocation {
/// @brief City name.
std::string_view city_name;
/// @brief Country name.
std::string_view country_name;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_LOCATION_H_

@@ -1,28 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_RESULT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_RESULT_H_
/**
* @file data_model/brewery_result.h
* @brief Generated brewery payload.
*/
#include <string>
/**
* @brief Generated brewery payload.
*/
struct BreweryResult {
/// @brief Brewery display name in English.
std::string name_en;
/// @brief Brewery description text in English.
std::string description_en;
/// @brief Brewery display name in the local language.
std::string name_local;
/// @brief Brewery description text in the local language.
std::string description_local;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_RESULT_H_

@@ -1,21 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_ENRICHED_CITY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_ENRICHED_CITY_H_
/**
* @file data_model/enriched_city.h
* @brief Enriched city data with Wikipedia context.
*/
#include <string>
#include "data_model/location.h"
/**
* @brief Enriched city data with Wikipedia context.
*/
struct EnrichedCity {
Location location;
std::string region_context{};
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_ENRICHED_CITY_H_

@@ -1,20 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_BREWERY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_BREWERY_H_
/**
* @file data_model/generated_brewery.h
* @brief Helper struct to store generated brewery data.
*/
#include "data_model/brewery_result.h"
#include "data_model/location.h"
/**
* @brief Helper struct to store generated brewery data.
*/
struct GeneratedBrewery {
Location location;
BreweryResult brewery;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_BREWERY_H_

@@ -0,0 +1,66 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_MODELS_H_
/**
* @file data_model/generated_models.h
* @brief Generated output models from the pipeline: brewery/user results, enriched data,
* and complete generation results.
*/
#include <string>
#include "data_model/models.h"
// ============================================================================
// Generation Output Models
// ============================================================================
/**
* @brief Generated brewery payload.
*/
struct BreweryResult {
/// @brief Brewery display name in English.
std::string name_en;
/// @brief Brewery description text in English.
std::string description_en;
/// @brief Brewery display name in the local language.
std::string name_local;
/// @brief Brewery description text in the local language.
std::string description_local;
};
/**
* @brief Generated user profile payload.
*/
struct UserResult {
/// @brief Username handle.
std::string username{};
/// @brief Short user biography.
std::string bio{};
};
// ============================================================================
// Pipeline Data Models
// ============================================================================
/**
* @brief Enriched city data with Wikipedia context.
*/
struct EnrichedCity {
Location location;
std::string region_context{};
};
/**
* @brief Helper struct to store generated brewery data.
*/
struct GeneratedBrewery {
Location location;
BreweryResult brewery;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_MODELS_H_

@@ -1,13 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATION_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATION_MODELS_H_
/**
* @file data_model/generation_models.h
* @brief Convenience include for shared generation payload models.
*/
#include "data_model/brewery_location.h"
#include "data_model/brewery_result.h"
#include "data_model/user_result.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATION_MODELS_H_

@@ -1,41 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_LOCATION_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_LOCATION_H_
/**
* @file data_model/location.h
* @brief Location data model used throughout generation pipeline.
*/
#include <string>
#include <vector>
/**
* @brief Canonical location record for city-level generation.
*/
struct Location {
/// @brief City name.
std::string city{};
/// @brief State or province name.
std::string state_province{};
/// @brief ISO 3166-2 subdivision code.
std::string iso3166_2{};
/// @brief Country name.
std::string country{};
/// @brief ISO 3166-1 country code.
std::string iso3166_1{};
/// @brief Local language codes in priority order.
std::vector<std::string> local_languages{};
/// @brief Latitude in decimal degrees.
double latitude{};
/// @brief Longitude in decimal degrees.
double longitude{};
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_LOCATION_H_

@@ -0,0 +1,145 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_MODELS_H_
/**
* @file data_model/models.h
* @brief Core data models: locations, application configuration, and generation
* inputs.
*/
#include <boost/program_options.hpp>
#include <cstdint>
#include <filesystem>
#include <memory>
#include <optional>
#include <string>
#include <string_view>
#include <vector>
class ILogger;
namespace prog_opts = boost::program_options;
// ============================================================================
// Location Models
// ============================================================================
/**
* @brief Canonical location record for city-level generation.
*/
struct Location {
/// @brief City name.
std::string city{};
/// @brief State or province name.
std::string state_province{};
/// @brief ISO 3166-2 subdivision code.
std::string iso3166_2{};
/// @brief Country name.
std::string country{};
/// @brief ISO 3166-1 country code.
std::string iso3166_1{};
/// @brief Local language codes in priority order.
std::vector<std::string> local_languages{};
/// @brief Latitude in decimal degrees.
double latitude{};
/// @brief Longitude in decimal degrees.
double longitude{};
};
/**
* @brief Non-owning brewery location input.
*/
struct BreweryLocation {
/// @brief City name.
std::string_view city_name;
/// @brief Country name.
std::string_view country_name;
};
// ============================================================================
// Configuration Models
// ============================================================================
/**
* @brief LLM sampling parameters.
*/
struct SamplingOptions {
/// @brief LLM sampling temperature (0.0 to 1.0, higher = more random).
float temperature = 1.0F;
/// @brief LLM nucleus sampling top-p parameter.
float top_p = 0.95F;
/// @brief LLM top-k sampling parameter.
uint32_t top_k = 64;
/// @brief Context window size (tokens).
uint32_t n_ctx = 8192;
/// @brief Random seed (-1 for random, otherwise non-negative).
int seed = -1;
/// @brief Number of layers to offload to GPU.
int n_gpu_layers = 0;
};
/**
* @brief Configuration for the LLM generator component.
*/
struct GeneratorOptions {
/// @brief Path to the LLM model file (gguf format).
std::filesystem::path model_path;
/// @brief Use mocked generator instead of actual LLM inference.
bool use_mocked = false;
/// @brief Specific sampling parameters for this generator.
/// If nullopt, the application should use global defaults.
std::optional<SamplingOptions> sampling;
};
/**
* @brief Configuration for the pipeline execution and output.
*/
struct PipelineOptions {
/// @brief Directory for generated artifacts.
std::filesystem::path output_path;
/// @brief Directory that contains named prompt files (e.g.
/// BREWERY_GENERATION.md).
std::filesystem::path prompt_dir;
/// @brief Path for application logs.
std::filesystem::path log_path;
/// @brief Number of locations to sample from the dataset.
/// More locations produce more users and more breweries.
uint32_t location_count;
};
/**
* @brief Root configuration object for the Biergarten pipeline.
*/
struct ApplicationOptions {
GeneratorOptions generator;
PipelineOptions pipeline;
};
// ============================================================================
// Function Declarations
// ============================================================================
std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv,
std::shared_ptr<ILogger> logger = nullptr);
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_MODELS_H_
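The comment on `GeneratorOptions::sampling` says a `nullopt` value should fall back to global defaults. One way that could look is sketched below, using trimmed copies of the structs from this header; `ResolveSampling` is a hypothetical helper and is not declared anywhere in this change.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <optional>

// Trimmed copies of the structs from data_model/models.h.
struct SamplingOptions {
  float temperature = 1.0F;
  float top_p = 0.95F;
  std::uint32_t top_k = 64;
  std::uint32_t n_ctx = 8192;
  int seed = -1;
  int n_gpu_layers = 0;
};

struct GeneratorOptions {
  std::filesystem::path model_path;
  bool use_mocked = false;
  std::optional<SamplingOptions> sampling;
};

// Hypothetical helper: returns the generator-specific sampling parameters,
// or the struct's member defaults when none were supplied.
SamplingOptions ResolveSampling(const GeneratorOptions& generator) {
  SamplingOptions resolved = generator.sampling.value_or(SamplingOptions{});
  assert(resolved.n_ctx > 0);  // A zero-token context window is never usable.
  return resolved;
}
```

Treating the default member initializers as the "global defaults" is an assumption; if the defaults are meant to come from elsewhere (for example a config file), the fallback argument to `value_or` would change accordingly.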

@@ -1,12 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_PIPELINE_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_PIPELINE_MODELS_H_
/**
* @file data_model/pipeline_models.h
* @brief Convenience include for pipeline-specific data models.
*/
#include "data_model/enriched_city.h"
#include "data_model/generated_brewery.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_PIPELINE_MODELS_H_

@@ -1,22 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_USER_RESULT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_USER_RESULT_H_
/**
* @file data_model/user_result.h
* @brief Generated user profile payload.
*/
#include <string>
/**
* @brief Generated user profile payload.
*/
struct UserResult {
/// @brief Username handle.
std::string username{};
/// @brief Short user biography.
std::string bio{};
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_USER_RESULT_H_

@@ -7,16 +7,19 @@
*/ */
#include <filesystem> #include <filesystem>
#include <memory>
#include <vector> #include <vector>
#include "data_model/location.h" #include "data_model/models.h"
#include "services/logging/logger.h"
/// @brief Loads curated world locations from a JSON file into memory. /// @brief Loads curated world locations from a JSON file into memory.
class JsonLoader { class JsonLoader {
public: public:
/// @brief Parses a JSON array file and returns all location records. /// @brief Parses a JSON array file and returns all location records.
static std::vector<Location> LoadLocations( static std::vector<Location> LoadLocations(
const std::filesystem::path& filepath); const std::filesystem::path& filepath,
std::shared_ptr<ILogger> logger = nullptr);
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_JSON_HANDLING_JSON_LOADER_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_JSON_HANDLING_JSON_LOADER_H_

@@ -0,0 +1,109 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_JSON_HANDLING_PRETTY_PRINT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_JSON_HANDLING_PRETTY_PRINT_H_
/**
* @file json_handling/pretty_print.h
* @brief Pretty-printing utilities for JSON values.
*
* Provides formatting capability for boost::json::value with indentation and
* readable output. Adapted from Boost JSON library examples.
*/
#include <boost/json.hpp>
#include <iterator>
#include <ostream>
#include <string>
/**
* @brief Pretty-prints a JSON value to an output stream with indentation.
*
* Recursively formats JSON objects and arrays with consistent 4-space
* indentation. Adapted from:
* https://raw.githubusercontent.com/boostorg/json/refs/heads/develop/example/pretty.cpp
*
* @param outstream Output stream to write formatted JSON.
* @param json_val JSON value to format.
* @param indent Optional indentation string (managed internally on first call).
*/
inline void PrettyPrint(std::ostream& outstream,
boost::json::value const& json_val,
std::string* indent = nullptr) {
std::string str;
if (indent == nullptr) {
indent = &str;
}
switch (json_val.kind()) {
case boost::json::kind::object: {
outstream << "{\n";
indent->append(4, ' ');
auto const& obj = json_val.get_object();
if (!obj.empty()) {
const auto* iter = obj.begin();
for (;;) {
outstream << *indent << boost::json::serialize(iter->key()) << " : ";
PrettyPrint(outstream, iter->value(), indent);
iter = std::next(iter);
if (iter == obj.end()) {
break;
}
outstream << ",\n";
}
}
outstream << "\n";
indent->resize(indent->size() - 4);
outstream << *indent << "}";
break;
}
case boost::json::kind::array: {
outstream << "[\n";
indent->append(4, ' ');
auto const& arr = json_val.get_array();
if (!arr.empty()) {
const auto* iter = arr.begin();
for (;;) {
outstream << *indent;
PrettyPrint(outstream, *iter, indent);
iter = std::next(iter);
if (iter == arr.end()) {
break;
}
outstream << ",\n";
}
}
outstream << "\n";
indent->resize(indent->size() - 4);
outstream << *indent << "]";
break;
}
case boost::json::kind::string: {
outstream << boost::json::serialize(json_val.get_string());
break;
}
case boost::json::kind::uint64:
case boost::json::kind::int64:
case boost::json::kind::double_:
outstream << json_val;
break;
case boost::json::kind::bool_:
if (json_val.get_bool()) {
outstream << "true";
} else {
outstream << "false";
}
break;
case boost::json::kind::null:
outstream << "null";
break;
}
if (indent->empty()) {
outstream << "\n";
}
}
#endif // BIERGARTEN_PIPELINE_INCLUDES_JSON_HANDLING_PRETTY_PRINT_H_


@@ -1,12 +1,14 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_EXPORT_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_EXPORT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_EXPORT_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_EXPORT_SERVICE_H_
/** /**
* @file services/export_service.h * @file services/export_service.h
* @brief Abstraction for persisting generated brewery data. * @brief Abstraction for persisting generated brewery data.
*/ */
#include "data_model/generated_brewery.h" #include <cstdint>
#include "data_model/generated_models.h"
/** /**
* @brief Interface for services that persist generated brewery records. * @brief Interface for services that persist generated brewery records.
@@ -37,4 +39,4 @@ class IExportService {
virtual void Finalize() = 0; virtual void Finalize() = 0;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_EXPORT_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_EXPORT_SERVICE_H_


@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_CONNECTION_HELPERS_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_CONNECTION_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_CONNECTION_HELPERS_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_CONNECTION_HELPERS_H_
/** /**
* @file services/sqlite_connection_helpers.h * @file services/sqlite_connection_helpers.h
@@ -7,11 +7,12 @@
*/ */
#include <sqlite3.h> #include <sqlite3.h>
#include <filesystem> #include <filesystem>
#include <string> #include <string>
#include <string_view> #include <string_view>
#include "services/sqlite_handle_types.h" #include "sqlite_handle_types.h"
namespace sqlite_export_service_internal { namespace sqlite_export_service_internal {
@@ -26,6 +27,4 @@ void RollbackTransactionNoThrow(const SqliteDatabaseHandle& db_handle) noexcept;
} // namespace sqlite_export_service_internal } // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_CONNECTION_HELPERS_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_CONNECTION_HELPERS_H_


@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_H_
/** /**
* @file services/sqlite_export_service.h * @file services/sqlite_export_service.h
@@ -11,16 +11,17 @@
#include <string> #include <string>
#include <unordered_map> #include <unordered_map>
#include "services/date_time_provider.h" #include "data_model/models.h"
#include "services/export_service.h" #include "../datetime/date_time_provider.h"
#include "services/sqlite_export_service_helpers.h" #include "export_service.h"
#include "sqlite_export_service_helpers.h"
/** /**
* @brief Persists generated brewery records into a fresh SQLite database. * @brief Persists generated brewery records into a fresh SQLite database.
*/ */
class SqliteExportService final : public IExportService { class SqliteExportService final : public IExportService {
public: public:
SqliteExportService(); explicit SqliteExportService(const ApplicationOptions& options);
~SqliteExportService() override; ~SqliteExportService() override;
SqliteExportService(const SqliteExportService&) = delete; SqliteExportService(const SqliteExportService&) = delete;
@@ -41,12 +42,12 @@ class SqliteExportService final : public IExportService {
void InitializeSchema() const; void InitializeSchema() const;
void PrepareStatements(); void PrepareStatements();
void RollbackAndCloseNoThrow() noexcept; void RollbackAndCloseNoThrow() noexcept;
void FinalizeStatements() noexcept;
[[nodiscard]] std::filesystem::path BuildDatabasePath() const; [[nodiscard]] std::filesystem::path BuildDatabasePath() const;
[[nodiscard]] static std::string BuildLocationKey(const Location& location); [[nodiscard]] static std::string BuildLocationKey(const Location& location);
std::unique_ptr<IDateTimeProvider> date_time_provider_; std::unique_ptr<IDateTimeProvider> date_time_provider_;
std::filesystem::path output_path_;
std::string run_timestamp_utc_; std::string run_timestamp_utc_;
std::filesystem::path database_path_; std::filesystem::path database_path_;
SqliteDatabaseHandle db_handle_; SqliteDatabaseHandle db_handle_;
@@ -56,4 +57,4 @@ class SqliteExportService final : public IExportService {
std::unordered_map<std::string, sqlite3_int64> location_cache_; std::unordered_map<std::string, sqlite3_int64> location_cache_;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_H_


@@ -0,0 +1,10 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_HELPERS_H_
/* Umbrella header for backward compatibility. */
#include "sqlite_connection_helpers.h"
#include "sqlite_handle_types.h"
#include "sqlite_statement_helpers.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_HELPERS_H_


@@ -1,11 +1,12 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_HANDLE_TYPES_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_HANDLE_TYPES_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_HANDLE_TYPES_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_HANDLE_TYPES_H_
/** /**
* Shared handle and parameter type declarations used by SQLite helper units. * Shared handle and parameter type declarations used by SQLite helper units.
*/ */
#include <sqlite3.h> #include <sqlite3.h>
#include <memory> #include <memory>
#include <string_view> #include <string_view>
@@ -32,5 +33,4 @@ struct BindParam {
} // namespace sqlite_export_service_internal } // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_HANDLE_TYPES_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_HANDLE_TYPES_H_


@@ -1,17 +1,19 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_STATEMENT_HELPERS_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_STATEMENT_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_STATEMENT_HELPERS_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_STATEMENT_HELPERS_H_
/** /**
* @file services/sqlite_statement_helpers.h * @file services/sqlite_statement_helpers.h
* @brief Declarations for statement-level SQLite helper functions and constants. * @brief Declarations for statement-level SQLite helper functions and
* constants.
*/ */
#include <sqlite3.h> #include <sqlite3.h>
#include <string> #include <string>
#include <string_view> #include <string_view>
#include <vector> #include <vector>
#include "services/sqlite_handle_types.h" #include "sqlite_handle_types.h"
namespace sqlite_export_service_internal { namespace sqlite_export_service_internal {
@@ -107,10 +109,8 @@ void StepStatement(const SqliteDatabaseHandle& db_handle,
sqlite3_int64 LastInsertRowId(const SqliteDatabaseHandle& db_handle); sqlite3_int64 LastInsertRowId(const SqliteDatabaseHandle& db_handle);
std::string SerializeLocalLanguages(const std::vector<std::string>& local_languages);
std::string SerializeVector(const std::vector<std::string>& str_vec); std::string SerializeVector(const std::vector<std::string>& str_vec);
} // namespace sqlite_export_service_internal } // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_STATEMENT_HELPERS_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_STATEMENT_HELPERS_H_


@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATE_TIME_PROVIDER_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_DATE_TIME_PROVIDER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATE_TIME_PROVIDER_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_DATE_TIME_PROVIDER_H_
/** /**
* @file services/date_time_provider.h * @file services/date_time_provider.h
@@ -63,4 +63,4 @@ class SystemDateTimeProvider final : public IDateTimeProvider {
} }
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATE_TIME_PROVIDER_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_DATE_TIME_PROVIDER_H_


@@ -0,0 +1,35 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_TIMER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_TIMER_H_
#include <chrono>
#include <cstdint>
/**
* @file services/datetime/timer.h
* @brief Simple timer utility for measuring elapsed time.
*/
class Timer {
std::chrono::steady_clock::time_point start_time =
std::chrono::steady_clock::now();
public:
Timer(const Timer&) = delete;
Timer& operator=(const Timer&) = delete;
Timer(Timer&&) = delete;
Timer& operator=(Timer&&) = delete;
Timer() = default;
~Timer() = default;
/// @brief Returns the milliseconds elapsed since construction or the last Reset().
[[nodiscard]] int64_t Elapsed() const {
return std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::steady_clock::now() - start_time)
.count();
}
/// @brief Restarts the timer and returns the milliseconds elapsed before the restart.
[[nodiscard]] int64_t Reset() {
auto previous_elapsed = Elapsed();
start_time = std::chrono::steady_clock::now();
return previous_elapsed;
}
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_TIMER_H_
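
The Timer above reads `std::chrono::steady_clock`, which is monotonic, so measurements are not disturbed by wall-clock adjustments. The same measurement pattern can be sketched as a free function. `MeasureMillisecondsSketch` is a hypothetical helper written for illustration only; it is not part of the pipeline.

```cpp
#include <chrono>
#include <cstdint>
#include <utility>

// Hypothetical helper (not part of the pipeline): runs `work` once and returns
// the elapsed time in milliseconds, measured on the monotonic steady clock.
template <typename Callable>
std::int64_t MeasureMillisecondsSketch(Callable&& work) {
  const auto start = std::chrono::steady_clock::now();
  std::forward<Callable>(work)();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start)
      .count();
}
```

A caller would pass a lambda wrapping the phase to measure, much as a `Timer` instance would be constructed before the phase and queried with `Elapsed()` after it.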


@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_ENRICHMENT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_ENRICHMENT_SERVICE_H_
/** /**
* @file services/enrichment_service.h * @file services/enrichment_service.h
@@ -8,7 +8,7 @@
#include <string> #include <string>
#include "data_model/location.h" #include "data_model/models.h"
/** /**
* @brief Interface for services that can enrich a location with context. * @brief Interface for services that can enrich a location with context.
@@ -27,4 +27,4 @@ class IEnrichmentService {
virtual std::string GetLocationContext(const Location& loc) = 0; virtual std::string GetLocationContext(const Location& loc) = 0;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_ENRICHMENT_SERVICE_H_


@@ -0,0 +1,17 @@
//
// Created by aaronpo on 13/05/2026.
//
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_MOCK_ENRICHMENT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_MOCK_ENRICHMENT_H_
#include <string>
#include "enrichment_service.h"
class MockEnrichmentService final : public IEnrichmentService {
public:
std::string GetLocationContext(const Location& /*loc*/) override {
return {};
}
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_MOCK_ENRICHMENT_H_


@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_WIKIPEDIA_SERVICE_H_ #ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_WIKIPEDIA_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_WIKIPEDIA_SERVICE_H_ #define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_WIKIPEDIA_SERVICE_H_
/** /**
* @file services/wikipedia_service.h * @file services/wikipedia_service.h
@@ -11,14 +11,16 @@
#include <string_view> #include <string_view>
#include <unordered_map> #include <unordered_map>
#include "services/enrichment_service.h" #include "enrichment_service.h"
#include "services/logging/logger.h"
#include "web_client/web_client.h" #include "web_client/web_client.h"
/// @brief Provides Wikipedia summary lookups backed by cached raw extracts. /// @brief Provides Wikipedia summary lookups backed by cached raw extracts.
class WikipediaService final : public IEnrichmentService { class WikipediaEnrichmentService final : public IEnrichmentService {
public: public:
/// @brief Creates a new Wikipedia service with the provided web client. /// @brief Creates a new Wikipedia service with the provided web client.
explicit WikipediaService(std::unique_ptr<WebClient> client); explicit WikipediaEnrichmentService(std::unique_ptr<WebClient> client,
std::shared_ptr<ILogger> logger);
/// @brief Returns the Wikipedia-derived context for a location. /// @brief Returns the Wikipedia-derived context for a location.
[[nodiscard]] std::string GetLocationContext(const Location& loc) override; [[nodiscard]] std::string GetLocationContext(const Location& loc) override;
@@ -26,8 +28,9 @@ class WikipediaService final : public IEnrichmentService {
private: private:
std::string FetchExtract(std::string_view query); std::string FetchExtract(std::string_view query);
std::unique_ptr<WebClient> client_; std::unique_ptr<WebClient> client_;
std::shared_ptr<ILogger> logger_;
/// @brief Canonical cache for raw Wikipedia query extracts. /// @brief Canonical cache for raw Wikipedia query extracts.
std::unordered_map<std::string, std::string> extract_cache_; std::unordered_map<std::string, std::string> extract_cache_;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_WIKIPEDIA_SERVICE_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_WIKIPEDIA_SERVICE_H_


@@ -0,0 +1,53 @@
/**
* @file services/logging/log_dispatcher.h
* @brief Dedicated log dispatcher for asynchronous pipeline logging.
*
* The dispatcher drains LogEntry values from a bounded channel and forwards
* them to spdlog on a dedicated thread.
*/
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOG_DISPATCHER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOG_DISPATCHER_H_
#include <spdlog/spdlog.h>
#include "concurrency/bounded_channel.h"
#include "services/logging/log_entry.h"
/**
* @class LogDispatcher
* @brief Consumes log entries from a channel and forwards them to spdlog.
*
* Non-copyable and non-movable. Intended to run on its own dedicated thread
* and exit once the channel has been closed and drained.
*/
class LogDispatcher {
public:
/**
* @brief Construct a log dispatcher.
*
* @param channel Reference to the bounded channel used for log retrieval.
*/
explicit LogDispatcher(BoundedChannel<LogEntry>& channel);
LogDispatcher(const LogDispatcher&) = delete;
LogDispatcher& operator=(const LogDispatcher&) = delete;
LogDispatcher(LogDispatcher&&) = delete;
LogDispatcher& operator=(LogDispatcher&&) = delete;
~LogDispatcher() = default;
/**
* @brief Drain the channel and forward entries to spdlog.
*
* Intended to be called once on a dedicated thread. The loop returns after
* the channel has been closed and all queued entries have been processed.
*/
void Run();
private:
BoundedChannel<LogEntry>& channel_;
static spdlog::level::level_enum ToSpdlogLevel(LogLevel level);
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOG_DISPATCHER_H_
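
The contract described for `LogDispatcher::Run()` (keep consuming until the channel is closed and drained) can be illustrated with a small stand-in channel. This is a minimal sketch under stated assumptions: `SketchChannel` and `DrainAll` are hypothetical names, and the real `BoundedChannel` in `concurrency/bounded_channel.h` is not shown in this diff, so its API may differ.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for BoundedChannel<T>; the real class is not shown.
template <typename T>
class SketchChannel {
 public:
  explicit SketchChannel(std::size_t capacity) : capacity_(capacity) {}

  // Blocks while the queue is full. Returns false if the channel is closed.
  bool Push(T value) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock,
                   [this] { return closed_ || queue_.size() < capacity_; });
    if (closed_) {
      return false;
    }
    queue_.push_back(std::move(value));
    not_empty_.notify_one();
    return true;
  }

  // Blocks while the queue is empty and the channel is still open.
  // Returns std::nullopt once the channel is closed and fully drained.
  std::optional<T> Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return closed_ || !queue_.empty(); });
    if (queue_.empty()) {
      return std::nullopt;
    }
    T value = std::move(queue_.front());
    queue_.pop_front();
    not_full_.notify_one();
    return value;
  }

  // Marks the channel closed and wakes every waiting producer and consumer.
  void Close() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      closed_ = true;
    }
    not_full_.notify_all();
    not_empty_.notify_all();
  }

 private:
  std::size_t capacity_;
  std::mutex mutex_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
  std::deque<T> queue_;
  bool closed_ = false;
};

// Mirrors the shape of a dispatcher loop: consume until closed and drained.
std::vector<std::string> DrainAll(SketchChannel<std::string>& channel) {
  std::vector<std::string> seen;
  while (auto entry = channel.Pop()) {
    seen.push_back(std::move(*entry));
  }
  return seen;
}
```

In this sketch, entries queued before `Close()` are still delivered, which matches the "closed and drained" wording in the header comment; a real dispatcher would forward each entry to its sink instead of collecting them in a vector.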


@@ -0,0 +1,88 @@
/**
* @file services/logging/log_entry.h
* @brief Structured log record shared by the pipeline logging infra.
*
* LogEntry is a lightweight value type that can be passed safely between the
* logging producer and dispatcher through BoundedChannel<LogEntry>.
*/
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOG_ENTRY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOG_ENTRY_H_
#include <chrono>
#include <source_location>
#include <string>
#include <thread>
#include <vector>
/**
* @enum LogLevel
* @brief Severity levels supported by the logging infra.
*/
enum class LogLevel {
Debug, ///< Development/debugging information.
Info, ///< General informational messages.
Warn, ///< Warning conditions.
Error, ///< Error conditions.
};
/**
* @enum PipelinePhase
* @brief Pipeline execution phases used to tag log records.
*
* The phase tag makes it easier to correlate log output with the part of the
* pipeline that emitted it.
*/
enum class PipelinePhase {
Startup, ///< Initialization and validation.
UserGeneration, ///< User profile generation.
BreweryAndBeerGeneration, ///< Brewery and beer data generation.
CheckinGeneration, ///< Checkin (visit) record generation.
RatingGeneration, ///< Rating and review generation.
FollowGeneration, ///< Follow relationship generation.
Teardown, ///< Finalization and cleanup.
};
/**
* @struct LogDTO
* @brief User-provided subset of log fields. ILogger::Log() adds the call-site
* metadata (origin, timestamp, thread ID) automatically.
*/
struct LogDTO {
LogLevel level;
PipelinePhase phase;
std::string message;
};
/**
* @struct LogEntry
* @brief Single structured log event.
*
* All fields are value types, which keeps transfer across the bounded channel
* simple and avoids shared ownership.
*
* NOTE: timestamp, thread_id, and origin must be populated by ILogger::Log()
* before the entry is dispatched.
*/
struct LogEntry {
/// @brief Timestamp when the entry was created.
std::chrono::system_clock::time_point timestamp{};
/// @brief Source location where the log call was made.
std::source_location origin{};
/// @brief Thread responsible for emitting the log.
std::thread::id thread_id{};
/// @brief Severity level of this entry.
LogLevel level;
/// @brief Pipeline phase associated with the entry.
PipelinePhase phase;
/// @brief Log message text.
std::string message;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOG_ENTRY_H_


@@ -0,0 +1,53 @@
/**
* @file services/logging/log_producer.h
* @brief Channel-backed log producer for asynchronous pipeline logging.
*
* The producer captures log records from application code and forwards them to
* a bounded channel for later processing by the dispatcher.
*/
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOG_PRODUCER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOG_PRODUCER_H_
#include <string_view>
#include "concurrency/bounded_channel.h"
#include "services/logging/log_entry.h"
#include "services/logging/logger.h"
/**
* @class LogProducer
* @brief ILogger implementation that forwards entries to a bounded channel.
*
* Non-copyable and non-movable. The channel reference is non-owning and must
* remain valid for the lifetime of the producer.
*/
class LogProducer final : public ILogger {
public:
/**
* @brief Construct a channel-backed producer.
*
* @param channel Reference to the bounded channel used for log transfer.
*/
explicit LogProducer(BoundedChannel<LogEntry>& channel);
LogProducer(const LogProducer&) = delete;
LogProducer& operator=(const LogProducer&) = delete;
LogProducer(LogProducer&&) = delete;
LogProducer& operator=(LogProducer&&) = delete;
~LogProducer() override = default;
/**
* @brief Queue a log message for asynchronous processing.
*
* Blocks while the channel applies backpressure. This blocking behavior
* under heavy load is an accepted trade-off for simplicity.
*/
void DoLog(LogEntry log_entry) override;
private:
BoundedChannel<LogEntry>& channel_;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOG_PRODUCER_H_


@@ -0,0 +1,64 @@
/**
* @file services/logging/logger.h
* @brief Abstract logging interface used by pipeline components.
*
* The interface keeps application code independent from the concrete logging
* transport, buffering, and formatting implementation.
*/
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOGGER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOGGER_H_
#include <chrono>
#include <source_location>
#include <string>
#include <thread>
#include <utility>
#include "services/logging/log_entry.h"
/**
* @class ILogger
* @brief Minimal interface for submitting structured log messages.
*
* Implementations are non-copyable and non-movable. They are typically owned
* by the composition root and injected into services that emit diagnostics.
*/
class ILogger {
public:
ILogger() = default;
ILogger(const ILogger&) = delete;
ILogger& operator=(const ILogger&) = delete;
ILogger(ILogger&&) = delete;
ILogger& operator=(ILogger&&) = delete;
virtual ~ILogger() = default;
/**
* @brief Submit a log message to the logging subsystem.
*
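* @param payload User-provided log data (level, phase, message).
* @param origin Auto-captured source location of the call site.
* @param timestamp Time of the call; defaults to the current system time.
* @param thread_id Calling thread; defaults to the current thread's ID.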
*/
void Log(LogDTO payload,
std::source_location origin = std::source_location::current(),
std::chrono::system_clock::time_point timestamp = std::chrono::system_clock::now(),
std::thread::id thread_id = std::this_thread::get_id()) {
LogEntry entry;
entry.timestamp = timestamp;
entry.thread_id = thread_id;
entry.level = payload.level;
entry.phase = payload.phase;
entry.message = std::move(payload.message);
entry.origin = origin;
DoLog(std::move(entry));
}
protected:
/**
* @brief Underlying implementation to transport the log entry.
*
* Implementations must be thread-safe as DoLog can be called concurrently
* from multiple worker threads.
*/
virtual void DoLog(LogEntry log_entry) = 0;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_LOGGING_LOGGER_H_


@@ -0,0 +1,82 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPTING_PROMPT_DIRECTORY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPTING_PROMPT_DIRECTORY_H_
/**
* @file services/prompting/prompt_directory.h
* @brief Interface and filesystem-backed implementation for named prompt
* loading.
*
* Prompt files are resolved by key: a key of "BREWERY_GENERATION" maps to the
* file <prompt_dir>/BREWERY_GENERATION.md. The interface is kept intentionally
* narrow so test doubles can be injected without touching the filesystem.
*/
#include <filesystem>
#include <memory>
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_map>
#include "services/logging/logger.h"
/**
* @brief Interface for loading named prompt files.
*/
class IPromptDirectory {
public:
IPromptDirectory() = default;
IPromptDirectory(const IPromptDirectory&) = delete;
IPromptDirectory& operator=(const IPromptDirectory&) = delete;
IPromptDirectory(IPromptDirectory&&) = delete;
IPromptDirectory& operator=(IPromptDirectory&&) = delete;
virtual ~IPromptDirectory() = default;
/**
* @brief Loads the prompt associated with @p key.
*
* @param key Logical prompt key, e.g. "BREWERY_GENERATION".
* @return Prompt text.
* @throws std::runtime_error if the prompt file cannot be found or read.
*/
[[nodiscard]] virtual std::string Load(std::string_view key) = 0;
};
/**
* @brief Filesystem-backed IPromptDirectory implementation.
*
* Each call to Load() checks an in-process cache first, then reads
* <prompt_dir>/<key>.md from disk. The directory must exist and be readable
* at construction time; individual file absence is reported lazily at Load().
*/
class PromptDirectory final : public IPromptDirectory {
public:
/**
* @brief Constructs a PromptDirectory rooted at @p prompt_dir.
*
* @param prompt_dir Absolute or relative path to the prompt directory.
* @throws std::runtime_error if @p prompt_dir does not exist or is not a
* directory.
*/
explicit PromptDirectory(const std::filesystem::path& prompt_dir);
PromptDirectory(const std::filesystem::path& prompt_dir,
std::shared_ptr<ILogger> logger);
/**
* @brief Loads the prompt for @p key, caching the result.
*
* Maps @p key → <prompt_dir>/<key>.md.
*
* @param key Logical prompt key.
* @return Prompt text.
* @throws std::runtime_error if the file does not exist or is empty.
*/
[[nodiscard]] std::string Load(std::string_view key) override;
private:
std::filesystem::path prompt_dir_;
std::shared_ptr<ILogger> logger_;
std::unordered_map<std::string, std::string> cache_;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPTING_PROMPT_DIRECTORY_H_
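
The header comment notes that the interface is kept narrow so test doubles can be injected without touching the filesystem. A minimal sketch of such a double follows. Both `PromptSourceSketch` (a stand-in that mirrors the `Load` signature of `IPromptDirectory`) and `InMemoryPromptDirectory` are hypothetical names introduced here for illustration; they do not exist in the repository.

```cpp
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>

// Stand-in mirroring the Load() signature of IPromptDirectory (assumption:
// declared locally so that this sketch compiles on its own).
class PromptSourceSketch {
 public:
  virtual ~PromptSourceSketch() = default;
  [[nodiscard]] virtual std::string Load(std::string_view key) = 0;
};

// Hypothetical test double: serves prompts from memory instead of from disk.
class InMemoryPromptDirectory final : public PromptSourceSketch {
 public:
  explicit InMemoryPromptDirectory(
      std::unordered_map<std::string, std::string> prompts)
      : prompts_(std::move(prompts)) {}

  // Follows the documented contract: throw if the prompt is missing or empty.
  [[nodiscard]] std::string Load(std::string_view key) override {
    const auto iter = prompts_.find(std::string(key));
    if (iter == prompts_.end() || iter->second.empty()) {
      throw std::runtime_error("prompt not found or empty: " +
                               std::string(key));
    }
    return iter->second;
  }

 private:
  std::unordered_map<std::string, std::string> prompts_;
};
```

A unit test could hand this double to any component that only depends on the prompt-loading interface, which keeps the test independent of a prompt directory on disk.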


@@ -1,10 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_HELPERS_H_
/* Umbrella header for backward compatibility. */
#include "services/sqlite_handle_types.h"
#include "services/sqlite_connection_helpers.h"
#include "services/sqlite_statement_helpers.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_HELPERS_H_


@@ -1,54 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_CURL_WEB_CLIENT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_CURL_WEB_CLIENT_H_
/**
* @file web_client/curl_web_client.h
* @brief libcurl-based WebClient implementation.
*/
#include "web_client/web_client.h"
/**
* @brief RAII wrapper for curl_global_init and curl_global_cleanup.
*
* Create one instance in application startup before using libcurl and keep it
* alive for application lifetime.
*/
class CurlGlobalState {
public:
/// @brief Initializes global libcurl state.
CurlGlobalState();
/// @brief Cleans up global libcurl state.
~CurlGlobalState();
/// @brief Non-copyable type.
CurlGlobalState(const CurlGlobalState&) = delete;
/// @brief Non-copyable type.
CurlGlobalState& operator=(const CurlGlobalState&) = delete;
};
/**
* @brief WebClient implementation backed by libcurl.
*/
class CURLWebClient : public WebClient {
public:
/**
* @brief Executes an HTTP GET request.
*
* @param url Request URL.
* @return Response body.
*/
std::string Get(const std::string& url) override;
/**
* @brief URL-encodes a string value.
*
* @param value Raw value.
* @return URL-encoded string.
*/
std::string UrlEncode(const std::string& value) override;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_CURL_WEB_CLIENT_H_


@@ -0,0 +1,56 @@
/**
* @file web_client/http_web_client.h
* @brief cpp-httplib implementation of the WebClient interface.
*/
#ifndef BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_HTTP_WEB_CLIENT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_HTTP_WEB_CLIENT_H_
#include "web_client/web_client.h"
#include "services/logging/logger.h"
#include <memory>
#include <string>
#include <utility>
/**
* @brief WebClient implementation backed by cpp-httplib.
*
* Supports HTTP and HTTPS (requires OpenSSL; see HTTPLIB_REQUIRE_OPENSSL
* in CMakeLists.txt).
*
* URL parsing splits a full URL into origin (scheme://host[:port]) and
* path + query so that httplib::Client can be constructed correctly.
* A new client instance is created per request because the client is
* bound to a single origin at construction time.
*/
class HttpWebClient final : public WebClient {
public:
explicit HttpWebClient(std::shared_ptr<ILogger> logger)
: logger_(std::move(logger)) {}
~HttpWebClient() override = default;
/**
* @brief Executes a blocking HTTP/HTTPS GET request against a full URL.
*
* @param url Fully-qualified URL, e.g. "https://en.wikipedia.org/api/rest_v1/page/summary/Berlin"
* @return Response body on HTTP 2xx.
* @throws std::runtime_error otherwise (any outcome other than HTTP 2xx).
*/
std::string Get(const std::string& url) override;
/**
* @brief Percent-encodes a single URI component (query parameter value or
* path segment). Delegates to httplib::encode_uri_component().
*
* @param value Raw string to encode.
* @return Percent-encoded string safe for use in a URL.
*/
std::string EncodeURL(const std::string& value) override;
private:
std::shared_ptr<ILogger> logger_;
};
#endif
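
The class comment describes splitting a full URL into an origin (`scheme://host[:port]`) and a path plus query. One way that split could look is sketched below. `SplitUrlSketch` is a hypothetical helper written for illustration; the parsing code actually used by `HttpWebClient` is not shown in this diff and may handle more cases (fragments, user info, IPv6 literals).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: splits "scheme://host[:port]/path?query" into
// {origin, path_and_query}. Not the pipeline's actual parser.
std::pair<std::string, std::string> SplitUrlSketch(const std::string& url) {
  const std::size_t scheme_end = url.find("://");
  if (scheme_end == std::string::npos) {
    throw std::invalid_argument("URL has no scheme: " + url);
  }
  // The authority (host[:port]) ends at the first '/' or '?' after "://".
  const std::size_t rest_start = url.find_first_of("/?", scheme_end + 3);
  if (rest_start == std::string::npos) {
    return {url, "/"};  // No path given, so request the root.
  }
  std::string rest = url.substr(rest_start);
  if (rest.front() == '?') {
    rest.insert(rest.begin(), '/');  // "host?x=1" means path "/" plus query.
  }
  return {url.substr(0, rest_start), rest};
}
```

Under this sketch, the first element would be used to construct the per-request client and the second would be passed as the request target, which is consistent with the comment that a client is bound to a single origin at construction time.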


@@ -30,7 +30,7 @@ class WebClient {
* @param value Raw string value. * @param value Raw string value.
* @return Encoded value safe for URL usage. * @return Encoded value safe for URL usage.
*/ */
virtual std::string UrlEncode(const std::string& value) = 0; virtual std::string EncodeURL(const std::string& value) = 0;
}; };
#endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_WEB_CLIENT_H_ #endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_WEB_CLIENT_H_


@@ -0,0 +1,9 @@
# Ignore model files!
*.gguf
*.bin
models/
weights/
# Ignore local build folders
build/
.git/


@@ -0,0 +1,72 @@
# --- Stage 1: Build Environment (The "Heavy" Stage) ---
FROM nvidia/cuda:12.6.3-devel-ubuntu24.04 AS builder
ENV DEBIAN_FRONTEND=noninteractive \
CMAKE_GENERATOR=Ninja
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential ca-certificates curl git libboost-json-dev \
libboost-program-options-dev libssl-dev ninja-build pkg-config zlib1g-dev \
&& rm -rf /var/lib/apt/lists/*
# Install modern CMake
RUN curl -L https://github.com/Kitware/CMake/releases/download/v3.31.0/cmake-3.31.0-linux-x86_64.sh -o cmake.sh && \
sh cmake.sh --skip-license --prefix=/usr/local && rm cmake.sh
# Get headers for C++ build
RUN curl -L https://github.com/ggml-org/llama.cpp/archive/refs/tags/b9012.tar.gz -o /tmp/llama-src.tar.gz && \
tar -xzf /tmp/llama-src.tar.gz -C /tmp && \
cp -r /tmp/llama.cpp-b9012/include/* /usr/local/include/ && \
cp -r /tmp/llama.cpp-b9012/ggml/include/* /usr/local/include/
# Pull llama.cpp binaries to use during build if needed
COPY --from=ghcr.io/ggml-org/llama.cpp:full-cuda /app/lib*.so* /usr/local/lib/
WORKDIR /app
COPY . .
# Build the C++ pipeline
RUN cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && \
cmake --build build -j$(nproc)
# --- Stage 2: Runtime Environment (The "Slim" Stage) ---
FROM nvidia/cuda:12.6.3-runtime-ubuntu24.04 AS runtime
# Install only necessary runtime shared libraries
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
ca-certificates \
libboost-json1.83.0 \
libboost-program-options1.83.0 \
libgomp1 \
libssl3 \
zlib1g \
&& rm -rf /var/lib/apt/lists/*
ENV APP_ROOT=/app \
LD_LIBRARY_PATH="/usr/local/lib:${LD_LIBRARY_PATH}"
WORKDIR /app/build
# Copy only the compiled binaries from the builder
COPY --from=builder /app/build/biergarten-pipeline ./
# Copy required config files
COPY locations.json /app/build/
COPY beer-styles.json /app/build/
# Copy prompt templates
COPY prompts /app/prompts
# Copy only the necessary shared libraries from builder/llama-bin
COPY --from=ghcr.io/ggml-org/llama.cpp:full-cuda /app/lib*.so* /usr/local/lib/
# Co-locate plugins
RUN cp /usr/local/lib/libggml-cuda.so . 2>/dev/null || true && \
cp /usr/local/lib/libggml-cpu*.so . 2>/dev/null || true
# Setup Start Script
COPY ./runpod/start.sh /usr/local/bin/biergarten-start
RUN chmod +x /usr/local/bin/biergarten-start
ENTRYPOINT ["/usr/local/bin/biergarten-start"]


@@ -0,0 +1,8 @@
```bash
touch runpod/start.sh
docker build \
--progress=plain \
-t biergarten-pipeline:latest \
-f runpod/Dockerfile \
. 2>&1 | tee build.log
```


@@ -0,0 +1,22 @@
name: biergarten-pipeline-live
imageName: biergarten-pipeline:latest
category: NVIDIA
containerDiskInGb: 50
volumeInGb: 50
volumeMountPath: /workspace
dockerEntrypoint:
- /usr/local/bin/biergarten-start
dockerStartCmd: []
isPublic: false
isServerless: false
env:
BIERGARTEN_MODE: live
BIERGARTEN_MODEL_PATH: /workspace/models/google_gemma-4-E4B-it-Q6_K.gguf
BIERGARTEN_PROMPT_DIR: /workspace/app/build/prompts
BIERGARTEN_OUTPUT_DIR: /workspace/output
BIERGARTEN_LOG_PATH: /workspace/logs/pipeline.log
BIERGARTEN_TEMPERATURE: "1.0"
BIERGARTEN_TOP_P: "0.95"
BIERGARTEN_TOP_K: "64"
BIERGARTEN_N_CTX: "8192"
BIERGARTEN_SEED: "-1"


@@ -0,0 +1,58 @@
#!/bin/bash
set -e
MODEL_PATH="${BIERGARTEN_MODEL_PATH:-/workspace/models/google_gemma-4-E4B-it-Q6_K.gguf}"
OUTPUT_DIR="${BIERGARTEN_OUTPUT_DIR:-/workspace/output}"
LOG_PATH="${BIERGARTEN_LOG_PATH:-/workspace/logs/pipeline.log}"
EXECUTABLE="/app/build/biergarten-pipeline"
PROMPT_DIR="/app/prompts"
echo "--- Starting Biergarten Pipeline Environment Check ---"
# Ensure directories exist
mkdir -p "$OUTPUT_DIR"
mkdir -p "$(dirname "$LOG_PATH")"
mkdir -p "$(dirname "$MODEL_PATH")"
# Download model if missing
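if [ ! -f "$MODEL_PATH" ]; then
echo "Model not found. Downloading (this may take a while)..."
# Download to a temporary file so an interrupted transfer is never mistaken
# for a complete model. --fail stops an HTTP error page from being saved.
curl --fail -L -C - \
-o "$MODEL_PATH.part" \
"https://huggingface.co/bartowski/google_gemma-4-E4B-it-GGUF/resolve/main/google_gemma-4-E4B-it-Q6_K.gguf?download=true"
mv "$MODEL_PATH.part" "$MODEL_PATH"
echo "Download complete."
fi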
# Verify model exists
if [ ! -f "$MODEL_PATH" ]; then
echo "ERROR: Model still not found after download attempt."
exit 1
fi
# Default GPU layers
GL_LAYERS="${BIERGARTEN_GL_LAYERS:-40}"
# Build args
ARGS=(
"--model" "$MODEL_PATH"
"--prompt-dir" "$PROMPT_DIR"
"--output" "$OUTPUT_DIR"
"--log-path" "$LOG_PATH"
"--n-gpu-layers" "$GL_LAYERS"
)
# Optional params
[[ -n "$BIERGARTEN_TEMPERATURE" ]] && ARGS+=("--temperature" "$BIERGARTEN_TEMPERATURE")
[[ -n "$BIERGARTEN_TOP_P" ]] && ARGS+=("--top-p" "$BIERGARTEN_TOP_P")
[[ -n "$BIERGARTEN_TOP_K" ]] && ARGS+=("--top-k" "$BIERGARTEN_TOP_K")
[[ -n "$BIERGARTEN_N_CTX" ]] && ARGS+=("--n-ctx" "$BIERGARTEN_N_CTX")
[[ -n "$BIERGARTEN_SEED" ]] && ARGS+=("--seed" "$BIERGARTEN_SEED")
# Extra args
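# Split on whitespace into an array; unlike an unquoted expansion this does
# not glob-expand characters such as '*'.
if [[ -n "$BIERGARTEN_EXTRA_ARGS" ]]; then
read -r -a EXTRA_ARGS <<< "$BIERGARTEN_EXTRA_ARGS"
ARGS+=("${EXTRA_ARGS[@]}")
fi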
echo "--- Executing: $EXECUTABLE ${ARGS[*]} ---"
exec "$EXECUTABLE" "${ARGS[@]}"

@@ -0,0 +1,214 @@
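#include <boost/program_options.hpp>
#include <chrono>
#include <cstdint>
#include <format>
#include <iostream>
#include <memory>
#include <optional>
#include <sstream>
#include <string>
#include "data_model/models.h"
#include "services/logging/logger.h"
// Alias for Boost.Program_options, used throughout this file.
namespace prog_opts = boost::program_options;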
std::optional<ApplicationOptions> ParseArguments(
const int argc, char** argv, std::shared_ptr<ILogger> logger) {
prog_opts::options_description desc("Pipeline Options");
auto opt = desc.add_options();
opt("help,h", "Produce help message");
// Defaults sourced from SamplingOptions{} so the CLI and LlamaGenerator
// share a single source of truth — changing the struct updates both.
auto add_sampling_options = [&]() -> void {
const SamplingOptions sampling_defaults{};
opt("temperature",
prog_opts::value<float>()->default_value(sampling_defaults.temperature),
"Sampling temperature (higher = more random)");
opt("top-p",
prog_opts::value<float>()->default_value(sampling_defaults.top_p),
"Nucleus sampling top-p in (0,1] (higher = more random)");
opt("top-k",
prog_opts::value<uint32_t>()->default_value(sampling_defaults.top_k),
"Top-k sampling parameter (higher = more candidate tokens)");
opt("n-ctx",
prog_opts::value<uint32_t>()->default_value(sampling_defaults.n_ctx),
"Context window size in tokens");
opt("seed", prog_opts::value<int>()->default_value(sampling_defaults.seed),
"Sampler seed: -1 for random, otherwise non-negative integer");
opt("n-gpu-layers", prog_opts::value<int>()->default_value(0),
"Number of layers to offload to GPU");
};
// --mocked and --model are mutually exclusive; validation is enforced below
// rather than at registration to produce a clear diagnostic message.
auto add_generator_options = [&]() -> void {
opt("mocked", prog_opts::bool_switch(),
"Use mocked generator for brewery/user data");
opt("model,m", prog_opts::value<std::string>()->default_value(""),
"Path to LLM model (gguf)");
};
auto add_pipeline_options = [&]() -> void {
opt("output,o", prog_opts::value<std::string>()->default_value("output"),
"Directory for generated artifacts");
opt("log-path",
prog_opts::value<std::string>()->default_value("pipeline.log"),
"Path for application logs");
opt("prompt-dir", prog_opts::value<std::string>()->default_value(""),
"Directory containing named prompt files (e.g. BREWERY_GENERATION.md)."
" Required when not using --mocked.");
opt("location-count", prog_opts::value<uint32_t>()->default_value(10),
"Number of locations to sample from locations.json");
};
add_sampling_options();
add_generator_options();
add_pipeline_options();
// No flags provided — treat as a help request rather than an error.
if (argc == 1) {
const std::string title = "Biergarten Pipeline";
const std::string usage = ([&] {
std::stringstream usage_stream;
usage_stream << "\nUsage: biergarten-pipeline [options]\n\n" << desc;
return usage_stream.str();
})();
if (logger) {
logger->Log(LogDTO{.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = title});
logger->Log(LogDTO{.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = usage});
}
return std::nullopt;
}
try {
prog_opts::variables_map var_map;
prog_opts::store(prog_opts::parse_command_line(argc, argv, desc), var_map);
prog_opts::notify(var_map);
if (var_map.contains("help")) {
std::stringstream help_stream;
help_stream << "\n" << desc;
if (logger) {
logger->Log(LogDTO{.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = help_stream.str()});
}
return std::nullopt;
}
ApplicationOptions options;
options.pipeline.output_path = var_map["output"].as<std::string>();
options.pipeline.log_path = var_map["log-path"].as<std::string>();
options.pipeline.prompt_dir = var_map["prompt-dir"].as<std::string>();
options.pipeline.location_count = var_map["location-count"].as<uint32_t>();
const bool use_mocked = var_map["mocked"].as<bool>();
const std::string model_path = var_map["model"].as<std::string>();
const int n_gpu_layers = var_map["n-gpu-layers"].as<int>();
// Enforce mutual exclusivity before any further configuration is applied.
if (use_mocked && !model_path.empty()) {
const std::string msg =
"Invalid arguments: --mocked and --model are mutually exclusive";
if (logger) {
logger->Log(LogDTO{.level = LogLevel::Error,
.phase = PipelinePhase::Startup,
.message = msg});
} else {
std::cerr << msg << std::endl;
}
return std::nullopt;
}
if (!use_mocked && model_path.empty()) {
const std::string msg =
"Invalid arguments: either --mocked or --model must be specified";
if (logger) {
logger->Log(LogDTO{.level = LogLevel::Error,
.phase = PipelinePhase::Startup,
.message = msg});
} else {
std::cerr << msg << std::endl;
}
return std::nullopt;
}
// Prompt directory is only meaningful for live inference — the mock
// generator has no use for it and should not require it to be present.
if (!use_mocked && options.pipeline.prompt_dir.empty()) {
const std::string msg =
"Invalid arguments: --prompt-dir is required when not using --mocked";
if (logger) {
logger->Log({.level = LogLevel::Error,
.phase = PipelinePhase::Startup,
.message = msg});
} else {
std::cerr << msg << std::endl;
}
return std::nullopt;
}
options.generator.use_mocked = use_mocked;
options.generator.model_path = model_path;
// options.generator.n_gpu_layers = n_gpu_layers;
// Only populate sampling config when the user explicitly overrides at
// least one value. Leaving it as std::nullopt lets LlamaGenerator fall
// back to its own SamplingOptions{} defaults, keeping the two paths
// consistent without redundant copies.
const bool user_provided_sampling =
!var_map["temperature"].defaulted() || !var_map["top-p"].defaulted() ||
!var_map["top-k"].defaulted() || !var_map["n-ctx"].defaulted() ||
!var_map["seed"].defaulted() || !var_map["n-gpu-layers"].defaulted();
if (user_provided_sampling) {
// Warn but do not fail — the run is still valid, the flags are just
// silently irrelevant when no model is loaded.
if (use_mocked) {
const std::string msg =
"Sampling parameters are ignored when using --mocked";
if (logger) {
logger->Log(LogDTO{.level = LogLevel::Warn,
.phase = PipelinePhase::Startup,
.message = msg});
} else {
std::cerr << msg << std::endl;
}
} else {
SamplingOptions sampling;
sampling.temperature = var_map["temperature"].as<float>();
sampling.top_p = var_map["top-p"].as<float>();
sampling.top_k = var_map["top-k"].as<uint32_t>();
sampling.n_ctx = var_map["n-ctx"].as<uint32_t>();
sampling.seed = var_map["seed"].as<int>();
sampling.n_gpu_layers = var_map["n-gpu-layers"].as<int>();
options.generator.sampling = sampling;
}
}
return options;
} catch (const std::exception& exception) {
const std::string msg = std::format(
"Failed to parse command-line arguments: {}", exception.what());
if (logger) {
logger->Log(LogDTO{.level = LogLevel::Error,
.phase = PipelinePhase::Startup,
.message = msg});
} else {
// Match the validation branches above: never drop the diagnostic.
std::cerr << msg << std::endl;
}
return std::nullopt;
} catch (...) {
const std::string msg =
"Failed to parse command-line arguments: unknown error";
if (logger) {
logger->Log(LogDTO{.level = LogLevel::Error,
.phase = PipelinePhase::Startup,
.message = msg});
} else {
std::cerr << msg << std::endl;
}
return std::nullopt;
}
}
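
The three startup checks in `ParseArguments` can be read as one pure function over the parsed flags. The following is a minimal sketch of that idea, not code from this commit: `GeneratorFlags` and `ValidateGeneratorFlags` are hypothetical names introduced for illustration, and only the diagnostic strings are taken from the file above.

```cpp
#include <optional>
#include <string>

// Hypothetical, simplified view of the flags the checks depend on.
struct GeneratorFlags {
  bool use_mocked = false;
  std::string model_path;
  std::string prompt_dir;
};

// Returns a diagnostic when the combination is invalid, or std::nullopt
// when the flags are acceptable. The order mirrors the checks above.
std::optional<std::string> ValidateGeneratorFlags(const GeneratorFlags& flags) {
  if (flags.use_mocked && !flags.model_path.empty()) {
    return std::string{
        "Invalid arguments: --mocked and --model are mutually exclusive"};
  }
  if (!flags.use_mocked && flags.model_path.empty()) {
    return std::string{
        "Invalid arguments: either --mocked or --model must be specified"};
  }
  if (!flags.use_mocked && flags.prompt_dir.empty()) {
    return std::string{
        "Invalid arguments: --prompt-dir is required when not using --mocked"};
  }
  return std::nullopt;
}
```

Keeping the rules free of logging and Boost types would let them be unit-tested without constructing a command line; whether that split is worth it here is a design choice for the author.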

@@ -1,16 +0,0 @@
/**
* @file biergarten_data_generator/biergarten_data_generator.cc
* @brief BiergartenDataGenerator constructor implementation.
*/
#include "biergarten_data_generator.h"
#include <utility>
BiergartenDataGenerator::BiergartenDataGenerator(
std::unique_ptr<IEnrichmentService> context_service,
std::unique_ptr<DataGenerator> generator,
std::unique_ptr<IExportService> exporter)
: context_service_(std::move(context_service)),
generator_(std::move(generator)),
exporter_(std::move(exporter)) {}

@@ -1,58 +0,0 @@
/**
* @file biergarten_data_generator/generate_breweries.cc
* @brief BiergartenDataGenerator::GenerateBreweries() implementation.
*/
#include <spdlog/spdlog.h>
#include "biergarten_data_generator.h"
void BiergartenDataGenerator::GenerateBreweries(
std::span<const EnrichedCity> cities) {
spdlog::info("\n=== SAMPLE BREWERY GENERATION ===");
generated_breweries_.clear();
size_t skipped_count = 0;
size_t export_failed_count = 0;
for (const auto& [location, region_context] : cities) {
try {
const BreweryResult brewery =
generator_->GenerateBrewery(location, region_context);
const GeneratedBrewery gen{.location = location, .brewery = brewery};
generated_breweries_.push_back(gen);
try {
exporter_->ProcessRecord(gen);
} catch (const std::exception& export_exception) {
++export_failed_count;
spdlog::warn(
"[Pipeline] Generated brewery for '{}' ({}) but SQLite export "
"failed: {}",
location.city, location.country, export_exception.what());
}
} catch (const std::exception& e) {
++skipped_count;
spdlog::warn(
"[Pipeline] Skipping city '{}' ({}): brewery generation failed: "
"{}",
location.city, location.country, e.what());
}
}
if (skipped_count > 0) {
spdlog::warn("[Pipeline] Skipped {} city/cities due to generation errors",
skipped_count);
}
if (export_failed_count > 0) {
spdlog::warn(
"[Pipeline] Failed to export {} generated brewery/breweries to "
"SQLite",
export_failed_count);
}
}

@@ -1,26 +0,0 @@
/**
* @file biergarten_data_generator/log_results.cc
* @brief BiergartenDataGenerator::LogResults() implementation.
*/
#include <spdlog/spdlog.h>
#include "biergarten_data_generator.h"
void BiergartenDataGenerator::LogResults() const {
spdlog::info("\n=== GENERATED DATA DUMP ===");
size_t index = 1;
for (const auto& [location, brewery] : generated_breweries_) {
spdlog::info(
"{}. city=\"{}\" country=\"{}\" state=\"{}\" "
"iso3166_2={} lat={} lon={}",
index, location.city, location.country, location.state_province,
location.iso3166_2, location.latitude, location.longitude);
spdlog::info(" brewery_name_en=\"{}\"", brewery.name_en);
spdlog::info(" brewery_description_en=\"{}\"", brewery.description_en);
spdlog::info(" brewery_name_local=\"{}\"", brewery.name_local);
spdlog::info(" brewery_description_local=\"{}\"",
brewery.description_local);
++index;
}
}

@@ -1,41 +0,0 @@
/**
* @file biergarten_data_generator/query_cities_with_countries.cc
* @brief BiergartenDataGenerator::QueryCitiesWithCountries() implementation.
*/
#include <spdlog/spdlog.h>
#include <algorithm>
#include <filesystem>
#include <iterator>
#include <random>
#include "biergarten_data_generator.h"
#include "json_handling/json_loader.h"
static constexpr size_t kBreweryAmount = 50;
std::vector<Location> BiergartenDataGenerator::QueryCitiesWithCountries() {
spdlog::info("\n=== GEOGRAPHIC DATA OVERVIEW ===");
const std::filesystem::path locations_path = "locations.json";
auto all_locations = JsonLoader::LoadLocations(locations_path);
spdlog::info(" Locations available: {}", all_locations.size());
const size_t sample_count = std::min(kBreweryAmount, all_locations.size());
const auto sample_count_signed =
static_cast<std::iter_difference_t<decltype(all_locations.cbegin())>>(
sample_count);
std::vector<Location> sampled_locations;
sampled_locations.reserve(sample_count);
std::random_device random_generator;
std::ranges::sample(all_locations, std::back_inserter(sampled_locations),
sample_count_signed, random_generator);
spdlog::info(" Sampled locations: {}", sampled_locations.size());
return sampled_locations;
}

@@ -1,52 +0,0 @@
/**
* @file biergarten_data_generator/run.cc
* @brief BiergartenDataGenerator::Run() implementation.
*/
#include <spdlog/spdlog.h>
#include <utility>
#include "biergarten_data_generator.h"
bool BiergartenDataGenerator::Run() {
try {
exporter_->Initialize();
std::vector<Location> cities = QueryCitiesWithCountries();
std::vector<EnrichedCity> enriched;
enriched.reserve(cities.size());
size_t skipped_count = 0;
for (auto& city : cities) {
try {
std::string region_context = context_service_->GetLocationContext(city);
spdlog::debug("[Pipeline] Context for '{}' ({}) gathered:\n{}",
city.city, city.country, region_context);
enriched.push_back(
EnrichedCity{.location = std::move(city),
.region_context = std::move(region_context)});
} catch (const std::exception& exception) {
++skipped_count;
spdlog::warn(
"[Pipeline] Skipping city '{}' ({}): context lookup failed: {}",
city.city, city.country, exception.what());
}
}
if (skipped_count > 0) {
spdlog::warn(
"[Pipeline] Skipped {} city/cities due to context lookup errors",
skipped_count);
}
this->GenerateBreweries(enriched);
exporter_->Finalize();
this->LogResults();
return true;
} catch (const std::exception& e) {
spdlog::error("Pipeline execution failed with error: {}", e.what());
return false;
}
}

@@ -0,0 +1,20 @@
/**
* @file biergarten_pipeline_orchestrator/biergarten_pipeline_orchestrator.cc
* @brief BiergartenPipelineOrchestrator constructor implementation.
*/
#include "biergarten_pipeline_orchestrator.h"
#include <utility>
BiergartenPipelineOrchestrator::BiergartenPipelineOrchestrator(
std::shared_ptr<ILogger> logger,
std::unique_ptr<IEnrichmentService> context_service,
std::unique_ptr<DataGenerator> generator,
std::unique_ptr<IExportService> exporter,
const ApplicationOptions &app_options)
: logger_(std::move(logger)),
context_service_(std::move(context_service)),
generator_(std::move(generator)),
exporter_(std::move(exporter)),
application_options_(app_options) {}

@@ -0,0 +1,68 @@
/**
* @file biergarten_pipeline_orchestrator/generate_breweries.cc
* @brief BiergartenPipelineOrchestrator::GenerateBreweries() implementation.
*/
#include <chrono>
#include <format>
#include "biergarten_pipeline_orchestrator.h"
#include "services/logging/logger.h"
void BiergartenPipelineOrchestrator::GenerateBreweries(
std::span<const EnrichedCity> cities) {
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::BreweryAndBeerGeneration,
.message = "=== SAMPLE BREWERY GENERATION ==="});
generated_breweries_.clear();
size_t skipped_count = 0;
size_t export_failed_count = 0;
for (const auto& [location, region_context] : cities) {
try {
const BreweryResult brewery =
generator_->GenerateBrewery(location, region_context);
const GeneratedBrewery gen{.location = location, .brewery = brewery};
generated_breweries_.push_back(gen);
try {
exporter_->ProcessRecord(gen);
} catch (const std::exception& export_exception) {
++export_failed_count;
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::BreweryAndBeerGeneration,
.message =
std::format("[Pipeline] Generated brewery for '{}' ({}) but SQLite export failed: {}",
location.city, location.country, export_exception.what())});
}
} catch (const std::exception& e) {
++skipped_count;
logger_->Log({.level = LogLevel::Warn,
.phase = PipelinePhase::BreweryAndBeerGeneration,
.message = std::format("[Pipeline] Skipping city '{}' ({}): brewery generation failed: {}",
location.city, location.country, e.what())});
}
}
if (skipped_count > 0) {
logger_->Log({.level = LogLevel::Warn,
.phase = PipelinePhase::BreweryAndBeerGeneration,
.message = std::format(
"[Pipeline] Skipped {} city/cities due to generation errors",
skipped_count)});
}
if (export_failed_count > 0) {
logger_->Log({.level = LogLevel::Warn,
.phase = PipelinePhase::BreweryAndBeerGeneration,
.message = std::format(
"[Pipeline] Failed to export {} generated brewery/breweries to SQLite",
export_failed_count)});
}
}

@@ -0,0 +1,37 @@
/**
* @file biergarten_pipeline_orchestrator/log_results.cc
* @brief BiergartenPipelineOrchestrator::LogResults() implementation.
*/
#include <boost/json/array.hpp>
#include <chrono>
#include <format>
#include <sstream>
#include "../../includes/json_handling/pretty_print.h"
#include "biergarten_pipeline_orchestrator.h"
#include "services/logging/logger.h"
void BiergartenPipelineOrchestrator::LogResults() const {
boost::json::array output;
for (const auto& [location, brewery] : generated_breweries_) {
output.push_back(boost::json::object{
{"name_en", brewery.name_en},
{"description_en", brewery.description_en},
{"name_local", brewery.name_local},
{"description_local", brewery.description_local},
{"location", boost::json::object{
{"city", location.city},
{"country", location.country},
{"state_province", location.state_province},
{"iso3166_2", location.iso3166_2},
{"latitude", location.latitude},
{"longitude", location.longitude},
}}});
}
std::ostringstream oss;
PrettyPrint(oss, output);
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Teardown,
.message = oss.str()});
}

@@ -0,0 +1,51 @@
/**
* @file biergarten_pipeline_orchestrator/query_cities_with_countries.cc
* @brief BiergartenPipelineOrchestrator::QueryCitiesWithCountries() implementation.
*/
#include <algorithm>
#include <chrono>
#include <filesystem>
#include <format>
#include <iterator>
#include <random>
#include "biergarten_pipeline_orchestrator.h"
#include "json_handling/json_loader.h"
#include "services/logging/logger.h"
std::vector<Location>
BiergartenPipelineOrchestrator::QueryCitiesWithCountries() {
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "=== GEOGRAPHIC DATA OVERVIEW ==="});
const std::filesystem::path locations_path = "locations.json";
auto all_locations = JsonLoader::LoadLocations(locations_path, logger_);
const size_t sample_count = std::min(
static_cast<size_t>(application_options_.pipeline.location_count),
all_locations.size());
const auto sample_count_signed =
static_cast<std::iter_difference_t<decltype(all_locations.cbegin())>>(
sample_count);
std::vector<Location> sampled_locations;
sampled_locations.reserve(sample_count);
std::random_device random_generator;
std::ranges::sample(all_locations, std::back_inserter(sampled_locations),
sample_count_signed, random_generator);
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = std::format(" Locations available: {}",
all_locations.size())});
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = std::format(" Sampled locations: {}",
sampled_locations.size())});
return sampled_locations;
}
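
The sampling step above clamps the requested count to the number of available locations before drawing without replacement. Below is a minimal sketch of that pattern under stated assumptions: `SampleUpTo` is a hypothetical helper, it uses `std::sample` (the C++17 counterpart of the `std::ranges::sample` call in this file), and it seeds a `std::mt19937` rather than passing `std::random_device` directly.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Draws up to `requested` elements from `pool` without replacement.
// If `requested` exceeds the pool size, every element is returned.
template <typename T>
std::vector<T> SampleUpTo(const std::vector<T>& pool, std::size_t requested) {
  const std::size_t count = std::min(requested, pool.size());
  std::vector<T> sampled;
  sampled.reserve(count);
  // Seed a pseudo-random engine once from the OS entropy source.
  std::mt19937 engine{std::random_device{}()};
  std::sample(pool.begin(), pool.end(), std::back_inserter(sampled), count,
              engine);
  return sampled;
}
```

Because the input is a forward range, `std::sample` preserves the relative order of the selected elements, so the sampled locations come out in file order rather than shuffled.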

@@ -0,0 +1,63 @@
/**
* @file biergarten_pipeline_orchestrator/run.cc
* @brief BiergartenPipelineOrchestrator::Run() implementation.
*/
#include <chrono>
#include <format>
#include <utility>
#include "biergarten_pipeline_orchestrator.h"
#include "services/logging/logger.h"
bool BiergartenPipelineOrchestrator::Run() {
try {
exporter_->Initialize();
std::vector<Location> cities = QueryCitiesWithCountries();
std::vector<EnrichedCity> enriched;
enriched.reserve(cities.size());
size_t skipped_count = 0;
for (auto& city : cities) {
try {
std::string region_context = context_service_->GetLocationContext(city);
// logger_->Log(LogLevel::Debug, PipelinePhase::UserGeneration,
// "[Pipeline] Context for '" + city.city + "' (" +
// city.iso3166_2 + ") gathered:\n" + region_context);
enriched.push_back(
EnrichedCity{.location = std::move(city),
.region_context = std::move(region_context)});
} catch (const std::exception& exception) {
++skipped_count;
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message = std::format(
"[Pipeline] Skipping city '{}' ({}): context lookup failed: {}",
city.city, city.country, exception.what())});
}
}
if (skipped_count > 0) {
logger_->Log({.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message = std::format(
"[Pipeline] Skipped {} city/cities due to context lookup errors",
skipped_count)});
}
this->GenerateBreweries(enriched);
exporter_->Finalize();
this->LogResults();
return true;
} catch (const std::exception& e) {
logger_->Log(
{.level = LogLevel::Error,
.phase = PipelinePhase::Teardown,
.message =
std::format("Pipeline execution failed with error: {}", e.what())});
return false;
}
}

@@ -4,8 +4,7 @@
 * inference, and validates structured JSON output for brewery records.
 */
-#include <spdlog/spdlog.h>
+#include <chrono>
 #include <format>
 #include <optional>
 #include <stdexcept>
@@ -33,6 +32,9 @@ static std::string FormatLocalLanguageCodes(
 return formatted;
 }
+// GBNF grammar for structured brewery JSON output.
+// @TODO move to a separate gbnf file if it grows in complexity or is shared
+// across modules.
 static constexpr std::string_view kBreweryJsonGrammar = R"json_brewery(
 root ::= thought-block "{" ws "\"name_en\"" ws ":" ws string ws "," ws "\"description_en\"" ws ":" ws string ws "," ws "\"name_local\"" ws ":" ws string ws "," ws "\"description_local\"" ws ":" ws string ws "}" ws
 thought-block ::= [^{]*
@@ -59,11 +61,12 @@ BreweryResult LlamaGenerator::GenerateBrewery(
 location.country.empty() ? std::string{}
 : std::format(", {}", location.country);
 /**
-* Load brewery system prompt from file
-* Falls back to minimal inline prompt if file not found
+* Load brewery system prompt via the injected prompt directory.
+* The key "BREWERY_GENERATION" resolves to BREWERY_GENERATION.md inside
+* the configured --prompt-dir. Throws on missing or empty file.
 */
 const std::string system_prompt =
-LoadBrewerySystemPrompt("prompts/system.md");
+prompt_directory_->Load("BREWERY_GENERATION");
 std::string user_prompt = std::format(
 "## CITY:\n{}\n\n## COUNTRY:\n{}\n\n## LOCAL LANGUAGE CODES:\n{}\n\n## "
@@ -96,8 +99,13 @@ BreweryResult LlamaGenerator::GenerateBrewery(
 // Generate brewery data from LLM
 raw = this->Infer(system_prompt, user_prompt, max_tokens,
 kBreweryJsonGrammar);
-spdlog::debug("LlamaGenerator: raw output (attempt {}): {}", attempt + 1,
-raw);
+if (logger_) {
+logger_->Log(
+{.level = LogLevel::Debug,
+.phase = PipelinePhase::BreweryAndBeerGeneration,
+.message = std::format("LlamaGenerator: raw output (attempt {}): {}",
+attempt + 1, raw)});
+}
 // Validate output: parse JSON and check required fields
@@ -108,9 +116,13 @@ BreweryResult LlamaGenerator::GenerateBrewery(
 if (!validation_error.has_value()) {
 // Success: return parsed brewery data
-spdlog::info(
-"LlamaGenerator: successfully generated brewery data on attempt {}",
-attempt + 1);
+if (logger_) {
+logger_->Log(
+{.level = LogLevel::Info,
+.phase = PipelinePhase::BreweryAndBeerGeneration,
+.message = std::format("LlamaGenerator: successfully generated brewery data on attempt {}",
+attempt + 1)});
+}
 return brewery;
 }
@@ -118,8 +130,14 @@ BreweryResult LlamaGenerator::GenerateBrewery(
 // Validation failed: log error and prepare corrective feedback
 last_error = *validation_error;
-spdlog::warn("LlamaGenerator: malformed brewery JSON (attempt {}): {}",
-attempt + 1, *validation_error);
+if (logger_) {
+logger_->Log(
+{.level = LogLevel::Warn,
+.phase = PipelinePhase::BreweryAndBeerGeneration,
+.message =
+std::format("LlamaGenerator: malformed brewery JSON (attempt {}): {}",
+attempt + 1, *validation_error)});
+}
 // Update prompt with error details to guide LLM toward correct output.
 user_prompt = std::format(
@@ -136,9 +154,13 @@ BreweryResult LlamaGenerator::GenerateBrewery(
 }
 // All retry attempts exhausted: log failure and throw exception
-spdlog::error(
-"LlamaGenerator: malformed brewery response after {} attempts: "
-"{}",
-max_attempts, last_error.empty() ? raw : last_error);
+if (logger_) {
+logger_->Log(
+{.level = LogLevel::Error,
+.phase = PipelinePhase::BreweryAndBeerGeneration,
+.message = std::format(
+"LlamaGenerator: malformed brewery response after {} attempts: {}",
+max_attempts, last_error.empty() ? raw : last_error)});
+}
 throw std::runtime_error("LlamaGenerator: malformed brewery response");
 }

@@ -4,15 +4,21 @@
 * retry handling, and output sanitization for downstream parsing.
 */
-#include <spdlog/spdlog.h>
-#include <stdexcept>
+#include <format>
 #include <string>
 #include "data_generation/llama_generator.h"
 #include "data_generation/llama_generator_helpers.h"
+// TODO: Implement locale-aware user profile generation.
+// Current implementation returns a hardcoded test value and ignores the
+// locale parameter. Future implementation should:
+// 1. Load a USER_GENERATION.md prompt template with locale context
+// 2. Perform LLM inference with locale-specific username/bio generation
+// 3. Parse and validate JSON output with retry handling (similar to brewery)
+// 4. Return locale-aware username and biography
 UserResult LlamaGenerator::GenerateUser(const std::string& locale) {
 return {.username = "test_user",
-.bio = "This is a test user profile from " + locale + "."};
+.bio = std::format("This is a test user profile from {}.", locale)};
 }

@@ -16,11 +16,11 @@
 #include "data_generation/llama_generator_helpers.h"
 #include "llama.h"
+namespace {
 /**
 * String trimming: removes leading and trailing whitespace
 */
-static std::string Trim(std::string_view value) {
+std::string Trim(std::string_view value) {
 constexpr std::string_view whitespace = " \t\n\r\f\v";
 const size_t first_index = value.find_first_not_of(whitespace);
 if (first_index == std::string_view::npos) {
@@ -35,7 +35,7 @@ static std::string Trim(std::string_view value) {
 * Normalize whitespace: collapses multiple spaces/tabs/newlines into single
 * spaces
 */
-static std::string CondenseWhitespace(std::string_view text) {
+std::string CondenseWhitespace(std::string_view text) {
 std::string out;
 out.reserve(text.size());
@@ -58,6 +58,41 @@ static std::string CondenseWhitespace(std::string_view text) {
 return out;
 }
+// Guard against truncating in the first half of the string.
+// This preserves the critical opening content and avoids cutting critical
+// context words early in the region description.
+constexpr size_t kTruncationGuardDivisor = 2;
+bool ReadRequiredTrimmedStringField(const boost::json::object& obj,
+std::string_view key, std::string& out,
+std::string* error_out) {
+const boost::json::value* field = obj.if_contains(key);
+if (field == nullptr || !field->is_string()) {
+return false;
+}
+const auto& string_value = field->as_string();
+out = Trim(std::string_view(string_value.data(), string_value.size()));
+return !out.empty();
+}
+bool HasSchemaPlaceholder(const std::array<std::string*, 4>& values) {
+for (const std::string* value : values) {
+std::string lowered = *value;
+std::ranges::transform(lowered, lowered.begin(),
+[](const unsigned char character) {
+return static_cast<char>(std::tolower(character));
+});
+if (lowered == "string") {
+return true;
+}
+}
+return false;
+}
+} // namespace
 /**
 * Truncate region context to fit within max length while preserving word
 * boundaries
@@ -71,7 +106,8 @@ std::string PrepareRegionContext(std::string_view region_context,
 normalized.resize(max_chars);
 const size_t last_space = normalized.find_last_of(' ');
-if (last_space != std::string::npos && last_space > max_chars / 2) {
+if (last_space != std::string::npos &&
+last_space > max_chars / kTruncationGuardDivisor) {
 normalized.resize(last_space);
 }
@@ -115,47 +151,6 @@ void AppendTokenPiece(const llama_vocab* vocab, llama_token token,
 "LlamaGenerator: failed to decode sampled token piece");
 }
-static bool ReadRequiredTrimmedStringField(const boost::json::object& obj,
-std::string_view key,
-std::string& out,
-std::string* error_out) {
-const boost::json::value* field = obj.if_contains(key);
-if (field == nullptr || !field->is_string()) {
-if (error_out != nullptr) {
-*error_out =
-"JSON field '" + std::string(key) + "' is missing or not a string";
-}
-return false;
-}
-const auto& string_value = field->as_string();
-out = Trim(std::string_view(string_value.data(), string_value.size()));
-if (out.empty()) {
-if (error_out != nullptr) {
-*error_out = "JSON field '" + std::string(key) + "' must not be empty";
-}
-return false;
-}
-return true;
-}
-static bool HasSchemaPlaceholder(const std::array<std::string*, 4>& values) {
-for (const std::string* value : values) {
-std::string lowered = *value;
-std::ranges::transform(lowered, lowered.begin(),
-[](unsigned char character) {
-return static_cast<char>(std::tolower(character));
-});
-if (lowered == "string") {
-return true;
-}
-}
-return false;
-}
 std::optional<std::string> ValidateBreweryJson(const std::string& raw,
 BreweryResult& brewery_out) {
 boost::system::error_code error_code;
@@ -203,7 +198,7 @@ std::optional<std::string> ValidateBreweryJson(const std::string& raw,
 return validation_error;
 }
-const std::array<std::string*, 4> schema_placeholders = {
+const std::array schema_placeholders = {
 &brewery_out.name_en, &brewery_out.description_en,
 &brewery_out.name_local, &brewery_out.description_local};
 if (HasSchemaPlaceholder(schema_placeholders)) {
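
The `kTruncationGuardDivisor` change names the rule used when `PrepareRegionContext` shortens text: back up to the last space only if that space falls in the second half of the allowed length. The sketch below isolates just that rule; `TruncateAtWordBoundary` is a hypothetical helper and omits whatever whitespace normalization or suffix handling the real function performs in lines not shown in this diff.

```cpp
#include <cstddef>
#include <string>

// Truncates `text` to at most `max_chars`. When the cut lands mid-word, the
// result is shortened to the last space, but only if that space lies past
// the midpoint of the window, so a long opening word is not discarded.
std::string TruncateAtWordBoundary(std::string text, std::size_t max_chars) {
  constexpr std::size_t kGuardDivisor = 2;  // Mirrors kTruncationGuardDivisor.
  if (text.size() <= max_chars) {
    return text;
  }
  text.resize(max_chars);
  const std::size_t last_space = text.find_last_of(' ');
  if (last_space != std::string::npos &&
      last_space > max_chars / kGuardDivisor) {
    text.resize(last_space);
  }
  return text;
}
```

For example, `TruncateAtWordBoundary("hello brave new world", 13)` cuts to `"hello brave n"` and then backs up to `"hello brave"`, while `TruncateAtWordBoundary("a bcdefghijkl", 8)` keeps `"a bcdefg"` because its only space sits in the first half.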


@@ -5,9 +5,9 @@
* output tokens back to text for system+user chat prompts. * output tokens back to text for system+user chat prompts.
*/ */
#include <spdlog/spdlog.h>
#include <algorithm> #include <algorithm>
#include <chrono>
#include <format>
#include <memory> #include <memory>
#include <stdexcept> #include <stdexcept>
#include <string> #include <string>
@@ -19,6 +19,9 @@
#include "llama.h" #include "llama.h"
static constexpr size_t kPromptTokenSlack = 8; static constexpr size_t kPromptTokenSlack = 8;
// Minimum tokens to keep when using top-p sampling. Ensures at least one
// candidate token remains available even with very restrictive top-p values.
static constexpr size_t kTopPMinKeep = 1;
namespace { namespace {
@@ -62,7 +65,7 @@ SamplerHandle MakeSamplerChain(const llama_vocab* vocab,
"LlamaGenerator: failed to initialize temperature sampler"); "LlamaGenerator: failed to initialize temperature sampler");
add_sampler(llama_sampler_init_top_k(static_cast<int32_t>(config.top_k)), add_sampler(llama_sampler_init_top_k(static_cast<int32_t>(config.top_k)),
"LlamaGenerator: failed to initialize top-k sampler"); "LlamaGenerator: failed to initialize top-k sampler");
add_sampler(llama_sampler_init_top_p(config.top_p, 1), add_sampler(llama_sampler_init_top_p(config.top_p, kTopPMinKeep),
"LlamaGenerator: failed to initialize top-p sampler"); "LlamaGenerator: failed to initialize top-p sampler");
add_sampler(llama_sampler_init_dist(config.seed), add_sampler(llama_sampler_init_dist(config.seed),
"LlamaGenerator: failed to initialize distribution sampler"); "LlamaGenerator: failed to initialize distribution sampler");
@@ -104,7 +107,7 @@ std::string LlamaGenerator::InferFormatted(const std::string& formatted_prompt,
.top_p = sampling_top_p_, .top_p = sampling_top_p_,
.seed = static_cast<uint32_t>(rng_()), .seed = static_cast<uint32_t>(rng_()),
}; };
auto sampler = MakeSamplerChain(vocab, sampler_config, grammar); const auto sampler = MakeSamplerChain(vocab, sampler_config, grammar);
/** /**
* Clear KV cache to ensure clean inference state (no residual context) * Clear KV cache to ensure clean inference state (no residual context)
@@ -168,10 +171,14 @@ std::string LlamaGenerator::InferFormatted(const std::string& formatted_prompt,
*/ */
prompt_tokens.resize(static_cast<size_t>(token_count)); prompt_tokens.resize(static_cast<size_t>(token_count));
if (token_count > prompt_budget) { if (token_count > prompt_budget) {
spdlog::warn( if (logger_) {
"LlamaGenerator: prompt too long ({} tokens), truncating to {} " logger_->Log({.level = LogLevel::Warn,
"tokens to fit n_batch/n_ctx limits", .phase = PipelinePhase::BreweryAndBeerGeneration,
token_count, prompt_budget); .message = std::format(
"LlamaGenerator: prompt too long ({} tokens), "
"truncating to {} tokens to fit n_batch/n_ctx limits",
token_count, prompt_budget)});
}
prompt_tokens.resize(static_cast<size_t>(prompt_budget)); prompt_tokens.resize(static_cast<size_t>(prompt_budget));
token_count = prompt_budget; token_count = prompt_budget;
} }
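
The `kTopPMinKeep` comment in this file says the constant keeps at least one candidate token alive under very restrictive top-p values. A minimal sketch of that idea, assuming a simplified nucleus filter over plain probabilities; `TopPFilter` is a hypothetical helper and not llama.cpp's sampler implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: keeps the smallest prefix of the sorted distribution
// whose cumulative probability reaches top_p, but never fewer than min_keep
// entries.
std::vector<float> TopPFilter(std::vector<float> probs, float top_p,
                              std::size_t min_keep) {
  assert(top_p > 0.0F && top_p <= 1.0F);  // same range the constructor checks
  std::sort(probs.begin(), probs.end(), std::greater<float>());

  float cumulative = 0.0F;
  std::size_t keep = probs.size();
  for (std::size_t i = 0; i < probs.size(); ++i) {
    cumulative += probs[i];
    // Stop only once both the probability mass and the minimum count are met.
    if (cumulative >= top_p && i + 1 >= min_keep) {
      keep = i + 1;
      break;
    }
  }
  probs.resize(keep);
  return probs;
}
```

With `min_keep` set to 1, even a tiny `top_p` still leaves the single most likely candidate available for the final distribution sampler.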


@@ -11,7 +11,7 @@
#include <stdexcept> #include <stdexcept>
#include <string> #include <string>
#include "data_model/application_options.h" #include "data_model/models.h"
#include "llama.h" #include "llama.h"
static constexpr uint32_t kMaxContextSize = 32768U; static constexpr uint32_t kMaxContextSize = 32768U;
@@ -32,9 +32,13 @@ void LlamaGenerator::ContextDeleter::operator()(
LlamaGenerator::LlamaGenerator( LlamaGenerator::LlamaGenerator(
const ApplicationOptions& options, const std::string& model_path, const ApplicationOptions& options, const std::string& model_path,
std::unique_ptr<IPromptFormatter> prompt_formatter) std::shared_ptr<ILogger> logger,
std::unique_ptr<IPromptFormatter> prompt_formatter,
std::unique_ptr<IPromptDirectory> prompt_directory)
: rng_(std::random_device{}()), : rng_(std::random_device{}()),
prompt_formatter_(std::move(prompt_formatter)) { logger_(std::move(logger)),
prompt_formatter_(std::move(prompt_formatter)),
prompt_directory_(std::move(prompt_directory)) {
if (model_path.empty()) { if (model_path.empty()) {
throw std::runtime_error("LlamaGenerator: model path must not be empty"); throw std::runtime_error("LlamaGenerator: model path must not be empty");
} }
@@ -44,41 +48,50 @@ LlamaGenerator::LlamaGenerator(
"LlamaGenerator: prompt formatter dependency must not be null"); "LlamaGenerator: prompt formatter dependency must not be null");
} }
if (options.temperature < 0.0F) { if (!prompt_directory_) {
throw std::runtime_error(
"LlamaGenerator: prompt directory dependency must not be null");
}
const auto sampling = options.generator.sampling.value_or(SamplingOptions{});
if (sampling.temperature < 0.0F) {
throw std::runtime_error( throw std::runtime_error(
"LlamaGenerator: sampling temperature must be >= 0"); "LlamaGenerator: sampling temperature must be >= 0");
} }
if (options.top_p <= 0.0F || options.top_p > 1.0F) { if (sampling.top_p <= 0.0F || sampling.top_p > 1.0F) {
throw std::runtime_error( throw std::runtime_error(
"LlamaGenerator: sampling top-p must be in (0, 1]"); "LlamaGenerator: sampling top-p must be in (0, 1]");
} }
if (options.top_k == 0U) { if (sampling.top_k == 0U) {
throw std::runtime_error("LlamaGenerator: sampling top-k must be > 0"); throw std::runtime_error("LlamaGenerator: sampling top-k must be > 0");
} }
if (options.seed < -1) { if (sampling.seed < -1) {
throw std::runtime_error( throw std::runtime_error(
"LlamaGenerator: seed must be >= 0, or -1 for random"); "LlamaGenerator: seed must be >= 0, or -1 for random");
} }
if (options.n_ctx == 0 || options.n_ctx > kMaxContextSize) { if (sampling.n_ctx == 0 || sampling.n_ctx > kMaxContextSize) {
throw std::runtime_error( throw std::runtime_error(
"LlamaGenerator: context size must be in range [1, 32768]"); "LlamaGenerator: context size must be in range [1, 32768]");
} }
sampling_temperature_ = options.temperature; sampling_temperature_ = sampling.temperature;
sampling_top_p_ = options.top_p; sampling_top_p_ = sampling.top_p;
sampling_top_k_ = options.top_k; sampling_top_k_ = sampling.top_k;
if (options.seed == -1) { if (sampling.seed == -1) {
std::random_device random_device; std::random_device random_device;
rng_.seed(random_device()); rng_.seed(random_device());
} else { } else {
rng_.seed(static_cast<uint32_t>(options.seed)); rng_.seed(static_cast<uint32_t>(sampling.seed));
} }
n_ctx_ = options.n_ctx;
n_ctx_ = sampling.n_ctx;
n_gpu_layers_ = sampling.n_gpu_layers;
this->Load(model_path); this->Load(model_path);
} }
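
The constructor now reads its sampling parameters from an optional sub-struct and falls back to defaults via `value_or`. A rough sketch of that validation flow, assuming a `SamplingOptions` shape that may differ from the real one in `data_model/models.h`; the default values are the ones the removed `ParseArguments` used, and `ValidateSampling` is a hypothetical helper:

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Assumed layout; the project's actual struct may carry more fields.
struct SamplingOptions {
  float temperature = 1.0F;
  float top_p = 0.95F;
  std::uint32_t top_k = 64;
  std::uint32_t n_ctx = 8192;
  int seed = -1;  // -1 requests a random seed
};

constexpr std::uint32_t kMaxContextSize = 32768U;

// Returns the first violated constraint as a message, or nullopt if valid.
std::optional<std::string> ValidateSampling(const SamplingOptions& s) {
  if (s.temperature < 0.0F) return "sampling temperature must be >= 0";
  if (s.top_p <= 0.0F || s.top_p > 1.0F) return "top-p must be in (0, 1]";
  if (s.top_k == 0U) return "top-k must be > 0";
  if (s.seed < -1) return "seed must be >= 0, or -1 for random";
  if (s.n_ctx == 0U || s.n_ctx > kMaxContextSize) {
    return "context size must be in range [1, 32768]";
  }
  return std::nullopt;
}
```

A caller holding a `std::optional<SamplingOptions>` could pass `opts.value_or(SamplingOptions{})`, so an unset value is validated against the defaults rather than skipped.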


@@ -4,23 +4,34 @@
* context, and resets prior resources during model initialization. * context, and resets prior resources during model initialization.
*/ */
#include <spdlog/spdlog.h>
#include <algorithm> #include <algorithm>
#include <chrono>
#include <stdexcept> #include <stdexcept>
#include <string> #include <string>
#include <utility> #include <utility>
#include "data_generation/llama_generator.h" #include "data_generation/llama_generator.h"
#include "ggml-backend.h"
#include "llama.h" #include "llama.h"
// Maximum batch size for decode operations. Capping the batch prevents
// excessive memory allocation while maintaining inference performance.
static constexpr uint32_t kMaxBatchSize = 5000U;
void LlamaGenerator::Load(const std::string& model_path) { void LlamaGenerator::Load(const std::string& model_path) {
context_.reset(); context_.reset();
model_.reset(); model_.reset();
const llama_model_params model_params = llama_model_default_params(); // Specifically load dynamic ggml backends (like CUDA) that are provided
LlamaGenerator::ModelHandle loaded_model( // externally before attempting to load a model.
ggml_backend_load_all();
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = n_gpu_layers_;
ModelHandle loaded_model(
llama_model_load_from_file(model_path.c_str(), model_params)); llama_model_load_from_file(model_path.c_str(), model_params));
if (!loaded_model) { if (!loaded_model) {
throw std::runtime_error( throw std::runtime_error(
"LlamaGenerator: failed to load model from path: " + model_path); "LlamaGenerator: failed to load model from path: " + model_path);
@@ -28,10 +39,11 @@ void LlamaGenerator::Load(const std::string& model_path) {
llama_context_params context_params = llama_context_default_params(); llama_context_params context_params = llama_context_default_params();
context_params.n_ctx = n_ctx_; context_params.n_ctx = n_ctx_;
context_params.n_batch = std::min(n_ctx_, static_cast<uint32_t>(5000)); context_params.n_batch = std::min(n_ctx_, kMaxBatchSize);
LlamaGenerator::ContextHandle loaded_context( ContextHandle loaded_context(
llama_init_from_model(loaded_model.get(), context_params)); llama_init_from_model(loaded_model.get(), context_params));
if (!loaded_context) { if (!loaded_context) {
throw std::runtime_error("LlamaGenerator: failed to create context"); throw std::runtime_error("LlamaGenerator: failed to create context");
} }
@@ -39,5 +51,10 @@ void LlamaGenerator::Load(const std::string& model_path) {
model_ = std::move(loaded_model); model_ = std::move(loaded_model);
context_ = std::move(loaded_context); context_ = std::move(loaded_context);
spdlog::info("[LlamaGenerator] Loaded model: {}", model_path); if (logger_) {
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = std::format("[LlamaGenerator] Loaded model: {}",
model_path)});
}
} }
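
`Load` resets the old handles, builds the new ones in locals, and only moves them into the members once every step has succeeded. A minimal sketch of that pattern, using a stand-in C-style API instead of llama.cpp; every `Fake*` and `fake_*` name here is illustrative only:

```cpp
#include <memory>
#include <stdexcept>
#include <utility>

// Stand-in for a C resource API with explicit load/free calls.
struct FakeModel {
  int id;
};
int g_live_models = 0;  // counts resources that have not been freed yet

FakeModel* fake_model_load(bool succeed) {
  if (!succeed) return nullptr;
  ++g_live_models;
  return new FakeModel{1};
}

void fake_model_free(FakeModel* model) {
  --g_live_models;
  delete model;
}

// Custom deleter so std::unique_ptr calls the C free function.
struct FakeModelDeleter {
  void operator()(FakeModel* model) const { fake_model_free(model); }
};
using FakeModelHandle = std::unique_ptr<FakeModel, FakeModelDeleter>;

class Holder {
 public:
  void Load(bool succeed) {
    model_.reset();  // release the previous resource first
    FakeModelHandle loaded(fake_model_load(succeed));
    if (!loaded) {
      throw std::runtime_error("failed to load model");
    }
    model_ = std::move(loaded);  // commit only after success
  }
  bool loaded() const { return static_cast<bool>(model_); }

 private:
  FakeModelHandle model_;
};
```

Because the handle owns the raw pointer from the moment it is created, an exception thrown between load and commit cannot leak the resource.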


@@ -1,55 +0,0 @@
/**
* @file data_generation/llama/load_brewery_prompt.cc
* @brief Resolves brewery system prompt content from cache or a configured
* filesystem path and provides a robust inline fallback prompt when absent.
*/
#include <spdlog/spdlog.h>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include "data_generation/llama_generator.h"
/**
* @brief Loads brewery system prompt from disk or cache.
*
* @param prompt_file_path Preferred prompt file location.
* @return Prompt text loaded from disk.
*/
std::string LlamaGenerator::LoadBrewerySystemPrompt(
const std::filesystem::path& prompt_file_path) {
// Return cached version if already loaded
if (!brewery_system_prompt_.empty()) {
return brewery_system_prompt_;
}
std::ifstream prompt_file(prompt_file_path);
if (!prompt_file.is_open()) {
spdlog::error(
"LlamaGenerator: Failed to open brewery system prompt file '{}'",
prompt_file_path.string());
throw std::runtime_error(
"LlamaGenerator: missing brewery system prompt file: " +
prompt_file_path.string());
}
const std::string prompt((std::istreambuf_iterator(prompt_file)),
std::istreambuf_iterator<char>());
prompt_file.close();
if (prompt.empty()) {
spdlog::error("LlamaGenerator: Brewery system prompt file '{}' is empty",
prompt_file_path.string());
throw std::runtime_error(
"LlamaGenerator: empty brewery system prompt file: " +
prompt_file_path.string());
}
spdlog::info(
"LlamaGenerator: Loaded brewery system prompt from '{}' ({} chars)",
prompt_file_path.string(), prompt.length());
brewery_system_prompt_ = prompt;
return brewery_system_prompt_;
}


@@ -17,9 +17,9 @@ BreweryResult MockGenerator::GenerateBrewery(
const std::string_view adjective = const std::string_view adjective =
kBreweryAdjectives.at(hash % kBreweryAdjectives.size()); kBreweryAdjectives.at(hash % kBreweryAdjectives.size());
const std::string_view noun = const std::string_view noun =
kBreweryNouns.at(hash / 7 % kBreweryNouns.size()); kBreweryNouns.at(hash / kNounHashStride % kBreweryNouns.size());
const std::string_view base_description = const std::string_view base_description = kBreweryDescriptions.at(
kBreweryDescriptions.at((hash / 13) % kBreweryDescriptions.size()); (hash / kDescriptionHashStride) % kBreweryDescriptions.size());
const std::string name = const std::string name =
std::format("{} {} {}", location.city, adjective, noun); std::format("{} {} {}", location.city, adjective, noun);


@@ -15,7 +15,7 @@ UserResult MockGenerator::GenerateUser(const std::string& locale) {
UserResult result; UserResult result;
const std::string_view username = kUsernames[hash % kUsernames.size()]; const std::string_view username = kUsernames[hash % kUsernames.size()];
const std::string_view bio = kBios[hash / 11 % kBios.size()]; const std::string_view bio = kBios[hash / kBioHashStride % kBios.size()];
result.username = username; result.username = username;
result.bio = bio; result.bio = bio;
return result; return result;
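
The mock generators derive several independent-looking picks from one hash by dividing it by a different stride before each modulo. A small sketch of the idea; the stride value 7 comes from the literal this change replaces, while the word lists and `PickName` are placeholders rather than the project's tables:

```cpp
#include <array>
#include <cstddef>
#include <string_view>

constexpr std::size_t kNounHashStride = 7;  // was the literal 7 before

// Placeholder word lists for illustration only.
constexpr std::array<std::string_view, 3> kAdjectives = {"Golden", "Iron",
                                                         "Old"};
constexpr std::array<std::string_view, 2> kNouns = {"Anchor", "Barrel"};

struct Pick {
  std::string_view adjective;
  std::string_view noun;
};

// Same hash in, same names out: the adjective uses the low part of the hash,
// the noun uses the hash divided by the stride, so the two picks do not move
// in lockstep.
Pick PickName(std::size_t hash) {
  return {kAdjectives[hash % kAdjectives.size()],
          kNouns[hash / kNounHashStride % kNouns.size()]};
}
```

Naming the strides does not change the output for a given hash; it only documents why the divisors differ between fields.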


@@ -6,7 +6,9 @@
#include "json_handling/json_loader.h" #include "json_handling/json_loader.h"
#include <spdlog/spdlog.h> #include <format>
#include "services/logging/logger.h"
#include <iostream>
#include <boost/json.hpp> #include <boost/json.hpp>
#include <fstream> #include <fstream>
@@ -19,8 +21,8 @@ static std::string ReadRequiredString(const boost::json::object& object,
const char* key) { const char* key) {
const boost::json::value* value = object.if_contains(key); const boost::json::value* value = object.if_contains(key);
if (value == nullptr || !value->is_string()) { if (value == nullptr || !value->is_string()) {
throw std::runtime_error(std::string("Missing or invalid string field: ") + throw std::runtime_error(
key); std::format("Missing or invalid string field: {}", key));
} }
const std::string_view text = value->as_string(); const std::string_view text = value->as_string();
return std::string(text); return std::string(text);
@@ -30,8 +32,8 @@ static double ReadRequiredNumber(const boost::json::object& object,
const char* key) { const char* key) {
const boost::json::value* value = object.if_contains(key); const boost::json::value* value = object.if_contains(key);
if (value == nullptr || !value->is_number()) { if (value == nullptr || !value->is_number()) {
throw std::runtime_error(std::string("Missing or invalid numeric field: ") + throw std::runtime_error(
key); std::format("Missing or invalid numeric field: {}", key));
} }
return value->to_number<double>(); return value->to_number<double>();
} }
@@ -41,7 +43,7 @@ static std::vector<std::string> ReadRequiredStringArray(
const boost::json::value* value = object.if_contains(key); const boost::json::value* value = object.if_contains(key);
if (value == nullptr || !value->is_array()) { if (value == nullptr || !value->is_array()) {
throw std::runtime_error( throw std::runtime_error(
std::string("Missing or invalid string array field: ") + key); std::format("Missing or invalid string array field: {}", key));
} }
const auto& array = value->as_array(); const auto& array = value->as_array();
@@ -50,7 +52,7 @@ static std::vector<std::string> ReadRequiredStringArray(
for (const auto& item : array) { for (const auto& item : array) {
if (!item.is_string()) { if (!item.is_string()) {
throw std::runtime_error( throw std::runtime_error(
std::string("Missing or invalid string array field: ") + key); std::format("Missing or invalid string array field: {}", key));
} }
items.emplace_back(item.as_string()); items.emplace_back(item.as_string());
} }
@@ -58,7 +60,7 @@ static std::vector<std::string> ReadRequiredStringArray(
} }
std::vector<Location> JsonLoader::LoadLocations( std::vector<Location> JsonLoader::LoadLocations(
const std::filesystem::path& filepath) { const std::filesystem::path& filepath, std::shared_ptr<ILogger> logger) {
std::ifstream input(filepath); std::ifstream input(filepath);
if (!input.is_open()) { if (!input.is_open()) {
throw std::runtime_error("Failed to open locations file: " + throw std::runtime_error("Failed to open locations file: " +
@@ -104,7 +106,5 @@ std::vector<Location> JsonLoader::LoadLocations(
}); });
} }
spdlog::info("[JsonLoader] Loaded {} locations from {}", locations.size(),
filepath.string());
return locations; return locations;
} }


@@ -4,195 +4,215 @@
* initializes shared infrastructure, and executes the pipeline entry flow. * initializes shared infrastructure, and executes the pipeline entry flow.
*/ */
#include <spdlog/fmt/fmt.h>
#include <spdlog/spdlog.h> #include <spdlog/spdlog.h>
#include <boost/di.hpp> #include <boost/di.hpp>
#include <boost/program_options.hpp> #include <boost/program_options.hpp>
#include <chrono> #include <chrono>
#include <exception> #include <exception>
#include <format>
#include <iostream>
#include <memory> #include <memory>
#include <optional> #include <optional>
#include <sstream>
#include <string> #include <string>
#include <thread>
#include "biergarten_data_generator.h" #include "biergarten_pipeline_orchestrator.h"
#include "concurrency/bounded_channel.h"
#include "data_generation/llama_generator.h" #include "data_generation/llama_generator.h"
#include "data_generation/mock_generator.h" #include "data_generation/mock_generator.h"
#include "data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.h" #include "data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.h"
#include "data_model/application_options.h" #include "data_model/models.h"
#include "llama_backend_state.h" #include "llama_backend_state.h"
#include "services/enrichment_service.h" #include "services/database/export_service.h"
#include "services/export_service.h" #include "services/database/sqlite_export_service.h"
#include "services/sqlite_export_service.h" #include "services/datetime/timer.h"
#include "services/wikipedia_service.h" #include "services/enrichment/enrichment_service.h"
#include "web_client/curl_web_client.h" #include "services/enrichment/mock_enrichment.h"
#include "services/enrichment/wikipedia_service.h"
#include "services/logging/log_dispatcher.h"
#include "services/logging/log_entry.h"
#include "services/logging/log_producer.h"
#include "services/logging/logger.h"
#include "services/prompting/prompt_directory.h"
#include "web_client/http_web_client.h"
namespace prog_opts = boost::program_options;
namespace di = boost::di; namespace di = boost::di;
/** static constexpr size_t kLogMaxCount = 512;
* @brief Parse command-line arguments into ApplicationOptions.
*
* @param argc Command-line argument count.
* @param argv Command-line arguments.
* @return Parsed ApplicationOptions if parsing succeeded, std::nullopt
* otherwise.
*/
std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv) {
prog_opts::options_description desc("Pipeline Options");
auto opt = desc.add_options();
opt("help,h", "Produce help message");
opt("mocked", prog_opts::bool_switch(),
"Use mocked generator for brewery/user data");
opt("model,m", prog_opts::value<std::string>()->default_value(""),
"Path to LLM model (gguf)");
opt("temperature", prog_opts::value<float>()->default_value(1.0F),
"Sampling temperature (higher = more random)");
opt("top-p", prog_opts::value<float>()->default_value(0.95F),
"Nucleus sampling top-p in (0,1] (higher = more random)");
opt("top-k", prog_opts::value<uint32_t>()->default_value(64),
"Top-k sampling parameter (higher = more candidate tokens)");
opt("n-ctx", prog_opts::value<uint32_t>()->default_value(8192),
"Context window size in tokens (1-32768)");
opt("seed", prog_opts::value<int>()->default_value(-1),
"Sampler seed: -1 for random, otherwise non-negative integer");
// Handle the "no arguments" or "help" case
if (argc == 1) {
spdlog::info("Biergarten Pipeline");
std::stringstream usage_stream;
usage_stream << "\nUsage: biergarten-pipeline [options]\n\n" << desc;
spdlog::info(usage_stream.str());
return std::nullopt;
}
try {
prog_opts::variables_map variables_map;
prog_opts::store(prog_opts::parse_command_line(argc, argv, desc),
variables_map);
prog_opts::notify(variables_map);
if (variables_map.contains("help")) {
std::stringstream help_stream;
help_stream << "\n" << desc;
spdlog::info(help_stream.str());
return std::nullopt;
}
const auto use_mocked = variables_map["mocked"].as<bool>();
const auto model_path = variables_map["model"].as<std::string>();
if (use_mocked && !model_path.empty()) {
spdlog::error(
"Invalid arguments: --mocked and --model are mutually exclusive");
return std::nullopt;
}
if (!use_mocked && model_path.empty()) {
spdlog::error(
"Invalid arguments: Either --mocked or --model must be specified");
return std::nullopt;
}
const bool has_llm_params = !variables_map["temperature"].defaulted() ||
!variables_map["top-p"].defaulted() ||
!variables_map["top-k"].defaulted() ||
!variables_map["seed"].defaulted();
if (use_mocked && has_llm_params) {
spdlog::warn(
"Sampling parameters (--temperature, --top-p, --top-k, --seed) are"
" ignored when using --mocked");
}
ApplicationOptions options;
options.use_mocked = use_mocked;
options.model_path = model_path;
options.temperature = variables_map["temperature"].as<float>();
options.top_p = variables_map["top-p"].as<float>();
options.top_k = variables_map["top-k"].as<uint32_t>();
options.n_ctx = variables_map["n-ctx"].as<uint32_t>();
options.seed = variables_map["seed"].as<int>();
return options;
} catch (const std::exception& exception) {
spdlog::error("Failed to parse command-line arguments: {}",
exception.what());
return std::nullopt;
} catch (...) {
spdlog::error("Failed to parse command-line arguments: unknown error");
return std::nullopt;
}
}
struct Timer {
std::chrono::steady_clock::time_point start_time =
std::chrono::steady_clock::now();
[[nodiscard]] int64_t Elapsed() const {
return std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::steady_clock::now() - start_time)
.count();
}
};
int main(const int argc, char** argv) { int main(const int argc, char** argv) {
spdlog::set_level(spdlog::level::debug);
spdlog::set_pattern("│ %Y-%m-%d %H:%M:%S.%e │ %^%-7l%$ │ %v");
BoundedChannel<LogEntry> log_channel(kLogMaxCount);
auto log_dispatcher = //
std::make_unique<LogDispatcher>(log_channel);
std::shared_ptr<ILogger> log_producer =
std::make_shared<LogProducer>(log_channel);
std::thread log_thread([&log_dispatcher] { log_dispatcher->Run(); });
auto shutdown = [&](const int exit_code) {
log_channel.Close();
log_thread.join();
return exit_code;
};
try { try {
Timer timer; Timer timer;
const CurlGlobalState curl_state;
const LlamaBackendState llama_backend_state;
spdlog::set_pattern("[%Y-%m-%d %H:%M:%S.%e] [%^%l%$] %v");
const auto parsed_options = ParseArguments(argc, argv); #ifndef BIERGARTEN_MOCK_ONLY
const LlamaBackendState llama_backend_state;
#endif
log_producer->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "STARTING PIPELINE"});
const std::optional<ApplicationOptions> parsed_options =
ParseArguments(argc, argv, log_producer);
if (!parsed_options.has_value()) { if (!parsed_options.has_value()) {
return 0; return shutdown(EXIT_FAILURE);
} }
const auto options = *parsed_options; const auto options = *parsed_options;
const std::string model_path = options.generator.model_path.string();
const auto sampling =
options.generator.sampling.value_or(SamplingOptions{});
std::unique_ptr<IPromptDirectory> prompt_directory;
if (!options.generator.use_mocked) {
try {
prompt_directory = std::make_unique<PromptDirectory>(
options.pipeline.prompt_dir, log_producer);
} catch (const std::exception& dir_error) {
log_producer->Log({.level = LogLevel::Error,
.phase = PipelinePhase::Startup,
.message = std::format("Invalid --prompt-dir: {}",
dir_error.what())});
return shutdown(EXIT_FAILURE);
}
}
const auto injector = di::make_injector( const auto injector = di::make_injector(
di::bind<WebClient>().to<CURLWebClient>(), di::bind<ILogger>().to(log_producer),
di::bind<ApplicationOptions>().to(options), di::bind<ApplicationOptions>().to(options),
di::bind<IEnrichmentService>().to<WikipediaService>(), di::bind<std::string>().to(model_path),
di::bind<IExportService>().to<SqliteExportService>(), di::bind<IExportService>().to<SqliteExportService>(),
di::bind<IPromptFormatter>().to<Gemma4JinjaPromptFormatter>(), di::bind<IPromptFormatter>().to([options, log_producer] {
di::bind<std::string>().to(options.model_path), if (options.generator.use_mocked) {
{
log_producer->Log(
{.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "Prompt formatter: none (mock mode)"});
}
return std::unique_ptr<IPromptFormatter>(nullptr);
}
{
log_producer->Log(
{.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "Prompt formatter: Gemma4JinjaPromptFormatter"});
}
return std::unique_ptr<IPromptFormatter>(
std::make_unique<Gemma4JinjaPromptFormatter>());
}),
di::bind<WebClient>().to([options, log_producer] {
if (options.generator.use_mocked) {
{
log_producer->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "Web client: none (mock mode)"});
}
return std::unique_ptr<WebClient>(nullptr);
}
{
log_producer->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "Web client: HttpWebClient"});
}
return std::unique_ptr<WebClient>(
std::make_unique<HttpWebClient>(log_producer));
}),
di::bind<IEnrichmentService>().to(
[options, &log_producer](
const auto& inj) -> std::unique_ptr<IEnrichmentService> {
if (options.generator.use_mocked) {
{
log_producer->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "Enrichment: mock"});
}
return std::make_unique<MockEnrichmentService>();
}
{
log_producer->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "Enrichment: Wikipedia"});
}
return std::make_unique<WikipediaEnrichmentService>(
inj.template create<std::unique_ptr<WebClient>>(),
log_producer);
}),
di::bind<DataGenerator>().to( di::bind<DataGenerator>().to(
[options](const auto& inj) -> std::unique_ptr<DataGenerator> { [&options, &model_path, &sampling, &prompt_directory,
if (options.use_mocked) { &log_producer](const auto& inj) -> std::unique_ptr<DataGenerator> {
spdlog::info( if (options.generator.use_mocked) {
"[Generator] Using MockGenerator (no model path provided)"); {
log_producer->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = "Generator: mock"});
}
return std::make_unique<MockGenerator>(); return std::make_unique<MockGenerator>();
} }
{
spdlog::info( log_producer->Log(
"[Generator] Using LlamaGenerator: {} (temperature={}, " {.level = LogLevel::Info,
"top-p={}, top-k={}, n_ctx={}, seed={})", .phase = PipelinePhase::Startup,
options.model_path, options.temperature, options.top_p, .message = std::format(
options.top_k, options.n_ctx, options.seed); "Generator: LlamaGenerator | model={} | temp={:.2f} "
return inj.template create<std::unique_ptr<LlamaGenerator>>(); "top_p={:.2f} top_k={} n_ctx={} seed={}",
model_path, sampling.temperature, sampling.top_p,
sampling.top_k, sampling.n_ctx, sampling.seed)});
}
return std::make_unique<LlamaGenerator>(
options, model_path, log_producer,
inj.template create<std::unique_ptr<IPromptFormatter>>(),
std::move(prompt_directory));
})); }));
auto generator = const auto orchestrator =
injector.create<std::unique_ptr<BiergartenDataGenerator>>(); injector.create<std::unique_ptr<BiergartenPipelineOrchestrator>>();
if (!generator->Run()) { if (!orchestrator->Run()) {
spdlog::error("Pipeline execution failed"); log_producer->Log({.level = LogLevel::Error,
return 1; .phase = PipelinePhase::Teardown,
.message = "Pipeline execution failed"});
return shutdown(EXIT_FAILURE);
} }
spdlog::info("Pipeline executed successfully in {} ms", timer.Elapsed()); log_producer->Log({.level = LogLevel::Info,
return 0; .phase = PipelinePhase::Teardown,
.message = std::format("Pipeline complete in {} ms",
timer.Elapsed())});
return shutdown(EXIT_SUCCESS);
} catch (const std::exception& exception) { } catch (const std::exception& exception) {
spdlog::critical("Unhandled fatal error in main: {}", exception.what()); const LogDTO log_entry{.level = LogLevel::Error,
return 1; .phase = PipelinePhase::Teardown,
.message = exception.what()};
if (log_producer) {
log_producer->Log(log_entry);
} else {
std::cerr << log_entry.message << std::endl;
}
return shutdown(EXIT_FAILURE);
} }
} }
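
`main` now pushes log entries into a bounded channel that a dedicated thread drains, and the `shutdown` lambda closes the channel before joining that thread. The project's `BoundedChannel` is not shown in this diff, so the following is only a sketch of how such a channel is commonly built; `SketchChannel` and its method semantics are assumptions:

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

template <typename T>
class SketchChannel {
 public:
  explicit SketchChannel(std::size_t capacity) : capacity_(capacity) {}

  // Blocks while the channel is full. Returns false once it is closed.
  bool Push(T item) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock,
                   [this] { return closed_ || queue_.size() < capacity_; });
    if (closed_) return false;
    queue_.push_back(std::move(item));
    not_empty_.notify_one();
    return true;
  }

  // Blocks while the channel is empty. Returns nullopt once the channel is
  // closed and every queued item has been handed out.
  std::optional<T> Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return closed_ || !queue_.empty(); });
    if (queue_.empty()) return std::nullopt;
    T item = std::move(queue_.front());
    queue_.pop_front();
    not_full_.notify_one();
    return item;
  }

  // Wakes all waiters; queued items can still be drained afterwards.
  void Close() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      closed_ = true;
    }
    not_full_.notify_all();
    not_empty_.notify_all();
  }

 private:
  const std::size_t capacity_;
  std::mutex mutex_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
  std::deque<T> queue_;
  bool closed_ = false;
};
```

A consumer thread would typically loop on `while (auto entry = channel.Pop()) { ... }`, which ends on its own after `Close()`. That is why closing before `join()` avoids a hang at shutdown.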


@@ -0,0 +1,160 @@
/**
* @file wikipedia/fetch_extract.cc
*/
#include <boost/json.hpp>
#include <chrono>
#include <format>
#include <string>
#include <string_view>
#include <thread>
#include "services/enrichment/wikipedia_service.h"
using namespace boost;
std::string WikipediaEnrichmentService::FetchExtract(std::string_view query) {
const std::string cache_key(query);
// 1. Cache Lookup
if (const auto cache_it = this->extract_cache_.find(cache_key);
cache_it != this->extract_cache_.end()) {
if (logger_) {
logger_->Log({.level = LogLevel::Debug,
.phase = PipelinePhase::UserGeneration,
.message = std::format("Wikipedia: Cache hit for {}!", cache_key)});
}
return cache_it->second;
}
const std::string encoded = this->client_->EncodeURL(cache_key);
const std::string url = std::format(
"https://en.wikipedia.org/w/"
"api.php?action=query&titles={}&prop=extracts&explaintext=1&format=json",
encoded);
const std::string body = this->client_->Get(url);
{
using namespace std::literals::chrono_literals;
std::this_thread::sleep_for(1s);
}
// 2. Parse JSON
system::error_code ec;
json::value doc = json::parse(body, ec);
if (ec) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message = std::format("WikipediaService: JSON parse error for '{}': {}",
std::string(query), ec.message())});
}
return {};
}
// 3. Safe Extraction
const json::object* obj = doc.if_object();
if (obj == nullptr) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message =
std::format("WikipediaService: Expected root object for '{}'",
std::string(query))});
}
return {};
}
const json::value* query_ptr = obj->if_contains("query");
const json::value* pages_ptr =
((query_ptr != nullptr) && query_ptr->is_object())
? query_ptr->get_object().if_contains("pages")
: nullptr;
if ((pages_ptr == nullptr) || !pages_ptr->is_object()) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message =
std::format("WikipediaService: Missing query.pages for '{}'",
std::string(query))});
}
return {};
}
const json::object& pages = pages_ptr->get_object();
if (pages.empty()) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message = std::format("WikipediaService: No pages returned for '{}'",
std::string(query))});
}
this->extract_cache_.emplace(cache_key, "");
return {};
}
// Wikipedia returns the page under a dynamic ID key; we just want the first
// one
const json::value& page_val = pages.begin()->value();
if (!page_val.is_object()) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message =
std::format("WikipediaService: Unexpected page format for '{}'",
std::string(query))});
}
return {};
}
const json::object& page = page_val.get_object();
// Handle 404/Missing status
if (page.contains("missing")) {
if (logger_) {
logger_->Log({.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message = std::format("WikipediaService: Page '{}' does not exist",
std::string(query))});
}
this->extract_cache_.emplace(cache_key, "");
return {};
}
const json::value* extract_ptr = page.if_contains("extract");
if ((extract_ptr == nullptr) || !extract_ptr->is_string()) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message =
std::format("WikipediaService: No extract string found for '{}'",
std::string(query))});
}
this->extract_cache_.emplace(cache_key, "");
return {};
}
// 4. Success
std::string extract(extract_ptr->as_string());
if (logger_) {
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::UserGeneration,
.message = std::format("WikipediaService: Fetched {} chars for '{}'",
extract.size(), std::string(query))});
}
this->extract_cache_.insert_or_assign(cache_key, extract);
return extract;
}
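
`FetchExtract` checks an in-memory cache before it issues an HTTP request and stores the result afterwards. A simplified sketch of that lookup-then-fetch flow, with the network call replaced by an injected function; `CachedFetcher` is hypothetical, and unlike the code above it caches every result, including empty ones:

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

class CachedFetcher {
 public:
  using FetchFn = std::function<std::string(const std::string&)>;

  explicit CachedFetcher(FetchFn fetch) : fetch_(std::move(fetch)) {}

  std::string Get(const std::string& key) {
    // 1. Cache lookup: a hit skips the (slow, rate-limited) fetch entirely.
    if (const auto it = cache_.find(key); it != cache_.end()) {
      return it->second;
    }
    // 2. Miss: fetch once, remember the answer, then return it.
    std::string value = fetch_(key);
    cache_.insert_or_assign(key, value);
    return value;
  }

 private:
  FetchFn fetch_;
  std::unordered_map<std::string, std::string> cache_;
};
```

Caching empty results as well means a title with no page is requested only once, which matters when each request is followed by a deliberate delay.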


@@ -0,0 +1,70 @@
/**
* @file wikipedia/get_summary.cc
* @brief WikipediaEnrichmentService::GetLocationContext() implementation.
*/
#include <chrono>
#include <format>
#include <string>
#include <thread>
#include "services/enrichment/wikipedia_service.h"
std::string WikipediaEnrichmentService::GetLocationContext(
const Location& loc) {
using namespace std::literals::chrono_literals;
if (!this->client_) {
if (logger_) {
logger_->Log({.level = LogLevel::Warn,
.phase = PipelinePhase::UserGeneration,
.message = "Wikipedia client is nullptr."});
}
return {};
}
std::string result;
// std::string region_query(loc.city);
// if (!loc.country.empty()) {
// region_query += loc.state_province,
// region_query += ", ";
// region_query += loc.country;
// }
constexpr std::string_view brewing_query = "brewing";
const std::string location_query =
std::format("{}, {}", loc.city, loc.iso3166_2);
const std::string beer_query = std::format("beer in {}", loc.country);
auto append_extract = [&result](const std::string& extract) -> void {
if (extract.empty()) {
return;
}
if (!result.empty()) {
result += "\n\n";
}
result += extract;
};
try {
append_extract(FetchExtract(brewing_query));
append_extract(FetchExtract(beer_query));
if (logger_) {
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::UserGeneration,
.message = std::format("Done fetching for {}. Sleeping for 10 seconds.",
location_query)});
}
std::this_thread::sleep_for(10s);
} catch (const std::runtime_error& e) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Debug,
.phase = PipelinePhase::UserGeneration,
.message = std::format("WikipediaService lookup failed for '{}': {}",
location_query, e.what())});
}
}
return result;
}
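
Review note: the `append_extract` lambda above skips empty extracts and separates the rest with a blank line. A small sketch of the same joining rule as a free function, where `JoinExtracts` is a hypothetical name used only for illustration:

```cpp
#include <string>
#include <vector>

// Hypothetical free-function version of the append_extract lambda: joins the
// non-empty extracts with a blank line between them.
std::string JoinExtracts(const std::vector<std::string>& extracts) {
  std::string result;
  for (const std::string& extract : extracts) {
    if (extract.empty()) {
      continue;  // Cached misses come back as "" and contribute nothing.
    }
    if (!result.empty()) {
      result += "\n\n";
    }
    result += extract;
  }
  return result;
}
```

Because misses contribute nothing, a location where every lookup fails yields an empty context string rather than a string of separators.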

@@ -0,0 +1,12 @@
/**
* @file services/wikipedia/wikipedia_service.cc
* @brief WikipediaEnrichmentService constructor implementation.
*/
#include "services/enrichment/wikipedia_service.h"
#include <utility>
WikipediaEnrichmentService::WikipediaEnrichmentService(
std::unique_ptr<WebClient> client, std::shared_ptr<ILogger> logger)
: client_(std::move(client)), logger_(std::move(logger)) {}

@@ -0,0 +1,74 @@
/**
* @brief LogDispatcher implementation for asynchronous pipeline logging.
*
* LogDispatcher drains LogEntry items from a BoundedChannel and forwards them
* to spdlog for final output.
*/
#include "services/logging/log_dispatcher.h"
#include <spdlog/spdlog.h>
#include <string>
#include <string_view>
#include <thread>
#include "concurrency/bounded_channel.h"
#include "services/logging/log_entry.h"
namespace {
[[nodiscard]] constexpr std::string_view PipelinePhaseToString(
PipelinePhase phase) {
switch (phase) {
case PipelinePhase::Startup:
return "Startup";
case PipelinePhase::UserGeneration:
return "User Generation";
case PipelinePhase::BreweryAndBeerGeneration:
return "Brewery & Beer Gen";
case PipelinePhase::CheckinGeneration:
return "Checkin Gen";
case PipelinePhase::RatingGeneration:
return "Rating Gen";
case PipelinePhase::FollowGeneration:
return "Follow Gen";
case PipelinePhase::Teardown:
return "Teardown";
}
return "Unknown";
}
} // namespace
LogDispatcher::LogDispatcher(BoundedChannel<LogEntry>& channel)
: channel_(channel) {}
void LogDispatcher::Run() {
auto logger = spdlog::default_logger();
while (true) {
auto entry = channel_.Receive();
if (!entry.has_value()) {
// Channel is closed and drained.
break;
}
const auto& log = entry.value();
logger->log(ToSpdlogLevel(log.level),
"{:<20} │ thread: {:016x} │ [{}:{}] │ {}",
PipelinePhaseToString(log.phase),
std::hash<std::thread::id>{}(log.thread_id),
log.origin.file_name(), log.origin.line(), log.message);
}
}
spdlog::level::level_enum LogDispatcher::ToSpdlogLevel(LogLevel level) {
switch (level) {
case LogLevel::Debug:
return spdlog::level::debug;
case LogLevel::Info:
return spdlog::level::info;
case LogLevel::Warn:
return spdlog::level::warn;
case LogLevel::Error:
return spdlog::level::err;
}
return spdlog::level::info;
}
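
Review note: `LogDispatcher::Run()` relies on `Receive()` returning an empty optional once the channel is closed and drained. `BoundedChannel` itself is not shown in this part of the diff, so the following is only a sketch of one way such a channel could behave, under the assumption of a mutex plus two condition variables. `SketchChannel` and its `Close()` method are hypothetical names:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <utility>

// Hypothetical stand-in for BoundedChannel<T>; the real class may differ.
template <typename T>
class SketchChannel {
 public:
  explicit SketchChannel(std::size_t capacity) : capacity_(capacity) {}

  // Blocks while the queue is full. Returns false if the channel is closed.
  bool Send(T item) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock,
                   [this] { return closed_ || queue_.size() < capacity_; });
    if (closed_) {
      return false;
    }
    queue_.push(std::move(item));
    not_empty_.notify_one();
    return true;
  }

  // Blocks while the queue is empty. Returns std::nullopt once the channel is
  // closed and fully drained, which is the consumer's exit condition.
  std::optional<T> Receive() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return closed_ || !queue_.empty(); });
    if (queue_.empty()) {
      return std::nullopt;
    }
    T item = std::move(queue_.front());
    queue_.pop();
    not_full_.notify_one();
    return item;
  }

  // Wakes every blocked sender and receiver; queued items remain receivable.
  void Close() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      closed_ = true;
    }
    not_full_.notify_all();
    not_empty_.notify_all();
  }

 private:
  std::size_t capacity_;
  std::mutex mutex_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
  std::queue<T> queue_;
  bool closed_ = false;
};
```

With this shape, entries queued before `Close()` are still delivered, so a dispatcher loop like the one above would flush pending log lines before exiting.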

@@ -0,0 +1,19 @@
/**
* @file src/services/logging/log_producer.cc
* @brief LogProducer implementation for asynchronous pipeline logging.
*/
#include "services/logging/log_producer.h"
#include <chrono>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include "concurrency/bounded_channel.h"
#include "services/logging/log_entry.h"
LogProducer::LogProducer(BoundedChannel<LogEntry>& channel)
: channel_(channel) {}
void LogProducer::DoLog(LogEntry entry) { channel_.Send(std::move(entry)); }

@@ -0,0 +1,100 @@
/**
* @file services/prompt_directory.cc
* @brief PromptDirectory implementation: validates the directory at
* construction and loads named prompt files on demand with in-process caching.
*/
#include "services/prompting/prompt_directory.h"
#include <chrono>
#include <filesystem>
#include <format>
#include <fstream>
#include <iterator>
#include <memory>
#include <stdexcept>
#include <string>
#include <string_view>
#include <system_error>
#include <utility>
// ---------------------------------------------------------------------------
// PromptDirectory
// ---------------------------------------------------------------------------
PromptDirectory::PromptDirectory(const std::filesystem::path& prompt_dir)
: PromptDirectory(prompt_dir, nullptr) {}
PromptDirectory::PromptDirectory(const std::filesystem::path& prompt_dir,
std::shared_ptr<ILogger> logger)
: prompt_dir_(prompt_dir), logger_(std::move(logger)) {
std::error_code ec;
// Scenario 4: directory must exist.
if (!std::filesystem::exists(prompt_dir_, ec) || ec) {
throw std::runtime_error(
"PromptDirectory: prompt directory does not exist: " +
prompt_dir_.string());
}
// Scenario 4: path must be a directory, not a file.
if (!std::filesystem::is_directory(prompt_dir_, ec) || ec) {
throw std::runtime_error(
"PromptDirectory: prompt directory path is not a directory: " +
prompt_dir_.string());
}
// Scenario 4: directory must be readable (probe with directory_iterator).
std::filesystem::directory_iterator probe(prompt_dir_, ec);
if (ec) {
throw std::runtime_error(
std::format("PromptDirectory: prompt directory is not readable: {} ({})",
prompt_dir_.string(), ec.message()));
}
if (logger_) {
logger_->Log(
{.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message =
std::string("[PromptDirectory] Resolved prompt directory: ") +
prompt_dir_.string()});
}
}
std::string PromptDirectory::Load(std::string_view key) {
const std::string key_str(key);
// Return cached content if already loaded during this run.
const auto cache_it = cache_.find(key_str);
if (cache_it != cache_.end()) {
return cache_it->second;
}
// Scenario 3: resolve <prompt_dir>/<key>.md and require it to exist.
const std::filesystem::path file_path =
prompt_dir_ / std::filesystem::path(std::format("{}.md", key_str));
std::ifstream file(file_path);
if (!file.is_open()) {
throw std::runtime_error(
std::format("PromptDirectory: prompt file not found for key '{}': {}",
key_str, file_path.string()));
}
std::string content((std::istreambuf_iterator<char>(file)),
std::istreambuf_iterator<char>());
file.close();
if (content.empty()) {
throw std::runtime_error(std::format("PromptDirectory: prompt file for key '{}' is empty: {}",
key_str, file_path.string()));
}
if (logger_) {
logger_->Log({.level = LogLevel::Info,
.phase = PipelinePhase::Startup,
.message = std::format("[PromptDirectory] Loaded prompt '{}' from '{}' ({} chars)",
key_str, file_path.string(), content.size())});
}
cache_.emplace(key_str, content);
return content;
}
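
Review note: `Load()` above resolves `<prompt_dir>/<key>.md`, rejects missing or empty files, and memoizes the content for the rest of the run. A compact sketch of the same flow as a free function, assuming a hypothetical `LoadPrompt` helper with the cache passed in explicitly and plain string concatenation in place of `std::format`:

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical free function mirroring Load(): resolves <dir>/<key>.md, reads
// it in full, rejects missing or empty files, and memoizes the content.
std::string LoadPrompt(const std::filesystem::path& dir, std::string_view key,
                       std::unordered_map<std::string, std::string>& cache) {
  const std::string key_str(key);
  const auto it = cache.find(key_str);
  if (it != cache.end()) {
    return it->second;  // Already loaded during this run.
  }
  const std::filesystem::path file_path = dir / (key_str + ".md");
  std::ifstream file(file_path);
  if (!file.is_open()) {
    throw std::runtime_error("prompt file not found: " + file_path.string());
  }
  std::string content((std::istreambuf_iterator<char>(file)),
                      std::istreambuf_iterator<char>());
  if (content.empty()) {
    throw std::runtime_error("prompt file is empty: " + file_path.string());
  }
  cache.emplace(key_str, content);
  return content;
}
```

A consequence of caching by key is that edits to a prompt file made while the pipeline is running are not picked up until the next run.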

@@ -1,24 +0,0 @@
/**
* @file services/sqlite/build_database_path.cc
* @brief SqliteExportService::BuildDatabasePath() implementation.
*/
#include <filesystem>
#include <string>
#include "services/sqlite_export_service.h"
std::filesystem::path SqliteExportService::BuildDatabasePath() const {
std::filesystem::path base_filename("biergarten_seed_" + run_timestamp_utc_ +
".sqlite");
std::filesystem::path candidate =
std::filesystem::current_path() / base_filename;
for (int suffix = 1; std::filesystem::exists(candidate); ++suffix) {
candidate = std::filesystem::current_path() /
std::filesystem::path("biergarten_seed_" + run_timestamp_utc_ +
"-" + std::to_string(suffix) + ".sqlite");
}
return candidate;
}

@@ -5,9 +5,8 @@
 #include <stdexcept>
-#include "services/sqlite_export_service.h"
-#include "services/sqlite_export_service_helpers.h"
+#include "services/database/sqlite_export_service.h"
+#include "services/database/sqlite_export_service_helpers.h"
 void SqliteExportService::Finalize() {
   if (db_handle_ == nullptr) {

@@ -1,5 +1,6 @@
-#include "services/sqlite_connection_helpers.h"
+#include "services/database/sqlite_connection_helpers.h"
+#include <format>
 #include <stdexcept>
 namespace sqlite_export_service_internal {
@@ -10,7 +11,8 @@ void SqliteDatabaseDeleter::operator()(sqlite3* handle) const noexcept {
   }
 }
-void SqliteStatementDeleter::operator()(sqlite3_stmt* statement) const noexcept {
+void SqliteStatementDeleter::operator()(
+    sqlite3_stmt* statement) const noexcept {
   if (statement != nullptr) {
     sqlite3_finalize(statement);
   }
@@ -19,11 +21,10 @@ void SqliteStatementDeleter::operator()(sqlite3_stmt* statement) const noexcept
 void ThrowSqliteError(sqlite3* db_handle, std::string_view action) {
   const std::string message =
       db_handle != nullptr ? sqlite3_errmsg(db_handle) : "unknown SQLite error";
-  throw std::runtime_error(std::string(action) + ": " + message);
+  throw std::runtime_error(std::format("{}: {}", action, message));
 }
 SqliteDatabaseHandle OpenDatabase(const std::filesystem::path& path) {
   sqlite3* raw_handle = nullptr;
   const int result = sqlite3_open(path.string().c_str(), &raw_handle);
@@ -50,11 +51,12 @@ void ExecSql(const SqliteDatabaseHandle& db_handle, std::string_view sql,
             ? error_message
             : sqlite3_errmsg(db_handle.get());
     sqlite3_free(error_message);
-    throw std::runtime_error(std::string(action) + ": " + message);
+    throw std::runtime_error(std::format("{}: {}", action, message));
   }
 }
-void RollbackTransactionNoThrow(const SqliteDatabaseHandle& db_handle) noexcept {
+void RollbackTransactionNoThrow(
+    const SqliteDatabaseHandle& db_handle) noexcept {
   if (!db_handle) {
     return;
   }
@@ -63,4 +65,3 @@ void RollbackTransactionNoThrow(const SqliteDatabaseHandle& db_handle) noexcept
 }
 }  // namespace sqlite_export_service_internal

@@ -1,11 +1,12 @@
-#include "services/sqlite_statement_helpers.h"
-#include "services/sqlite_connection_helpers.h"
-#include <cstring>
-#include <memory>
-#include <limits>
-#include <stdexcept>
+#include "services/database/sqlite_statement_helpers.h"
 #include <boost/json.hpp>
+#include <cstring>
+#include <limits>
+#include <memory>
+#include <stdexcept>
+#include "services/database/sqlite_connection_helpers.h"
 namespace sqlite_export_service_internal {
@@ -86,16 +87,6 @@ sqlite3_int64 LastInsertRowId(const SqliteDatabaseHandle& db_handle) {
   return sqlite3_last_insert_rowid(db_handle.get());
 }
-std::string SerializeLocalLanguages(
-    const std::vector<std::string>& local_languages) {
-  boost::json::array array;
-  array.reserve(local_languages.size());
-  for (const auto& language : local_languages) {
-    array.emplace_back(language);
-  }
-  return boost::json::serialize(array);
-}
 std::string SerializeVector(const std::vector<std::string>& str_vec) {
   boost::json::array array(str_vec.size());
   for (const auto& s : str_vec) {
@@ -105,4 +96,3 @@ std::string SerializeVector(const std::vector<std::string>& str_vec) {
 }
 }  // namespace sqlite_export_service_internal

@@ -4,13 +4,27 @@
  */
 #include <filesystem>
+#include <format>
 #include <memory>
 #include <stdexcept>
 #include <string>
-#include "services/sqlite_export_service.h"
-#include "services/sqlite_export_service_helpers.h"
+#include "services/database/sqlite_export_service.h"
+#include "services/database/sqlite_export_service_helpers.h"
+std::filesystem::path SqliteExportService::BuildDatabasePath() const {
+  std::filesystem::path base_filename("biergarten_seed_" + run_timestamp_utc_ +
+                                      ".sqlite");
+  std::filesystem::path candidate = output_path_ / base_filename;
+  for (int suffix = 1; std::filesystem::exists(candidate); ++suffix) {
+    candidate = output_path_ /
+                std::filesystem::path(std::format("biergarten_seed_{}-{}.sqlite",
+                                                  run_timestamp_utc_, suffix));
+  }
+  return candidate;
+}
 void SqliteExportService::InitializeSchema() const {
   sqlite_export_service_internal::ExecSql(
@@ -46,7 +60,6 @@ void SqliteExportService::RollbackAndCloseNoThrow() noexcept {
   location_cache_.clear();
 }
 void SqliteExportService::Initialize() {
   if (db_handle_ != nullptr) {
     throw std::runtime_error("SQLite export service is already initialized");
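
Review note: `BuildDatabasePath()` in this file avoids overwriting an earlier export by appending `-1`, `-2`, and so on until the candidate name is free. A filesystem-free sketch of that naming rule, where `FirstFreeName` and the injected `exists` predicate are hypothetical and stand in for `std::filesystem::exists`:

```cpp
#include <functional>
#include <string>

// Hypothetical, filesystem-free version of the suffix loop: `exists` stands in
// for std::filesystem::exists so the naming rule can be exercised in isolation.
std::string FirstFreeName(
    const std::string& stem, const std::string& extension,
    const std::function<bool(const std::string&)>& exists) {
  std::string candidate = stem + extension;
  for (int suffix = 1; exists(candidate); ++suffix) {
    candidate = stem + "-" + std::to_string(suffix) + extension;
  }
  return candidate;
}
```

As in the original, the check and the later file creation are separate steps, so two processes starting with the same timestamp could in principle still pick the same name.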

@@ -3,11 +3,13 @@
  * @brief SqliteExportService::ProcessRecord() implementation.
  */
+#include <iomanip>
+#include <sstream>
 #include <stdexcept>
 #include <string>
-#include "services/sqlite_export_service.h"
-#include "services/sqlite_export_service_helpers.h"
+#include "services/database/sqlite_export_service.h"
+#include "services/database/sqlite_export_service_helpers.h"
constexpr int kLocationPrecision = 17; constexpr int kLocationPrecision = 17;

@@ -3,12 +3,13 @@
  * @brief SqliteExportService constructor and destructor implementation.
  */
-#include "services/sqlite_export_service.h"
+#include "services/database/sqlite_export_service.h"
 #include <memory>
-SqliteExportService::SqliteExportService()
-    : date_time_provider_(std::make_unique<SystemDateTimeProvider>()) {}
+SqliteExportService::SqliteExportService(const ApplicationOptions& options)
+    : date_time_provider_(std::make_unique<SystemDateTimeProvider>()),
+      output_path_(options.pipeline.output_path) {}
 SqliteExportService::~SqliteExportService() {
   if (db_handle_ != nullptr) {

@@ -1,61 +0,0 @@
/**
* @file wikipedia/fetch_extract.cc
* @brief WikipediaService::FetchExtract() implementation.
*/
#include <spdlog/spdlog.h>
#include <boost/json.hpp>
#include <string>
#include <string_view>
#include "services/wikipedia_service.h"
std::string WikipediaService::FetchExtract(std::string_view query) {
const std::string cache_key(query);
const auto cache_it = this->extract_cache_.find(cache_key);
if (cache_it != this->extract_cache_.end()) {
return cache_it->second;
}
const std::string encoded = this->client_->UrlEncode(cache_key);
const std::string url =
"https://en.wikipedia.org/w/api.php?action=query&titles=" + encoded +
"&prop=extracts&explaintext=1&format=json";
const std::string body = this->client_->Get(url);
boost::system::error_code parse_error;
boost::json::value doc = boost::json::parse(body, parse_error);
if (!parse_error && doc.is_object()) {
try {
auto& pages = doc.at("query").at("pages").get_object();
if (!pages.empty()) {
auto& page = pages.begin()->value().get_object();
if (page.contains("extract") && page.at("extract").is_string()) {
const std::string_view extract_view = page.at("extract").as_string();
std::string extract(extract_view);
spdlog::debug("WikipediaService fetched {} chars for '{}'",
extract.size(), query);
this->extract_cache_.emplace(cache_key, extract);
return extract;
}
}
this->extract_cache_.emplace(cache_key, std::string{});
} catch (const std::exception& e) {
spdlog::warn(
"WikipediaService: failed to parse response structure for '{}': "
"{}",
query, e.what());
return {};
}
} else if (parse_error) {
spdlog::warn("WikipediaService: JSON parse error for '{}': {}", query,
parse_error.message());
}
return {};
}

@@ -1,47 +0,0 @@
/**
* @file wikipedia/get_summary.cc
* @brief WikipediaService::GetLocationContext() implementation.
*/
#include <spdlog/spdlog.h>
#include <string>
#include "services/wikipedia_service.h"
std::string WikipediaService::GetLocationContext(const Location& loc) {
if (!client_) {
return {};
}
std::string result;
std::string region_query(loc.city);
if (!loc.country.empty()) {
region_query += ", ";
region_query += loc.country;
}
const std::string beer_query = "beer in " + loc.country;
const std::string city_beer_query = "beer in " + loc.city;
auto append_extract = [&result](const std::string& extract) -> void {
if (extract.empty()) {
return;
}
if (!result.empty()) {
result += "\n\n";
}
result += extract;
};
try {
append_extract(FetchExtract(region_query));
append_extract(FetchExtract(beer_query));
append_extract(FetchExtract(city_beer_query));
} catch (const std::runtime_error& e) {
spdlog::debug("WikipediaService lookup failed for '{}': {}", region_query,
e.what());
}
return result;
}

@@ -1,11 +0,0 @@
/**
* @file services/wikipedia/wikipedia_service.cc
* @brief WikipediaService constructor implementation.
*/
#include "services/wikipedia_service.h"
#include <utility>
WikipediaService::WikipediaService(std::unique_ptr<WebClient> client)
: client_(std::move(client)) {}

@@ -1,19 +0,0 @@
/**
* @file web_client/curl_global_state.cc
* @brief CurlGlobalState constructor and destructor implementation.
*/
#include <curl/curl.h>
#include <stdexcept>
#include "web_client/curl_web_client.h"
CurlGlobalState::CurlGlobalState() {
if (curl_global_init(CURL_GLOBAL_DEFAULT) != CURLE_OK) {
throw std::runtime_error(
"[CURLWebClient] Failed to initialize libcurl globally");
}
}
CurlGlobalState::~CurlGlobalState() { curl_global_cleanup(); }

@@ -1,86 +0,0 @@
/**
* @file web_client/curl_web_client_get.cc
* @brief CURLWebClient::Get() implementation.
*/
#include <curl/curl.h>
#include <cstdint>
#include <limits>
#include <memory>
#include <stdexcept>
#include <string>
#include "web_client/curl_web_client.h"
using CurlHandle = std::unique_ptr<CURL, decltype(&curl_easy_cleanup)>;
static constexpr long kConnectionTimeout = 10;
static constexpr long kRequestTimeout = 30;
static constexpr int32_t kOkHttpStatus = 200;
static CurlHandle CreateHandle() {
CURL* handle = curl_easy_init();
if (handle == nullptr) {
throw std::runtime_error(
"[CURLWebClient] Failed to initialize libcurl handle");
}
return {handle, &curl_easy_cleanup};
}
static void SetCommonGetOptions(CURL* curl, const std::string& url) {
curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
curl_easy_setopt(curl, CURLOPT_USERAGENT, "biergarten-pipeline/0.1.0");
curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
curl_easy_setopt(curl, CURLOPT_MAXREDIRS, 5L);
curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT, kConnectionTimeout);
curl_easy_setopt(curl, CURLOPT_TIMEOUT, kRequestTimeout);
curl_easy_setopt(curl, CURLOPT_ACCEPT_ENCODING, "gzip");
}
// curl write callback that appends response data into a std::string
static size_t WriteCallbackString(void* contents, const size_t size,
const size_t nmemb, void* userp) {
const size_t real_size = size * nmemb;
auto* str = static_cast<std::string*>(userp);
str->append(static_cast<char*>(contents), real_size);
return real_size;
}
std::string CURLWebClient::Get(const std::string& url) {
const CurlHandle curl = CreateHandle();
std::string response_string;
SetCommonGetOptions(curl.get(), url);
curl_easy_setopt(curl.get(), CURLOPT_WRITEFUNCTION, WriteCallbackString);
curl_easy_setopt(curl.get(), CURLOPT_WRITEDATA, &response_string);
CURLcode curl_result = curl_easy_perform(curl.get());
if (curl_result != CURLE_OK) {
const auto error = std::string("[CURLWebClient] GET failed: ") +
curl_easy_strerror(curl_result);
throw std::runtime_error(error);
}
long curl_http_code = 0;
curl_easy_getinfo(curl.get(), CURLINFO_RESPONSE_CODE, &curl_http_code);
if (curl_http_code < std::numeric_limits<int32_t>::min() ||
curl_http_code > std::numeric_limits<int32_t>::max()) {
throw std::runtime_error("[CURLWebClient] Invalid HTTP status code: " +
std::to_string(curl_http_code));
}
const int32_t http_code = static_cast<int32_t>(curl_http_code);
if (http_code != kOkHttpStatus) {
const std::string error = "[CURLWebClient] HTTP error " +
std::to_string(http_code) + " for URL " + url;
throw std::runtime_error(error);
}
return response_string;
}

@@ -1,24 +0,0 @@
/**
* @file web_client/curl_web_client_url_encode.cc
* @brief CURLWebClient::UrlEncode() implementation.
*/
#include <curl/curl.h>
#include <stdexcept>
#include <string>
#include "web_client/curl_web_client.h"
std::string CURLWebClient::UrlEncode(const std::string& value) {
// A NULL handle is fine for UTF-8 encoding according to libcurl docs.
char* output = curl_easy_escape(nullptr, value.c_str(), 0);
if (!output) {
throw std::runtime_error("[CURLWebClient] curl_easy_escape failed");
}
std::string result(output);
curl_free(output);
return result;
}

@@ -0,0 +1,73 @@
/**
* @file web_client/http_web_client.cc
* @brief cpp-httplib implementation of WebClient.
*/
#include "web_client/http_web_client.h"
#include <httplib.h>
#include <chrono>
#include <format>
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include "services/logging/logger.h"
namespace {
constexpr time_t kConnectionTimeoutSeconds = 5;
constexpr time_t kReadTimeoutSeconds = 10;
constexpr int kSuccessMin = 200;
constexpr int kSuccessMax = 300;
const std::regex kUrlRegex(
R"(^(https?://[^/?#]+)(/[^?#]*(?:\?[^#]*)?(?:#.*)?)?)");
std::pair<std::string, std::string> SplitUrl(const std::string& url) {
std::smatch match;
if (!std::regex_match(url, match, kUrlRegex)) {
throw std::invalid_argument("[HttpWebClient] Malformed URL: " + url);
}
return {match[1].str(), match[2].matched ? match[2].str() : "/"};
}
} // namespace
std::string HttpWebClient::Get(const std::string& url) {
const auto [origin, path] = SplitUrl(url);
httplib::Client client(origin);
client.set_follow_location(true);
client.set_connection_timeout(kConnectionTimeoutSeconds);
client.set_read_timeout(kReadTimeoutSeconds);
client.set_default_headers({{"Accept", "application/json"},
{"User-Agent", "biergarten-pipeline/1.0"}});
const httplib::Result result = client.Get(path);
if (!result) {
throw std::runtime_error(std::format(
"[HttpWebClient] Request failed for URL: {} — {}", url,
httplib::to_string(result.error())));
}
if (result->status < kSuccessMin || result->status >= kSuccessMax) {
if (logger_) {
logger_->Log(
{.level = LogLevel::Error,
.phase = PipelinePhase::UserGeneration,
.message =
std::format("[HttpWebClient] Request failed for URL: {}", url)});
}
throw std::runtime_error(std::format("[HttpWebClient] HTTP {} for URL: {}",
result->status, url));
}
return result->body;
}
std::string HttpWebClient::EncodeURL(const std::string& value) {
return httplib::encode_uri_component(value);
}
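
Review note: `SplitUrl()` above separates a URL into an origin for the `httplib::Client` constructor and a path (with query and fragment) for `Get()`. The following standalone sketch reuses the same regular expression to show what it accepts. `SplitUrlSketch` is a hypothetical copy made only for illustration:

```cpp
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical standalone copy of the helper, using the regex from the diff.
// Group 1 is scheme plus authority; group 2 is path, query, and fragment.
std::pair<std::string, std::string> SplitUrlSketch(const std::string& url) {
  static const std::regex kUrlRegex(
      R"(^(https?://[^/?#]+)(/[^?#]*(?:\?[^#]*)?(?:#.*)?)?)");
  std::smatch match;
  if (!std::regex_match(url, match, kUrlRegex)) {
    throw std::invalid_argument("Malformed URL: " + url);
  }
  // A URL with no path component falls back to "/".
  return {match[1].str(), match[2].matched ? match[2].str() : "/"};
}
```

Two behaviors follow from the pattern. A bare origin such as `http://example.org` yields the path `/`. A URL that carries a query but no path, such as `https://example.org?x=1`, does not match and is reported as malformed, because the second group must start with `/`. The Wikipedia URLs built in this PR always include `/w/api.php`, so they are unaffected.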