5 Commits

Author SHA1 Message Date
Aaron Po
271c6fa99f update docs 2026-05-01 18:30:38 -04:00
Aaron Po
316fda1775 codebase formatting 2026-05-01 17:40:37 -04:00
Aaron Po
91e18888fe readability updates: remove magic numbers, update comments 2026-05-01 17:38:16 -04:00
Aaron Po
9051f55114 add prompt dir app option 2026-05-01 12:25:05 -04:00
Aaron Po
01849062d5 Refactor ApplicationOptions to separate config concerns 2026-05-01 00:40:21 -04:00
71 changed files with 980 additions and 1393 deletions

2
.gitattributes vendored
View File

@@ -1 +1 @@
archive/** linguist-vendored
archive/* linguist-vendored

View File

@@ -18,7 +18,6 @@ descriptions via a local GGUF model or a deterministic mock.
- [Build](#build)
- [Model](#model)
- [Run](#run)
- [Docker / RunPod](#docker--runpod)
- [Architecture](#architecture)
- [Pipeline Stages](#pipeline-stages)
- [Key Components](#key-components)
@@ -52,7 +51,7 @@ step.
### Build
Requirements: C++20 compiler, CMake 3.31+, OpenSSL, Boost (JSON and
Requirements: C++20 compiler, CMake 3.24+, libcurl, Boost (JSON and
ProgramOptions). SQLite is fetched from the upstream amalgamation, so no system
SQLite package is required.
@@ -61,16 +60,6 @@ cmake -S . -B build
cmake --build build
```
CMake automatically detects whether a compatible llama.cpp installation is
present on the system (`libllama`, `libggml`, `libggml-base`, and `llama.h`
visible on the default search paths). If found, it links against those
libraries and skips the FetchContent build. If not found, it fetches and builds
llama.cpp from source at tag `b9012`. No additional flags are required in
either case.
Metal is enabled automatically on Apple Silicon. CUDA or HIP/ROCm is detected
automatically on Linux when the relevant toolkit is present.
### Model
> Skip this step if you only need `--mocked`.
@@ -85,124 +74,33 @@ curl -L \
### Run
Run from `build/` so the copied `locations.json` and `prompts/` are available.
Each run writes a fresh dated SQLite file such as
Each run also writes a fresh dated SQLite file such as
`biergarten_seed_2026-04-19T15-30-45.123456Z.sqlite` into the working directory.
```bash
./biergarten-pipeline --mocked
./biergarten-pipeline \
--model ../models/google_gemma-4-E4B-it-Q6_K.gguf \
--prompt-dir prompts \
--temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1
./biergarten-pipeline --model models/google_gemma-4-E4B-it-Q6_K.gguf --temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1
```
#### CLI Flags
| Flag | Purpose |
| --------------- | ---------------------------------------------------------------------------------------------------- |
| `--mocked` | Deterministic mock generator, no model required. |
| `--model, -m` | Path to a GGUF file. Required unless `--mocked` is set. |
| `--prompt-dir` | Directory containing prompt files (e.g. `BREWERY_GENERATION.md`). Required unless `--mocked` is set. |
| `--output, -o` | Directory for generated SQLite artifacts. Default: `output`. |
| `--log-path` | Path for application logs. Default: `pipeline.log`. |
| `--temperature` | Sampling temperature. Default: `1.0`. |
| `--top-p` | Nucleus sampling. Default: `0.95`. |
| `--top-k` | Top-k sampling. Default: `64`. |
| `--n-ctx` | Context window size. Default: `8192`. |
| `--seed` | Random seed. Default: `-1` (random at runtime). |
| `--help, -h` | Print usage and exit. |
| Flag | Purpose |
| --------------- | ------------------------------------------------------- |
| `--mocked` | Deterministic mock generator, no model required. |
| `--model, -m` | Path to a GGUF file. Required unless `--mocked` is set. |
| `--temperature` | Sampling temperature. Default: `1.0`. |
| `--top-p` | Nucleus sampling. Default: `0.95`. |
| `--top-k` | Top-k sampling. Default: `64`. |
| `--n-ctx` | Context window size. Default: `8192`. |
| `--seed` | Random seed. Default: `-1` (random at runtime). |
| `--help, -h` | Print usage and exit. |
`--mocked` and `--model` are mutually exclusive. Omitting both exits with an
error before the pipeline starts. Sampling flags are ignored when `--mocked` is
set.
The post-build step copies `prompts/` into `build/prompts/`. Rebuild after
editing any prompt file.
---
## Docker / RunPod
The `tooling/pipeline/runpod/` directory contains a GPU-ready container
configuration for running the pipeline on RunPod or any Docker host with an
NVIDIA GPU.
### How it works
The container uses a two-stage build. The first stage pulls prebuilt
`libllama`, `libggml`, and backend plugin libraries (including `libggml-cuda.so`
and the CPU variant plugins) from `ghcr.io/ggml-org/llama.cpp:full-cuda`. The
second stage copies those libraries into `/usr/local/lib` and runs `ldconfig` so
the dynamic linker and `dlopen` calls from `ggml_backend_load_all()` can resolve
the CUDA backend plugin at runtime. llama.cpp headers are cloned at the matching
tag and installed into `/usr/local/include`. CMake auto-detects both and skips
the FetchContent source build entirely, keeping image build times short.
`GGML_BACKEND_PATH` is set to `/usr/local/lib` so llama.cpp knows where to scan
for backend plugins.
### Build the image
Run from the `tooling/pipeline/` directory (the CMake project root), not from
inside `runpod/`, so the `COPY . .` step picks up the full project context.
```bash
docker build -t biergarten-pipeline:latest -f runpod/Dockerfile .
```
To monitor the full build output and confirm CMake selects the system llama.cpp:
```bash
docker build \
--progress=plain \
--no-cache \
-t biergarten-pipeline:latest \
-f runpod/Dockerfile \
. 2>&1 | tee build.log
```
Look for `[biergarten] Found system llama.cpp — skipping FetchContent` in the
output to confirm the fast path was taken.
### Run in mocked mode
No model or GPU required. Useful for validating the pipeline logic and SQLite
export path.
```bash
docker run --rm \
-e BIERGARTEN_MODE=mocked \
-v "$PWD/output:/workspace/output" \
-v "$PWD/logs:/workspace/logs" \
biergarten-pipeline:latest
```
### Run in live mode
Mount your GGUF model before starting. The container validates the model path
before launching the binary.
```bash
docker run --rm \
--runtime=nvidia \
-e BIERGARTEN_MODE=live \
-e GGML_BACKEND_PATH="/usr/local/lib/libggml-cuda.so" \
-v "$PWD/models:/workspace/models" \
-v "$PWD/output:/workspace/output" \
-v "$PWD/logs:/workspace/logs" \
biergarten-pipeline:latest
```
The model must be present at `./models/google_gemma-4-E4B-it-Q6_K.gguf` on the
host. See [Model](#model) above for the download command.
### RunPod deployment
Use a GPU pod template. Mount persistent storage for `/workspace/models`,
`/workspace/output`, and `/workspace/logs`. Set `BIERGARTEN_MODE=live` in the
template environment. See `tooling/pipeline/runpod/pod-template.yaml` for a
starter template.
editing `prompts/system.md`.
---
@@ -299,18 +197,16 @@ code, latitude, and longitude for each entry.
## Tech Stack
- C++20
- CMake 3.31+
- CMake 3.24+
- Boost.JSON, Boost.ProgramOptions, Boost.DI
- spdlog
- cpp-httplib (with OpenSSL)
- libcurl
- SQLite amalgamation fetched and compiled via CMake FetchContent
- llama.cpp (auto-detected from system install or fetched via FetchContent)
- Docker with NVIDIA CUDA 12.6 base image for GPU container builds
- RunPod for cloud GPU inference
- llama.cpp
The build fetches Boost.DI, spdlog, and SQLite via CMake. llama.cpp is fetched
only when a system installation is not detected. Metal is enabled on Apple
Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
The build fetches Boost.DI, spdlog, llama.cpp, and SQLite via CMake. Metal is
enabled on Apple Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit
is present.
> **Code Style:** Modern C++20 throughout — RAII for ownership,
> `std::unique_ptr` for injected dependencies, `std::optional` for parse
@@ -322,7 +218,7 @@ Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
## Tested Hardware
### ARM macOS M1 Pro
### ARM macOS - M1 Pro
| | |
| --------- | --------------------------------- |
@@ -333,7 +229,7 @@ Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
| Model | Gemma 4 E4B |
| Inference | llama.cpp with Metal |
### x86_64 Linux NVIDIA RTX 2000
### x86_64 Linux - NVIDIA RTX 2000
| | |
| --------- | ------------------------------ |
@@ -344,15 +240,6 @@ Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
| Model | Gemma 4 E4B |
| Inference | llama.cpp with CUDA 12.x |
### x86_64 Linux — Docker / RunPod (NVIDIA CUDA)
| | |
| --------- | ------------------------------------------- |
| Host | RunPod GPU pod |
| Base | nvidia/cuda:12.6.3-devel-ubuntu24.04 |
| Model | Gemma 4 E4B Q6_K |
| Inference | llama.cpp prebuilt CUDA backends via dlopen |
---
## Fixture Strategy
@@ -373,9 +260,8 @@ Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
| `includes/` | Public headers and shared models. |
| `src/` | Implementation files. |
| `locations.json` | Curated city input copied into the build tree. |
| `prompts/` | System prompts used by the model-backed path. |
| `prompts/` | System prompt used by the model-backed path. |
| `diagrams/` | Architecture and pipeline diagrams. |
| `tooling/pipeline/runpod/` | Dockerfile, launcher, and RunPod pod template. |
| `ETHICS-AND-KNOWN-ISSUES.md` | Ethics, bias, hallucination analysis, mitigations. |
---
@@ -390,7 +276,6 @@ Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
- `src/data_generation/llama/` — local inference, prompt loading, output
validation.
- `src/data_generation/mock/` — deterministic fallback.
- `tooling/pipeline/runpod/` — container build and runtime launcher.
---

View File

@@ -29,7 +29,7 @@ if (Are arguments valid?) then (no)
else (yes)
endif
:Init OpenSSL global state & LlamaBackendState;
:Init CurlGlobalState & LlamaBackendState;
:di::make_injector(...);
:injector.create<std::unique_ptr<BiergartenDataGenerator>>();
:BiergartenDataGenerator::Run();

View File

@@ -52,7 +52,7 @@ interface WebClient <<interface>> {
+ UrlEncode(value : const std::string&) : std::string
}
class HttpWebClient {
class CURLWebClient {
+ Get(url : const std::string&) : std::string
+ UrlEncode(value : const std::string&) : std::string
}
@@ -130,7 +130,7 @@ BiergartenDataGenerator *-- IExportService : owns
IEnrichmentService <|.. WikipediaService : implements
WikipediaService *-- WebClient : owns
WebClient <|.. HttpWebClient : implements
WebClient <|.. CURLWebClient : implements
DataGenerator <|.. MockGenerator : implements
DataGenerator <|.. LlamaGenerator : implements

View File

@@ -13,7 +13,7 @@ if (Invalid args?) then (yes)
stop
else (no)
endif
:Init OpenSSL global state & LlamaBackendState;
:Init CurlGlobalState & LlamaBackendState;
:Build DI injector;
:Initialize SqliteExportService;

View File

@@ -356,7 +356,7 @@ package "Infrastructure: Enrichment" {
+ UrlEncode(value : const std::string&) : std::string
}
class HttpWebClient {
class CURLWebClient {
+ Get(url : const std::string&) : std::string
+ UrlEncode(value : const std::string&) : std::string
}
@@ -520,7 +520,7 @@ CheckinDistributionStrategy <|.. RandomCheckinStrategy
FollowGenerationStrategy <|.. RandomFollowStrategy
FollowGenerationStrategy <|.. ActivityWeightedFollowStrategy
EnrichmentService <|.. WikipediaService
WebClient <|.. HttpWebClient
WebClient <|.. CURLWebClient
DataGenerator <|.. MockGenerator
DataGenerator <|.. LlamaGenerator
PromptFormatter <|.. Gemma4JinjaPromptFormatter

View File

@@ -1,9 +0,0 @@
build/
cmake-build-debug/
.git/
.idea/
**/*.sqlite
**/*.log
**/*.sqlite3
**/*.db

View File

@@ -1,255 +1,178 @@
cmake_minimum_required(VERSION 3.31)
cmake_minimum_required(VERSION 3.24)
project(biergarten-pipeline)
# Set policy to allow FetchContent_Populate for header-only libraries
# that have outdated CMakeLists.txt files
cmake_policy(SET CMP0169 OLD)
set(CMAKE_POLICY_VERSION_MINIMUM 3.5 CACHE STRING "" FORCE)
# 1. Build Options
option(BIERGARTEN_MOCK_ONLY "Build with mock data generators only — skips llama.cpp" OFF)
if(BIERGARTEN_MOCK_ONLY)
message(STATUS "[biergarten] MOCK_ONLY build — llama.cpp will not be compiled.")
endif()
# 2. Platform & GPU Detection
if(NOT UNIX)
message(FATAL_ERROR "[biergarten] Windows is not supported. Please use Linux (Fedora 43) or macOS (M1 Pro).")
# =============================================================================
# 1. Platform & GPU Detection
# =============================================================================
if(WIN32)
message(FATAL_ERROR "[biergarten] Windows is currently not supported. Please use Linux (Fedora 43) or macOS (M1 Pro).")
endif()
if(APPLE)
if(CMAKE_SYSTEM_PROCESSOR MATCHES "arm64")
message(STATUS "[biergarten] Apple Silicon detected — enabling Metal acceleration.")
set(GGML_METAL ON CACHE BOOL "Enable Metal for Apple Silicon" FORCE)
else()
message(STATUS "[biergarten] Intel Mac detected — using CPU / Accelerate framework.")
set(GGML_METAL OFF CACHE BOOL "Disable Metal for Intel Macs" FORCE)
endif()
else()
find_package(CUDAToolkit QUIET)
find_package(hip CONFIG QUIET)
if(CMAKE_SYSTEM_PROCESSOR MATCHES "arm64")
message(STATUS "[biergarten] Apple Silicon detected — enabling Metal acceleration.")
set(GGML_METAL ON CACHE BOOL "Enable Metal for Apple Silicon" FORCE)
else()
message(STATUS "[biergarten] Intel Mac detected — using CPU / Accelerate framework.")
set(GGML_METAL OFF CACHE BOOL "Disable Metal for Intel Macs" FORCE)
endif()
elseif(UNIX AND NOT APPLE)
find_package(CUDAToolkit QUIET)
find_package(HIP QUIET)
if(CUDAToolkit_FOUND)
message(STATUS "[biergarten] NVIDIA GPU detected — enabling CUDA acceleration.")
set(GGML_CUDA ON CACHE BOOL "Enable CUDA for NVIDIA GPUs" FORCE)
set(CMAKE_CUDA_ARCHITECTURES native)
elseif(hip_FOUND OR DEFINED ENV{ROCM_PATH} OR EXISTS "/opt/rocm")
message(STATUS "[biergarten] AMD GPU detected — enabling HIP/ROCm acceleration.")
set(GGML_HIPBLAS ON CACHE BOOL "Enable HIP for AMD GPUs" FORCE)
else()
message(STATUS "[biergarten] No NVIDIA or AMD GPU found — falling back to CPU.")
endif()
if(CUDAToolkit_FOUND)
message(STATUS "[biergarten] NVIDIA GPU detected — enabling CUDA acceleration.")
set(GGML_CUDA ON CACHE BOOL "Enable CUDA for NVIDIA GPUs" FORCE)
set(CMAKE_CUDA_ARCHITECTURES native)
elseif(HIP_FOUND OR EXISTS "/opt/rocm")
message(STATUS "[biergarten] AMD GPU detected — enabling HIP/ROCm acceleration.")
set(GGML_HIPBLAS ON CACHE BOOL "Enable HIP for AMD GPUs" FORCE)
else()
message(STATUS "[biergarten] No NVIDIA or AMD GPU found — falling back to CPU.")
endif()
endif()
# 3. Project-wide Settings
# =============================================================================
# 2. Project-wide Settings (Standard & Optimization)
# =============================================================================
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
# Release Build Optimization: Aggressive (-O3), Arch-specific, and LTO
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3 -march=native -flto")
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -Og -g")
# 4. Dependencies
# =============================================================================
# 3. Dependencies
# =============================================================================
include(FetchContent)
# Boost (system install — via dnf/brew)
find_package(CURL QUIET)
if(NOT CURL_FOUND)
message(FATAL_ERROR "[biergarten] libcurl not found. Install it (e.g. 'sudo dnf install libcurl-devel').")
endif()
# Require system Boost for JSON and Program Options to speed up build times
find_package(Boost REQUIRED COMPONENTS json program_options)
# Boost.DI (unofficial Boost extension, must declare separately from main Boost dependency)
# Header-only library, so we only fetch without invoking its CMakeLists.txt
FetchContent_Declare(
boost-di
GIT_REPOSITORY https://github.com/boost-ext/di.git
GIT_TAG v1.3.0
GIT_SHALLOW TRUE
sqlite_amalgamation
URL https://www.sqlite.org/2026/sqlite-amalgamation-3530000.zip
URL_HASH SHA3_256=c2325c53b3b41761469f91cfb078e96882ac5d85bac10c11b0bd8f253b031e5b
)
FetchContent_GetProperties(boost-di)
if(NOT boost-di_POPULATED)
FetchContent_Populate(boost-di)
FetchContent_GetProperties(sqlite_amalgamation)
if(NOT sqlite_amalgamation_POPULATED)
FetchContent_Populate(sqlite_amalgamation)
endif()
add_library(boost_di INTERFACE)
add_library(boost::di ALIAS boost_di)
target_include_directories(boost_di INTERFACE
$<BUILD_INTERFACE:${boost-di_SOURCE_DIR}/include>
)
# SQLite amalgamation
FetchContent_Declare(
sqlite_amalgamation
URL https://www.sqlite.org/2026/sqlite-amalgamation-3530000.zip
URL_HASH SHA3_256=c2325c53b3b41761469f91cfb078e96882ac5d85bac10c11b0bd8f253b031e5b
EXCLUDE_FROM_ALL
)
FetchContent_MakeAvailable(sqlite_amalgamation)
if(NOT TARGET sqlite3)
add_library(sqlite3 STATIC ${sqlite_amalgamation_SOURCE_DIR}/sqlite3.c)
target_include_directories(sqlite3 PUBLIC ${sqlite_amalgamation_SOURCE_DIR})
target_compile_definitions(sqlite3 PUBLIC SQLITE_THREADSAFE=1)
add_library(sqlite3 STATIC
${sqlite_amalgamation_SOURCE_DIR}/sqlite3.c
)
target_include_directories(sqlite3 PUBLIC
${sqlite_amalgamation_SOURCE_DIR}
)
target_compile_definitions(sqlite3 PUBLIC
SQLITE_THREADSAFE=1
)
endif()
# llama.cpp — skipped for mock-only builds
if(NOT BIERGARTEN_MOCK_ONLY)
find_library(LLAMA_LIB NAMES llama)
find_library(GGML_LIB NAMES ggml)
find_library(GGML_BASE_LIB NAMES ggml-base)
find_path(LLAMA_INC_DIR NAMES llama.h PATH_SUFFIXES include)
if(LLAMA_LIB AND GGML_LIB AND GGML_BASE_LIB AND LLAMA_INC_DIR)
message(STATUS "[biergarten] Found system llama.cpp — skipping FetchContent")
add_library(llama SHARED IMPORTED)
set_target_properties(llama PROPERTIES
IMPORTED_LOCATION "${LLAMA_LIB}"
INTERFACE_INCLUDE_DIRECTORIES "${LLAMA_INC_DIR}"
INTERFACE_LINK_LIBRARIES "${GGML_LIB};${GGML_BASE_LIB}"
)
else()
message(STATUS "[biergarten] System llama.cpp not found — fetching via FetchContent")
FetchContent_Declare(
llama-cpp
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp.git
GIT_TAG b9012
)
FetchContent_MakeAvailable(llama-cpp)
endif()
endif()
# spdlog
FetchContent_Declare(
spdlog
GIT_REPOSITORY https://github.com/gabime/spdlog.git
GIT_TAG v1.15.3
llama-cpp
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp.git
GIT_TAG b8742
)
FetchContent_MakeAvailable(llama-cpp)
FetchContent_Declare(
boost-di
GIT_REPOSITORY https://github.com/boost-ext/di.git
GIT_TAG v1.3.0
)
FetchContent_MakeAvailable(boost-di)
if(TARGET Boost.DI AND NOT TARGET boost::di)
add_library(boost::di ALIAS Boost.DI)
endif()
FetchContent_Declare(
spdlog
GIT_REPOSITORY https://github.com/gabime/spdlog.git
GIT_TAG v1.15.3
)
FetchContent_MakeAvailable(spdlog)
# cpp-httplib — header-only HTTP/HTTPS client replacing libcurl.
# OpenSSL is required for HTTPS (Wikipedia API). find_package locates
# libssl/libcrypto; HTTPLIB_REQUIRE_OPENSSL causes a hard build failure
# if OpenSSL is absent rather than silently producing an HTTP-only binary.
find_package(OpenSSL REQUIRED)
FetchContent_Declare(
cpp-httplib
GIT_REPOSITORY https://github.com/yhirose/cpp-httplib.git
GIT_TAG v0.43.2
GIT_SHALLOW TRUE
SYSTEM
)
set(HTTPLIB_REQUIRE_OPENSSL ON CACHE BOOL "Require OpenSSL for cpp-httplib" FORCE)
FetchContent_MakeAvailable(cpp-httplib)
# 5. Executable & Sources
add_executable(${PROJECT_NAME}
includes/services/enrichment/mock_enrichment.h)
# --- Entry point ---
target_sources(${PROJECT_NAME} PRIVATE
src/main.cc
# =============================================================================
# 4. Sources
# =============================================================================
set(SOURCES
src/main.cc
src/biergarten_data_generator/biergarten_data_generator.cc
src/biergarten_data_generator/run.cc
src/biergarten_data_generator/query_cities_with_countries.cc
src/biergarten_data_generator/generate_breweries.cc
src/biergarten_data_generator/log_results.cc
src/services/wikipedia/wikipedia_service.cc
src/services/wikipedia/get_summary.cc
src/services/wikipedia/fetch_extract.cc
src/services/sqlite/sqlite_export_service.cc
src/services/sqlite/build_database_path.cc
src/services/sqlite/process_record.cc
src/services/sqlite/initialize.cc
src/services/sqlite/finalize.cc
src/web_client/curl_global_state.cc
src/web_client/curl_web_client_get.cc
src/web_client/curl_web_client_url_encode.cc
src/data_generation/llama/llama_generator.cc
src/data_generation/llama/generate_brewery.cc
src/data_generation/llama/generate_user.cc
src/data_generation/llama/helpers.cc
src/data_generation/llama/infer.cc
src/data_generation/llama/load.cc
src/services/prompt_directory.cc
src/data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.cc
src/data_generation/mock/deterministic_hash.cc
src/data_generation/mock/generate_brewery.cc
src/data_generation/mock/generate_user.cc
src/json_handling/json_loader.cc
src/services/sqlite/helpers/sqlite_connection_helpers.cpp
src/services/sqlite/helpers/sqlite_statement_helpers.cpp
)
# --- json_handling ---
target_sources(${PROJECT_NAME} PRIVATE
src/json_handling/json_loader.cc
)
# --- application_options ---
target_sources(${PROJECT_NAME} PRIVATE
src/application_options/parse_arguments.cc
)
# --- biergarten_data_generator ---
target_sources(${PROJECT_NAME} PRIVATE
src/biergarten_data_generator/log_results.cc
src/biergarten_data_generator/biergarten_data_generator.cc
src/biergarten_data_generator/generate_breweries.cc
src/biergarten_data_generator/run.cc
src/biergarten_data_generator/query_cities_with_countries.cc
)
# --- web_client ---
target_sources(${PROJECT_NAME} PRIVATE
src/web_client/http_web_client.cc
)
# --- data_generation: prompt_formatting ---
target_sources(${PROJECT_NAME} PRIVATE
src/data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.cc
)
# --- data_generation: mock ---
target_sources(${PROJECT_NAME} PRIVATE
src/data_generation/mock/generate_brewery.cc
src/data_generation/mock/generate_user.cc
src/data_generation/mock/deterministic_hash.cc
)
# --- data_generation: llama (skipped for mock-only builds) ---
if(NOT BIERGARTEN_MOCK_ONLY)
target_sources(${PROJECT_NAME} PRIVATE
src/data_generation/llama/load.cc
src/data_generation/llama/helpers.cc
src/data_generation/llama/generate_brewery.cc
src/data_generation/llama/infer.cc
src/data_generation/llama/llama_generator.cc
src/data_generation/llama/generate_user.cc
)
endif()
# --- services: wikipedia ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/enrichment/wikipedia/wikipedia_service.cc
src/services/enrichment/wikipedia/fetch_extract.cc
src/services/enrichment/wikipedia/get_summary.cc
)
# --- services: sqlite ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/sqlite/process_record.cc
src/services/sqlite/sqlite_export_service.cc
src/services/sqlite/finalize.cc
src/services/sqlite/initialize.cc
src/services/sqlite/helpers/sqlite_connection_helpers.cc
src/services/sqlite/helpers/sqlite_statement_helpers.cc
)
# --- services (top-level) ---
target_sources(${PROJECT_NAME} PRIVATE
src/services/prompt_directory.cc
)
# 6. Include Directories, Link Libraries & Compile Definitions
# =============================================================================
# 5. Target
# =============================================================================
add_executable(${PROJECT_NAME} ${SOURCES})
target_include_directories(${PROJECT_NAME} PRIVATE
includes
includes
${llama-cpp_SOURCE_DIR}/include
${llama-cpp_SOURCE_DIR}/common
)
target_link_libraries(${PROJECT_NAME} PRIVATE
$<$<NOT:$<BOOL:${BIERGARTEN_MOCK_ONLY}>>:llama>
boost::di
Boost::json
Boost::program_options
spdlog::spdlog
sqlite3
httplib::httplib
OpenSSL::SSL
OpenSSL::Crypto
llama
boost::di
Boost::json
Boost::program_options
spdlog::spdlog
sqlite3
CURL::libcurl
)
target_compile_definitions(${PROJECT_NAME} PRIVATE
# Defined when -DBIERGARTEN_MOCK_ONLY=ON — skips llama.cpp entirely.
# Use #ifdef BIERGARTEN_MOCK_ONLY in source to guard llama-specific code.
$<$<BOOL:${BIERGARTEN_MOCK_ONLY}>:BIERGARTEN_MOCK_ONLY>
# Defined for Debug configuration builds.
# Use #ifdef DEBUG in source to enable debug-only behaviour (e.g. verbose logging).
$<$<CONFIG:Debug>:DEBUG>
)
# 7. Runtime Assets
# =============================================================================
# 6. Runtime Assets
# =============================================================================
configure_file(
${CMAKE_SOURCE_DIR}/locations.json
${CMAKE_BINARY_DIR}/locations.json
COPYONLY
${CMAKE_SOURCE_DIR}/locations.json
${CMAKE_BINARY_DIR}/locations.json
COPYONLY
)
add_custom_command(TARGET ${PROJECT_NAME} POST_BUILD
COMMAND ${CMAKE_COMMAND} -E copy_directory
${CMAKE_SOURCE_DIR}/prompts
${CMAKE_BINARY_DIR}/prompts
COMMAND ${CMAKE_COMMAND} -E copy_directory
${CMAKE_SOURCE_DIR}/prompts
${CMAKE_BINARY_DIR}/prompts
)

View File

@@ -11,9 +11,11 @@
#include <vector>
#include "data_generation/data_generator.h"
#include "data_model/generated_models.h"
#include "services/database/export_service.h"
#include "services/enrichment/enrichment_service.h"
#include "data_model/enriched_city.h"
#include "data_model/generated_brewery.h"
#include "data_model/location.h"
#include "services/enrichment_service.h"
#include "services/export_service.h"
/**
* @brief Main data generator class for the Biergarten pipeline.
@@ -32,8 +34,7 @@ class BiergartenDataGenerator {
*/
BiergartenDataGenerator(std::unique_ptr<IEnrichmentService> context_service,
std::unique_ptr<DataGenerator> generator,
std::unique_ptr<IExportService> exporter,
const ApplicationOptions& application_options);
std::unique_ptr<IExportService> exporter);
/**
* @brief Run the data generation pipeline.
@@ -57,14 +58,12 @@ class BiergartenDataGenerator {
/// @brief Storage backend for generated brewery records.
std::unique_ptr<IExportService> exporter_;
const ApplicationOptions application_options_;
/**
* @brief Load locations from JSON and sample cities.
*
* @return Vector of sampled locations capped at 50 entries.
*/
std::vector<Location> QueryCitiesWithCountries();
static std::vector<Location> QueryCitiesWithCountries();
/**
* @brief Generate breweries for enriched cities.

View File

@@ -8,7 +8,9 @@
#include <string>
#include "data_model/generated_models.h"
#include "data_model/brewery_result.h"
#include "data_model/location.h"
#include "data_model/user_result.h"
/**
* @brief Interface for data generator implementations.

View File

@@ -14,10 +14,10 @@
#include <string>
#include <string_view>
#include "../services/prompting/prompt_directory.h"
#include "data_generation/data_generator.h"
#include "data_generation/prompt_formatting/prompt_formatter.h"
#include "data_model/models.h"
#include "data_model/application_options.h"
#include "services/prompt_directory.h"
struct llama_model;
struct llama_context;
@@ -129,7 +129,6 @@ class LlamaGenerator final : public DataGenerator {
uint32_t sampling_top_k_ = kDefaultSamplingTopK;
std::mt19937 rng_;
uint32_t n_ctx_ = kDefaultContextSize;
int n_gpu_layers_ = 0;
std::unique_ptr<IPromptFormatter> prompt_formatter_;
std::unique_ptr<IPromptDirectory> prompt_directory_;
};

View File

@@ -12,7 +12,7 @@
#include <string>
#include <string_view>
#include "data_model/generated_models.h"
#include "data_model/brewery_result.h"
struct llama_vocab;
using llama_token = int32_t;

View File

@@ -1,5 +1,4 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_GEMMA4_JINJA_PROMPT_FORMATTER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_GEMMA4_JINJA_PROMPT_FORMATTER_H_
#pragma once
#include <string>
#include <string_view>
@@ -14,5 +13,3 @@ class Gemma4JinjaPromptFormatter final : public IPromptFormatter {
[[nodiscard]] std::string Format(std::string_view system_prompt,
std::string_view user_prompt) const override;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_GEMMA4_JINJA_PROMPT_FORMATTER_H_

View File

@@ -1,5 +1,4 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_PROMPT_FORMATTER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_PROMPT_FORMATTER_H_
#pragma once
#include <string>
#include <string_view>
@@ -16,5 +15,3 @@ class IPromptFormatter {
[[nodiscard]] virtual std::string Format(
std::string_view system_prompt, std::string_view user_prompt) const = 0;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_GENERATION_PROMPT_FORMATTING_PROMPT_FORMATTER_H_

View File

@@ -0,0 +1,72 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_APPLICATION_OPTIONS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_APPLICATION_OPTIONS_H_
/**
* @file data_model/application_options.h
* @brief Program options for the Biergarten pipeline application.
*/
#include <cstdint>
#include <filesystem>
#include <optional>
#include <string>
/**
* @brief LLM sampling parameters.
*/
struct SamplingOptions {
/// @brief LLM sampling temperature (0.0 to 1.0, higher = more random).
float temperature = 1.0F;
/// @brief LLM nucleus sampling top-p parameter.
float top_p = 0.95F;
/// @brief LLM top-k sampling parameter.
uint32_t top_k = 64;
/// @brief Context window size (tokens).
uint32_t n_ctx = 8192;
/// @brief Random seed (-1 for random, otherwise non-negative).
int seed = -1;
};
/**
* @brief Configuration for the LLM generator component.
*/
struct GeneratorOptions {
/// @brief Path to the LLM model file (gguf format).
std::filesystem::path model_path;
/// @brief Use mocked generator instead of actual LLM inference.
bool use_mocked = false;
/// @brief Specific sampling parameters for this generator.
/// If nullopt, the application should use global defaults.
std::optional<SamplingOptions> sampling;
};
/**
* @brief Configuration for the pipeline execution and output.
*/
struct PipelineOptions {
/// @brief Directory for generated artifacts.
std::filesystem::path output_path;
/// @brief Directory that contains named prompt files (e.g.
/// BREWERY_GENERATION.md).
std::filesystem::path prompt_dir;
/// @brief Path for application logs.
std::filesystem::path log_path;
};
/**
* @brief Root configuration object for the Biergarten pipeline.
*/
struct ApplicationOptions {
GeneratorOptions generator;
PipelineOptions pipeline;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_APPLICATION_OPTIONS_H_

View File

@@ -0,0 +1,22 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_LOCATION_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_LOCATION_H_
/**
* @file data_model/brewery_location.h
* @brief Non-owning brewery location input.
*/
#include <string_view>
/**
* @brief Non-owning brewery location input.
*/
struct BreweryLocation {
/// @brief City name.
std::string_view city_name;
/// @brief Country name.
std::string_view country_name;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_LOCATION_H_

View File

@@ -0,0 +1,28 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_RESULT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_RESULT_H_
/**
* @file data_model/brewery_result.h
* @brief Generated brewery payload.
*/
#include <string>
/**
* @brief Generated brewery payload.
*/
struct BreweryResult {
/// @brief Brewery display name in English.
std::string name_en;
/// @brief Brewery description text in English.
std::string description_en;
/// @brief Brewery display name in the local language.
std::string name_local;
/// @brief Brewery description text in the local language.
std::string description_local;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_BREWERY_RESULT_H_

View File

@@ -0,0 +1,21 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_ENRICHED_CITY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_ENRICHED_CITY_H_
/**
* @file data_model/enriched_city.h
* @brief Enriched city data with Wikipedia context.
*/
#include <string>
#include "data_model/location.h"
/**
* @brief Enriched city data with Wikipedia context.
*/
struct EnrichedCity {
Location location;
std::string region_context{};
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_ENRICHED_CITY_H_

View File

@@ -0,0 +1,20 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_BREWERY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_BREWERY_H_
/**
* @file data_model/generated_brewery.h
* @brief Helper struct to store generated brewery data.
*/
#include "data_model/brewery_result.h"
#include "data_model/location.h"
/**
* @brief Helper struct to store generated brewery data.
*/
struct GeneratedBrewery {
Location location;
BreweryResult brewery;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_BREWERY_H_

View File

@@ -1,66 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_MODELS_H_
/**
* @file data_model/generated_models.h
* @brief Generated output models from the pipeline: brewery/user results, enriched data,
* and complete generation results.
*/
#include <string>
#include "data_model/models.h"
// ============================================================================
// Generation Output Models
// ============================================================================
/**
* @brief Generated brewery payload.
*/
struct BreweryResult {
/// @brief Brewery display name in English.
std::string name_en;
/// @brief Brewery description text in English.
std::string description_en;
/// @brief Brewery display name in the local language.
std::string name_local;
/// @brief Brewery description text in the local language.
std::string description_local;
};
/**
* @brief Generated user profile payload.
*/
struct UserResult {
/// @brief Username handle.
std::string username{};
/// @brief Short user biography.
std::string bio{};
};
// ============================================================================
// Pipeline Data Models
// ============================================================================
/**
* @brief Enriched city data with Wikipedia context.
*/
struct EnrichedCity {
Location location;
std::string region_context{};
};
/**
* @brief Helper struct to store generated brewery data.
*/
struct GeneratedBrewery {
Location location;
BreweryResult brewery;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATED_MODELS_H_

View File

@@ -0,0 +1,13 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATION_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATION_MODELS_H_
/**
* @file data_model/generation_models.h
* @brief Convenience include for shared generation payload models.
*/
#include "data_model/brewery_location.h"
#include "data_model/brewery_result.h"
#include "data_model/user_result.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_GENERATION_MODELS_H_

View File

@@ -0,0 +1,41 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_LOCATION_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_LOCATION_H_
/**
* @file data_model/location.h
* @brief Location data model used throughout generation pipeline.
*/
#include <string>
#include <vector>
/**
* @brief Canonical location record for city-level generation.
*/
struct Location {
/// @brief City name.
std::string city{};
/// @brief State or province name.
std::string state_province{};
/// @brief ISO 3166-2 subdivision code.
std::string iso3166_2{};
/// @brief Country name.
std::string country{};
/// @brief ISO 3166-1 country code.
std::string iso3166_1{};
/// @brief Local language codes in priority order.
std::vector<std::string> local_languages{};
/// @brief Latitude in decimal degrees.
double latitude{};
/// @brief Longitude in decimal degrees.
double longitude{};
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_LOCATION_H_

View File

@@ -1,141 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_MODELS_H_
/**
* @file data_model/models.h
* @brief Core data models: locations, application configuration, and generation
* inputs.
*/
#include <boost/program_options.hpp>
#include <cstdint>
#include <filesystem>
#include <optional>
#include <string>
#include <string_view>
#include <vector>
namespace prog_opts = boost::program_options;
// ============================================================================
// Location Models
// ============================================================================
/**
* @brief Canonical location record for city-level generation.
*/
struct Location {
/// @brief City name.
std::string city{};
/// @brief State or province name.
std::string state_province{};
/// @brief ISO 3166-2 subdivision code.
std::string iso3166_2{};
/// @brief Country name.
std::string country{};
/// @brief ISO 3166-1 country code.
std::string iso3166_1{};
/// @brief Local language codes in priority order.
std::vector<std::string> local_languages{};
/// @brief Latitude in decimal degrees.
double latitude{};
/// @brief Longitude in decimal degrees.
double longitude{};
};
/**
* @brief Non-owning brewery location input.
*/
struct BreweryLocation {
/// @brief City name.
std::string_view city_name;
/// @brief Country name.
std::string_view country_name;
};
// ============================================================================
// Configuration Models
// ============================================================================
/**
* @brief LLM sampling parameters.
*/
struct SamplingOptions {
/// @brief LLM sampling temperature (0.0 to 1.0, higher = more random).
float temperature = 1.0F;
/// @brief LLM nucleus sampling top-p parameter.
float top_p = 0.95F;
/// @brief LLM top-k sampling parameter.
uint32_t top_k = 64;
/// @brief Context window size (tokens).
uint32_t n_ctx = 8192;
/// @brief Random seed (-1 for random, otherwise non-negative).
int seed = -1;
/// @brief Number of layers to offload to GPU.
int n_gpu_layers = 0;
};
/**
* @brief Configuration for the LLM generator component.
*/
struct GeneratorOptions {
/// @brief Path to the LLM model file (gguf format).
std::filesystem::path model_path;
/// @brief Use mocked generator instead of actual LLM inference.
bool use_mocked = false;
/// @brief Specific sampling parameters for this generator.
/// If nullopt, the application should use global defaults.
std::optional<SamplingOptions> sampling;
};
/**
* @brief Configuration for the pipeline execution and output.
*/
struct PipelineOptions {
/// @brief Directory for generated artifacts.
std::filesystem::path output_path;
/// @brief Directory that contains named prompt files (e.g.
/// BREWERY_GENERATION.md).
std::filesystem::path prompt_dir;
/// @brief Path for application logs.
std::filesystem::path log_path;
/// @brief Number of locations to sample from the dataset
/// More locations -> more users/more breweries
uint32_t location_count;
};
/**
* @brief Root configuration object for the Biergarten pipeline.
*/
struct ApplicationOptions {
GeneratorOptions generator;
PipelineOptions pipeline;
};
// ============================================================================
// Function Declarations
// ============================================================================
std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv);
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_MODELS_H_

View File

@@ -0,0 +1,12 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_PIPELINE_MODELS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_PIPELINE_MODELS_H_
/**
* @file data_model/pipeline_models.h
* @brief Convenience include for pipeline-specific data models.
*/
#include "data_model/enriched_city.h"
#include "data_model/generated_brewery.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_PIPELINE_MODELS_H_

View File

@@ -0,0 +1,22 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_USER_RESULT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_USER_RESULT_H_
/**
* @file data_model/user_result.h
* @brief Generated user profile payload.
*/
#include <string>
/**
* @brief Generated user profile payload.
*/
struct UserResult {
/// @brief Username handle.
std::string username{};
/// @brief Short user biography.
std::string bio{};
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_DATA_MODEL_USER_RESULT_H_

View File

@@ -9,7 +9,7 @@
#include <filesystem>
#include <vector>
#include "data_model/models.h"
#include "data_model/location.h"
/// @brief Loads curated world locations from a JSON file into memory.
class JsonLoader {

View File

@@ -1,10 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_HELPERS_H_
/* Umbrella header for backward compatibility. */
#include "sqlite_connection_helpers.h"
#include "sqlite_handle_types.h"
#include "sqlite_statement_helpers.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_HELPERS_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_DATE_TIME_PROVIDER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_DATE_TIME_PROVIDER_H_
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATE_TIME_PROVIDER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATE_TIME_PROVIDER_H_
/**
* @file services/date_time_provider.h
@@ -63,4 +63,4 @@ class SystemDateTimeProvider final : public IDateTimeProvider {
}
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_DATE_TIME_PROVIDER_H_
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATE_TIME_PROVIDER_H_

View File

@@ -1,35 +0,0 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_TIMER_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_TIMER_H_
#include <chrono>
/**
* @file services/timer.h
* @brief Simple timer utility for measuring elapsed time.
*/
class Timer {
std::chrono::steady_clock::time_point start_time =
std::chrono::steady_clock::now();
public:
Timer(const Timer&) = delete;
Timer& operator=(const Timer&) = delete;
Timer(Timer&&) = delete;
Timer& operator=(Timer&&) = delete;
Timer() = default;
~Timer() = default;
[[nodiscard]] int64_t Elapsed() const {
return std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::steady_clock::now() - start_time)
.count();
}
[[nodiscard]] int64_t Reset() {
auto previous_elapsed = Elapsed();
start_time = std::chrono::steady_clock::now();
return previous_elapsed;
}
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATETIME_TIMER_H_

View File

@@ -1,17 +0,0 @@
//
// Created by aaronpo on 13/05/2026.
//
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_MOCK_ENRICHMENT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_MOCK_ENRICHMENT_H_
#include <string>
#include "enrichment_service.h"
class MockEnrichmentService final : public IEnrichmentService {
public:
std::string GetLocationContext(const Location& /*loc*/) override {
return {};
}
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_MOCK_ENRICHMENT_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_ENRICHMENT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_ENRICHMENT_SERVICE_H_
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_SERVICE_H_
/**
* @file services/enrichment_service.h
@@ -8,7 +8,7 @@
#include <string>
#include "data_model/models.h"
#include "data_model/location.h"
/**
* @brief Interface for services that can enrich a location with context.
@@ -27,4 +27,4 @@ class IEnrichmentService {
virtual std::string GetLocationContext(const Location& loc) = 0;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_ENRICHMENT_SERVICE_H_
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_SERVICE_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_EXPORT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_EXPORT_SERVICE_H_
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_EXPORT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_EXPORT_SERVICE_H_
/**
* @file services/export_service.h
@@ -8,7 +8,7 @@
#include <cstdint>
#include "data_model/generated_models.h"
#include "data_model/generated_brewery.h"
/**
* @brief Interface for services that persist generated brewery records.
@@ -39,4 +39,4 @@ class IExportService {
virtual void Finalize() = 0;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_EXPORT_SERVICE_H_
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_EXPORT_SERVICE_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPTING_PROMPT_DIRECTORY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPTING_PROMPT_DIRECTORY_H_
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPT_DIRECTORY_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPT_DIRECTORY_H_
/**
* @file services/prompt_directory.h
@@ -73,4 +73,4 @@ class PromptDirectory final : public IPromptDirectory {
std::unordered_map<std::string, std::string> cache_;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPTING_PROMPT_DIRECTORY_H_
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_PROMPT_DIRECTORY_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_CONNECTION_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_CONNECTION_HELPERS_H_
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_CONNECTION_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_CONNECTION_HELPERS_H_
/**
* @file services/sqlite_connection_helpers.h
@@ -12,7 +12,7 @@
#include <string>
#include <string_view>
#include "sqlite_handle_types.h"
#include "services/sqlite_handle_types.h"
namespace sqlite_export_service_internal {
@@ -27,4 +27,4 @@ void RollbackTransactionNoThrow(const SqliteDatabaseHandle& db_handle) noexcept;
} // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_CONNECTION_HELPERS_H_
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_CONNECTION_HELPERS_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_H_
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_H_
/**
* @file services/sqlite_export_service.h
@@ -11,10 +11,10 @@
#include <string>
#include <unordered_map>
#include "data_model/models.h"
#include "../datetime/date_time_provider.h"
#include "export_service.h"
#include "sqlite_export_service_helpers.h"
#include "data_model/application_options.h"
#include "services/date_time_provider.h"
#include "services/export_service.h"
#include "services/sqlite_export_service_helpers.h"
/**
* @brief Persists generated brewery records into a fresh SQLite database.
@@ -57,4 +57,4 @@ class SqliteExportService final : public IExportService {
std::unordered_map<std::string, sqlite3_int64> location_cache_;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_EXPORT_SERVICE_H_
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_H_

View File

@@ -0,0 +1,10 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_HELPERS_H_
/* Umbrella header for backward compatibility. */
#include "services/sqlite_connection_helpers.h"
#include "services/sqlite_handle_types.h"
#include "services/sqlite_statement_helpers.h"
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_EXPORT_SERVICE_HELPERS_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_HANDLE_TYPES_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_HANDLE_TYPES_H_
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_HANDLE_TYPES_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_HANDLE_TYPES_H_
/**
* Shared handle and parameter type declarations used by SQLite helper units.
@@ -33,4 +33,4 @@ struct BindParam {
} // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_HANDLE_TYPES_H_
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_HANDLE_TYPES_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_STATEMENT_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_STATEMENT_HELPERS_H_
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_STATEMENT_HELPERS_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_STATEMENT_HELPERS_H_
/**
* @file services/sqlite_statement_helpers.h
@@ -13,7 +13,7 @@
#include <string_view>
#include <vector>
#include "sqlite_handle_types.h"
#include "services/sqlite_handle_types.h"
namespace sqlite_export_service_internal {
@@ -113,4 +113,4 @@ std::string SerializeVector(const std::vector<std::string>& str_vec);
} // namespace sqlite_export_service_internal
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_DATABASE_SQLITE_STATEMENT_HELPERS_H_
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_SQLITE_STATEMENT_HELPERS_H_

View File

@@ -1,5 +1,5 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_WIKIPEDIA_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_WIKIPEDIA_SERVICE_H_
#ifndef BIERGARTEN_PIPELINE_INCLUDES_SERVICES_WIKIPEDIA_SERVICE_H_
#define BIERGARTEN_PIPELINE_INCLUDES_SERVICES_WIKIPEDIA_SERVICE_H_
/**
* @file services/wikipedia_service.h
@@ -11,14 +11,14 @@
#include <string_view>
#include <unordered_map>
#include "enrichment_service.h"
#include "services/enrichment_service.h"
#include "web_client/web_client.h"
/// @brief Provides Wikipedia summary lookups backed by cached raw extracts.
class WikipediaEnrichmentService final : public IEnrichmentService {
class WikipediaService final : public IEnrichmentService {
public:
/// @brief Creates a new Wikipedia service with the provided web client.
explicit WikipediaEnrichmentService(std::unique_ptr<WebClient> client);
explicit WikipediaService(std::unique_ptr<WebClient> client);
/// @brief Returns the Wikipedia-derived context for a location.
[[nodiscard]] std::string GetLocationContext(const Location& loc) override;
@@ -30,4 +30,4 @@ class WikipediaEnrichmentService final : public IEnrichmentService {
std::unordered_map<std::string, std::string> extract_cache_;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_ENRICHMENT_WIKIPEDIA_SERVICE_H_
#endif // BIERGARTEN_PIPELINE_INCLUDES_SERVICES_WIKIPEDIA_SERVICE_H_

View File

@@ -0,0 +1,54 @@
#ifndef BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_CURL_WEB_CLIENT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_CURL_WEB_CLIENT_H_
/**
* @file web_client/curl_web_client.h
* @brief libcurl-based WebClient implementation.
*/
#include "web_client/web_client.h"
/**
* @brief RAII wrapper for curl_global_init and curl_global_cleanup.
*
* Create one instance in application startup before using libcurl and keep it
* alive for application lifetime.
*/
class CurlGlobalState {
public:
/// @brief Initializes global libcurl state.
CurlGlobalState();
/// @brief Cleans up global libcurl state.
~CurlGlobalState();
/// @brief Non-copyable type.
CurlGlobalState(const CurlGlobalState&) = delete;
/// @brief Non-copyable type.
CurlGlobalState& operator=(const CurlGlobalState&) = delete;
};
/**
* @brief WebClient implementation backed by libcurl.
*/
class CURLWebClient : public WebClient {
public:
/**
* @brief Executes an HTTP GET request.
*
* @param url Request URL.
* @return Response body.
*/
std::string Get(const std::string& url) override;
/**
* @brief URL-encodes a string value.
*
* @param value Raw value.
* @return URL-encoded string.
*/
std::string UrlEncode(const std::string& value) override;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_CURL_WEB_CLIENT_H_

View File

@@ -1,49 +0,0 @@
/**
* @file web_client/http_web_client.h
* @brief cpp-httplib implementation of the WebClient interface.
*/
#ifndef BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_HTTP_WEB_CLIENT_H_
#define BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_HTTP_WEB_CLIENT_H_
#include "web_client/web_client.h"
#include <string>
/**
* @brief WebClient implementation backed by cpp-httplib.
*
* Supports HTTP and HTTPS (requires OpenSSL; see HTTPLIB_REQUIRE_OPENSSL
* in CMakeLists.txt).
*
* URL parsing splits a full URL into origin (scheme://host[:port]) and
* path + query so that httplib::Client can be constructed correctly.
* A new client instance is created per request because the client is
* bound to a single origin at construction time.
*/
class HttpWebClient final : public WebClient {
public:
HttpWebClient() = default;
~HttpWebClient() override = default;
/**
* @brief Executes a blocking HTTP/HTTPS GET request against a full URL.
*
* @param url Fully-qualified URL, e.g. "https://en.wikipedia.org/api/rest_v1/page/summary/Berlin"
* @return Response body on HTTP 2xx; throws std::runtime_error otherwise.
*/
std::string Get(const std::string& url) override;
/**
* @brief Percent-encodes a single URI component (query parameter value or
* path segment). Delegates to httplib::encode_uri_component().
*
* @param value Raw string to encode.
* @return Percent-encoded string safe for use in a URL.
*/
std::string EncodeURL(const std::string& value) override;
};
#endif

View File

@@ -30,7 +30,7 @@ class WebClient {
* @param value Raw string value.
* @return Encoded value safe for URL usage.
*/
virtual std::string EncodeURL(const std::string& value) = 0;
virtual std::string UrlEncode(const std::string& value) = 0;
};
#endif // BIERGARTEN_PIPELINE_INCLUDES_WEB_CLIENT_WEB_CLIENT_H_

View File

@@ -1,9 +0,0 @@
# Ignore model files!
*.gguf
*.bin
models/
weights/
# Ignore local build folders
build/
.git/

View File

@@ -1,72 +0,0 @@
# --- Stage 1: Build Environment (The "Heavy" Stage) ---
FROM nvidia/cuda:12.6.3-devel-ubuntu24.04 AS builder
ENV DEBIAN_FRONTEND=noninteractive \
CMAKE_GENERATOR=Ninja
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential ca-certificates curl git libboost-json-dev \
libboost-program-options-dev libssl-dev ninja-build pkg-config zlib1g-dev \
&& rm -rf /var/lib/apt/lists/*
# Install modern CMake
RUN curl -L https://github.com/Kitware/CMake/releases/download/v3.31.0/cmake-3.31.0-linux-x86_64.sh -o cmake.sh && \
sh cmake.sh --skip-license --prefix=/usr/local && rm cmake.sh
# Get headers for C++ build
RUN curl -L https://github.com/ggml-org/llama.cpp/archive/refs/tags/b9012.tar.gz -o /tmp/llama-src.tar.gz && \
tar -xzf /tmp/llama-src.tar.gz -C /tmp && \
cp -r /tmp/llama.cpp-b9012/include/* /usr/local/include/ && \
cp -r /tmp/llama.cpp-b9012/ggml/include/* /usr/local/include/
# Pull llama.cpp binaries to use during build if needed
COPY --from=ghcr.io/ggml-org/llama.cpp:full-cuda /app/lib*.so* /usr/local/lib/
WORKDIR /app
COPY . .
# Build the C++ pipeline
RUN cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && \
cmake --build build -j$(nproc)
# --- Stage 2: Runtime Environment (The "Slim" Stage) ---
FROM nvidia/cuda:12.6.3-runtime-ubuntu24.04 AS runtime
# Install only necessary runtime shared libraries
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
ca-certificates \
libboost-json1.83.0 \
libboost-program-options1.83.0 \
libgomp1 \
libssl3 \
zlib1g \
&& rm -rf /var/lib/apt/lists/*
ENV APP_ROOT=/app \
LD_LIBRARY_PATH="/usr/local/lib:${LD_LIBRARY_PATH}"
WORKDIR /app/build
# Copy only the compiled binaries from the builder
COPY --from=builder /app/build/biergarten-pipeline ./
# Copy required config files
COPY locations.json /app/build/
COPY beer-styles.json /app/build/
# Copy prompt templates
COPY prompts /app/prompts
# Copy only the necessary shared libraries from builder/llama-bin
COPY --from=ghcr.io/ggml-org/llama.cpp:full-cuda /app/lib*.so* /usr/local/lib/
# Co-locate plugins
RUN cp /usr/local/lib/libggml-cuda.so . 2>/dev/null || true && \
cp /usr/local/lib/libggml-cpu*.so . 2>/dev/null || true
# Setup Start Script
COPY ./runpod/start.sh /usr/local/bin/biergarten-start
RUN chmod +x /usr/local/bin/biergarten-start
ENTRYPOINT ["/usr/local/bin/biergarten-start"]

View File

@@ -1,8 +0,0 @@
```bash
touch runpod/start.sh
docker build \
--progress=plain \
-t biergarten-pipeline:latest \
-f runpod/Dockerfile \
. 2>&1 | tee build.log
```

View File

@@ -1,22 +0,0 @@
name: biergarten-pipeline-live
imageName: biergarten-pipeline:latest
category: NVIDIA
containerDiskInGb: 50
volumeInGb: 50
volumeMountPath: /workspace
dockerEntrypoint:
- /usr/local/bin/biergarten-start
dockerStartCmd: []
isPublic: false
isServerless: false
env:
BIERGARTEN_MODE: live
BIERGARTEN_MODEL_PATH: /workspace/models/google_gemma-4-E4B-it-Q6_K.gguf
BIERGARTEN_PROMPT_DIR: /workspace/app/build/prompts
BIERGARTEN_OUTPUT_DIR: /workspace/output
BIERGARTEN_LOG_PATH: /workspace/logs/pipeline.log
BIERGARTEN_TEMPERATURE: "1.0"
BIERGARTEN_TOP_P: "0.95"
BIERGARTEN_TOP_K: "64"
BIERGARTEN_N_CTX: "8192"
BIERGARTEN_SEED: "-1"

View File

@@ -1,58 +0,0 @@
#!/bin/bash
set -e
MODEL_PATH="${BIERGARTEN_MODEL_PATH:-/workspace/models/google_gemma-4-E4B-it-Q6_K.gguf}"
OUTPUT_DIR="${BIERGARTEN_OUTPUT_DIR:-/workspace/output}"
LOG_PATH="${BIERGARTEN_LOG_PATH:-/workspace/logs/pipeline.log}"
EXECUTABLE="/app/build/biergarten-pipeline"
PROMPT_DIR="/app/prompts"
echo "--- Starting Biergarten Pipeline Environment Check ---"
# Ensure directories exist
mkdir -p "$OUTPUT_DIR"
mkdir -p "$(dirname "$LOG_PATH")"
mkdir -p "$(dirname "$MODEL_PATH")"
# Download model if missing
if [ ! -f "$MODEL_PATH" ]; then
echo "Model not found. Downloading (this may take a while)..."
curl -L -C - \
-o "$MODEL_PATH" \
"https://huggingface.co/bartowski/google_gemma-4-E4B-it-GGUF/resolve/main/google_gemma-4-E4B-it-Q6_K.gguf?download=true"
echo "Download complete."
fi
# Verify model exists
if [ ! -f "$MODEL_PATH" ]; then
echo "ERROR: Model still not found after download attempt."
exit 1
fi
# Default GPU layers
GL_LAYERS="${BIERGARTEN_GL_LAYERS:-40}"
# Build args
ARGS=(
"--model" "$MODEL_PATH"
"--prompt-dir" "$PROMPT_DIR"
"--output" "$OUTPUT_DIR"
"--log-path" "$LOG_PATH"
"--n-gpu-layers" "$GL_LAYERS"
)
# Optional params
[[ -n "$BIERGARTEN_TEMPERATURE" ]] && ARGS+=("--temperature" "$BIERGARTEN_TEMPERATURE")
[[ -n "$BIERGARTEN_TOP_P" ]] && ARGS+=("--top-p" "$BIERGARTEN_TOP_P")
[[ -n "$BIERGARTEN_TOP_K" ]] && ARGS+=("--top-k" "$BIERGARTEN_TOP_K")
[[ -n "$BIERGARTEN_N_CTX" ]] && ARGS+=("--n-ctx" "$BIERGARTEN_N_CTX")
[[ -n "$BIERGARTEN_SEED" ]] && ARGS+=("--seed" "$BIERGARTEN_SEED")
# Extra args
[[ -n "$BIERGARTEN_EXTRA_ARGS" ]] && ARGS+=($BIERGARTEN_EXTRA_ARGS)
echo "--- Executing: $EXECUTABLE ${ARGS[*]} ---"
exec "$EXECUTABLE" "${ARGS[@]}"

View File

@@ -1,158 +0,0 @@
#include <spdlog/spdlog.h>
#include <optional>
#include <sstream>
#include <string>
#include "data_model/models.h"
std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv) {
prog_opts::options_description desc("Pipeline Options");
auto opt = desc.add_options();
opt("help,h", "Produce help message");
// Defaults sourced from SamplingOptions{} so the CLI and LlamaGenerator
// share a single source of truth — changing the struct updates both.
auto add_sampling_options = [&]() -> void {
const SamplingOptions sampling_defaults{};
opt("temperature",
prog_opts::value<float>()->default_value(sampling_defaults.temperature),
"Sampling temperature (higher = more random)");
opt("top-p",
prog_opts::value<float>()->default_value(sampling_defaults.top_p),
"Nucleus sampling top-p in (0,1] (higher = more random)");
opt("top-k",
prog_opts::value<uint32_t>()->default_value(sampling_defaults.top_k),
"Top-k sampling parameter (higher = more candidate tokens)");
opt("n-ctx",
prog_opts::value<uint32_t>()->default_value(sampling_defaults.n_ctx),
"Context window size in tokens");
opt("seed", prog_opts::value<int>()->default_value(sampling_defaults.seed),
"Sampler seed: -1 for random, otherwise non-negative integer");
opt("n-gpu-layers", prog_opts::value<int>()->default_value(0),
"Number of layers to offload to GPU");
};
// --mocked and --model are mutually exclusive; validation is enforced below
// rather than at registration to produce a clear diagnostic message.
auto add_generator_options = [&]() -> void {
opt("mocked", prog_opts::bool_switch(),
"Use mocked generator for brewery/user data");
opt("model,m", prog_opts::value<std::string>()->default_value(""),
"Path to LLM model (gguf)");
};
auto add_pipeline_options = [&]() -> void {
opt("output,o", prog_opts::value<std::string>()->default_value("output"),
"Directory for generated artifacts");
opt("log-path",
prog_opts::value<std::string>()->default_value("pipeline.log"),
"Path for application logs");
opt("prompt-dir", prog_opts::value<std::string>()->default_value(""),
"Directory containing named prompt files (e.g. BREWERY_GENERATION.md)."
" Required when not using --mocked.");
opt("location-count", prog_opts::value<uint32_t>()->default_value(10));
};
add_sampling_options();
add_generator_options();
add_pipeline_options();
// No flags provided — treat as a help request rather than an error.
if (argc == 1) {
spdlog::info("Biergarten Pipeline");
std::stringstream usage_stream;
usage_stream << "\nUsage: biergarten-pipeline [options]\n\n" << desc;
spdlog::info(usage_stream.str());
return std::nullopt;
}
try {
prog_opts::variables_map var_map;
prog_opts::store(prog_opts::parse_command_line(argc, argv, desc), var_map);
prog_opts::notify(var_map);
if (var_map.contains("help")) {
std::stringstream help_stream;
help_stream << "\n" << desc;
spdlog::info(help_stream.str());
return std::nullopt;
}
ApplicationOptions options;
options.pipeline.output_path = var_map["output"].as<std::string>();
options.pipeline.log_path = var_map["log-path"].as<std::string>();
options.pipeline.prompt_dir = var_map["prompt-dir"].as<std::string>();
options.pipeline.location_count =
var_map["location-count"].as<uint32_t>();
const bool use_mocked = var_map["mocked"].as<bool>();
const std::string model_path = var_map["model"].as<std::string>();
const int n_gpu_layers = var_map["n-gpu-layers"].as<int>();
// Enforce mutual exclusivity before any further configuration is applied.
if (use_mocked && !model_path.empty()) {
spdlog::error(
"Invalid arguments: --mocked and --model are mutually exclusive");
return std::nullopt;
}
if (!use_mocked && model_path.empty()) {
spdlog::error(
"Invalid arguments: either --mocked or --model must be specified");
return std::nullopt;
}
// Prompt directory is only meaningful for live inference — the mock
// generator has no use for it and should not require it to be present.
if (!use_mocked && options.pipeline.prompt_dir.empty()) {
spdlog::error(
"Invalid arguments: --prompt-dir is required when not using "
"--mocked");
return std::nullopt;
}
options.generator.use_mocked = use_mocked;
options.generator.model_path = model_path;
// options.generator.n_gpu_layers = n_gpu_layers;
// Only populate sampling config when the user explicitly overrides at
// least one value. Leaving it as std::nullopt lets LlamaGenerator fall
// back to its own SamplingOptions{} defaults, keeping the two paths
// consistent without redundant copies.
const bool user_provided_sampling =
!var_map["temperature"].defaulted() || !var_map["top-p"].defaulted() ||
!var_map["top-k"].defaulted() || !var_map["n-ctx"].defaulted() ||
!var_map["seed"].defaulted() || !var_map["n_gpu_layers"].defaulted();
if (user_provided_sampling) {
// Warn but do not fail — the run is still valid, the flags are just
// silently irrelevant when no model is loaded.
if (use_mocked) {
spdlog::warn("Sampling parameters are ignored when using --mocked");
} else {
SamplingOptions sampling;
sampling.temperature = var_map["temperature"].as<float>();
sampling.top_p = var_map["top-p"].as<float>();
sampling.top_k = var_map["top-k"].as<uint32_t>();
sampling.n_ctx = var_map["n-ctx"].as<uint32_t>();
sampling.seed = var_map["seed"].as<int>();
sampling.n_gpu_layers = var_map["n-gpu-layers"].as<int>();
options.generator.sampling = sampling;
}
}
return options;
} catch (const std::exception& exception) {
spdlog::error("Failed to parse command-line arguments: {}",
exception.what());
return std::nullopt;
} catch (...) {
spdlog::error("Failed to parse command-line arguments: unknown error");
return std::nullopt;
}
}

View File

@@ -10,9 +10,7 @@
BiergartenDataGenerator::BiergartenDataGenerator(
std::unique_ptr<IEnrichmentService> context_service,
std::unique_ptr<DataGenerator> generator,
std::unique_ptr<IExportService> exporter,
const ApplicationOptions &app_options)
std::unique_ptr<IExportService> exporter)
: context_service_(std::move(context_service)),
generator_(std::move(generator)),
exporter_(std::move(exporter)),
application_options_(app_options) {}
exporter_(std::move(exporter)) {}

View File

@@ -13,6 +13,8 @@
#include "biergarten_data_generator.h"
#include "json_handling/json_loader.h"
static constexpr size_t kBreweryAmount = 50;
std::vector<Location> BiergartenDataGenerator::QueryCitiesWithCountries() {
spdlog::info("\n=== GEOGRAPHIC DATA OVERVIEW ===");
@@ -21,9 +23,7 @@ std::vector<Location> BiergartenDataGenerator::QueryCitiesWithCountries() {
auto all_locations = JsonLoader::LoadLocations(locations_path);
spdlog::info(" Locations available: {}", all_locations.size());
const size_t sample_count = std::min(
static_cast<size_t>(application_options_.pipeline.location_count),
all_locations.size());
const size_t sample_count = std::min(kBreweryAmount, all_locations.size());
const auto sample_count_signed =
static_cast<std::iter_difference_t<decltype(all_locations.cbegin())>>(

View File

@@ -21,8 +21,8 @@ bool BiergartenDataGenerator::Run() {
for (auto& city : cities) {
try {
std::string region_context = context_service_->GetLocationContext(city);
// spdlog::debug("[Pipeline] Context for '{}' ({}) gathered:\n{}",
// city.city, city.iso3166_2, region_context);
spdlog::debug("[Pipeline] Context for '{}' ({}) gathered:\n{}",
city.city, city.country, region_context);
enriched.push_back(
EnrichedCity{.location = std::move(city),

View File

@@ -11,7 +11,7 @@
#include <stdexcept>
#include <string>
#include "data_model/models.h"
#include "data_model/application_options.h"
#include "llama.h"
static constexpr uint32_t kMaxContextSize = 32768U;
@@ -89,7 +89,6 @@ LlamaGenerator::LlamaGenerator(
}
n_ctx_ = sampling.n_ctx;
n_gpu_layers_ = sampling.n_gpu_layers;
this->Load(model_path);
}

View File

@@ -12,7 +12,6 @@
#include <utility>
#include "data_generation/llama_generator.h"
#include "ggml-backend.h"
#include "llama.h"
// Maximum batch size for decode operations. Capping the batch prevents
@@ -23,12 +22,7 @@ void LlamaGenerator::Load(const std::string& model_path) {
context_.reset();
model_.reset();
// Specifically load dynamic ggml backends (like CUDA) that are provided
// externally before attempting to load a model.
ggml_backend_load_all();
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = n_gpu_layers_;
const llama_model_params model_params = llama_model_default_params();
LlamaGenerator::ModelHandle loaded_model(
llama_model_load_from_file(model_path.c_str(), model_params));
if (!loaded_model) {

View File

@@ -8,43 +8,178 @@
#include <boost/di.hpp>
#include <boost/program_options.hpp>
#include <chrono>
#include <cstdint>
#include <exception>
#include <memory>
#include <optional>
#include <sstream>
#include <string>
#include "biergarten_data_generator.h"
#include "data_generation/llama_generator.h"
#include "data_generation/mock_generator.h"
#include "data_generation/prompt_formatting/gemma4_jinja_prompt_formatter.h"
#include "data_model/models.h"
#include "data_model/application_options.h"
#include "llama_backend_state.h"
#include "services/database/export_service.h"
#include "services/database/sqlite_export_service.h"
#include "services/datetime/timer.h"
#include "services/enrichment/enrichment_service.h"
#include "services/enrichment/mock_enrichment.h"
#include "services/enrichment/wikipedia_service.h"
#include "services/prompting/prompt_directory.h"
#include "web_client/http_web_client.h"
#include "services/enrichment_service.h"
#include "services/export_service.h"
#include "services/prompt_directory.h"
#include "services/sqlite_export_service.h"
#include "services/wikipedia_service.h"
#include "web_client/curl_web_client.h"
namespace prog_opts = boost::program_options;
namespace di = boost::di;
/**
* @brief Parse command-line arguments into ApplicationOptions.
*
* @param argc Command-line argument count.
* @param argv Command-line arguments.
* @return Parsed ApplicationOptions if parsing succeeded, std::nullopt
* otherwise.
*/
std::optional<ApplicationOptions> ParseArguments(const int argc, char** argv) {
prog_opts::options_description desc("Pipeline Options");
auto opt = desc.add_options();
opt("help,h", "Produce help message");
// Generator Options
opt("mocked", prog_opts::bool_switch(),
"Use mocked generator for brewery/user data");
opt("model,m", prog_opts::value<std::string>()->default_value(""),
"Path to LLM model (gguf)");
// Sampling Options - defaults driven from SamplingOptions struct
const SamplingOptions kSamplingDefaults{};
opt("temperature",
prog_opts::value<float>()->default_value(kSamplingDefaults.temperature),
"Sampling temperature (higher = more random)");
opt("top-p",
prog_opts::value<float>()->default_value(kSamplingDefaults.top_p),
"Nucleus sampling top-p in (0,1] (higher = more random)");
opt("top-k",
prog_opts::value<uint32_t>()->default_value(kSamplingDefaults.top_k),
"Top-k sampling parameter (higher = more candidate tokens)");
opt("n-ctx",
prog_opts::value<uint32_t>()->default_value(kSamplingDefaults.n_ctx),
"Context window size in tokens");
opt("seed", prog_opts::value<int>()->default_value(kSamplingDefaults.seed),
"Sampler seed: -1 for random, otherwise non-negative integer");
// Pipeline Options
opt("output,o", prog_opts::value<std::string>()->default_value("output"),
"Directory for generated artifacts");
opt("log-path",
prog_opts::value<std::string>()->default_value("pipeline.log"),
"Path for application logs");
opt("prompt-dir", prog_opts::value<std::string>()->default_value(""),
"Directory containing named prompt files (e.g. BREWERY_GENERATION.md)."
" Required when not using --mocked.");
if (argc == 1) {
spdlog::info("Biergarten Pipeline");
std::stringstream usage_stream;
usage_stream << "\nUsage: biergarten-pipeline [options]\n\n" << desc;
spdlog::info(usage_stream.str());
return std::nullopt;
}
try {
prog_opts::variables_map var_map;
prog_opts::store(prog_opts::parse_command_line(argc, argv, desc), var_map);
prog_opts::notify(var_map);
if (var_map.contains("help")) {
std::stringstream help_stream;
help_stream << "\n" << desc;
spdlog::info(help_stream.str());
return std::nullopt;
}
ApplicationOptions options;
options.pipeline.output_path = var_map["output"].as<std::string>();
options.pipeline.log_path = var_map["log-path"].as<std::string>();
options.pipeline.prompt_dir = var_map["prompt-dir"].as<std::string>();
const bool use_mocked = var_map["mocked"].as<bool>();
const std::string model_path = var_map["model"].as<std::string>();
if (use_mocked && !model_path.empty()) {
spdlog::error(
"Invalid arguments: --mocked and --model are mutually exclusive");
return std::nullopt;
}
if (!use_mocked && model_path.empty()) {
spdlog::error(
"Invalid arguments: Either --mocked or --model must be specified");
return std::nullopt;
}
if (!use_mocked && options.pipeline.prompt_dir.empty()) {
spdlog::error(
"Invalid arguments: --prompt-dir is required when not using "
"--mocked");
return std::nullopt;
}
options.generator.use_mocked = use_mocked;
options.generator.model_path = model_path;
const bool user_provided_sampling =
!var_map["temperature"].defaulted() || !var_map["top-p"].defaulted() ||
!var_map["top-k"].defaulted() || !var_map["n-ctx"].defaulted() ||
!var_map["seed"].defaulted();
if (use_mocked) {
if (user_provided_sampling) {
spdlog::warn("Sampling parameters are ignored when using --mocked");
}
} else if (user_provided_sampling) {
SamplingOptions sampling;
sampling.temperature = var_map["temperature"].as<float>();
sampling.top_p = var_map["top-p"].as<float>();
sampling.top_k = var_map["top-k"].as<uint32_t>();
sampling.n_ctx = var_map["n-ctx"].as<uint32_t>();
sampling.seed = var_map["seed"].as<int>();
options.generator.sampling = sampling;
}
return options;
} catch (const std::exception& exception) {
spdlog::error("Failed to parse command-line arguments: {}",
exception.what());
return std::nullopt;
} catch (...) {
spdlog::error("Failed to parse command-line arguments: unknown error");
return std::nullopt;
}
}
struct Timer {
std::chrono::steady_clock::time_point start_time =
std::chrono::steady_clock::now();
[[nodiscard]] int64_t Elapsed() const {
return std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::steady_clock::now() - start_time)
.count();
}
};
int main(const int argc, char** argv) {
try {
Timer timer;
const CurlGlobalState curl_state;
const LlamaBackendState llama_backend_state;
spdlog::set_pattern("[%Y-%m-%d %H:%M:%S.%e] [%^%l%$] %v");
#ifndef BIERGARTEN_MOCK_ONLY
const LlamaBackendState llama_backend_state;
#endif
#ifdef DEBUG
spdlog::set_level(spdlog::level::debug);
#endif
const std::optional<ApplicationOptions> parsed_options =
ParseArguments(argc, argv);
const auto parsed_options = ParseArguments(argc, argv);
if (!parsed_options.has_value()) {
return 0;
}
@@ -66,20 +201,12 @@ int main(const int argc, char** argv) {
}
const auto injector = di::make_injector(
di::bind<WebClient>().to<CURLWebClient>(),
di::bind<ApplicationOptions>().to(options),
di::bind<std::string>().to(model_path),
di::bind<WebClient>().to<HttpWebClient>(),
di::bind<IEnrichmentService>().to<WikipediaService>(),
di::bind<IExportService>().to<SqliteExportService>(),
di::bind<IPromptFormatter>().to<Gemma4JinjaPromptFormatter>(),
di::bind<IEnrichmentService>().to(
[options](const auto& inj) -> std::unique_ptr<IEnrichmentService> {
if (options.generator.use_mocked) {
return std::make_unique<MockEnrichmentService>();
}
return std::make_unique<WikipediaEnrichmentService>(
inj.template create<std::unique_ptr<WebClient>>());
}),
di::bind<std::string>().to(model_path),
di::bind<DataGenerator>().to(
[options, model_path, sampling, &prompt_directory](
const auto& inj) -> std::unique_ptr<DataGenerator> {
@@ -98,11 +225,9 @@ int main(const int argc, char** argv) {
options, model_path,
inj.template create<std::unique_ptr<IPromptFormatter>>(),
std::move(prompt_directory));
})
}));
);
const auto generator =
auto generator =
injector.create<std::unique_ptr<BiergartenDataGenerator>>();
if (!generator->Run()) {

View File

@@ -1,112 +0,0 @@
/**
* @file wikipedia/fetch_extract.cc
*/
#include <spdlog/spdlog.h>
#include <boost/json.hpp>
#include <chrono>
#include <format>
#include <string>
#include <string_view>
#include <thread>
#include "services/enrichment/wikipedia_service.h"
using namespace boost;
std::string WikipediaEnrichmentService::FetchExtract(std::string_view query) {
const std::string cache_key(query);
// 1. Cache Lookup
if (const auto cache_it = this->extract_cache_.find(cache_key);
cache_it != this->extract_cache_.end()) {
spdlog::debug("Wikipedia: Cache hit for {}!", cache_key);
return cache_it->second;
}
const std::string encoded = this->client_->EncodeURL(cache_key);
const std::string url = std::format(
"https://en.wikipedia.org/w/"
"api.php?action=query&titles={}&prop=extracts&explaintext=1&format=json",
encoded);
const std::string body = this->client_->Get(url);
{
using namespace std::literals::chrono_literals;
std::this_thread::sleep_for(1s);
}
// 2. Parse JSON
system::error_code ec;
json::value doc = json::parse(body, ec);
if (ec) {
spdlog::warn("WikipediaService: JSON parse error for '{}': {}", query,
ec.message());
return {};
}
// 3. Safe Extraction
const json::object* obj = doc.if_object();
if (obj == nullptr) {
spdlog::warn("WikipediaService: Expected root object for '{}'", query);
return {};
}
const json::value* query_ptr = obj->if_contains("query");
const json::value* pages_ptr =
((query_ptr != nullptr) && query_ptr->is_object())
? query_ptr->get_object().if_contains("pages")
: nullptr;
if ((pages_ptr == nullptr) || !pages_ptr->is_object()) {
spdlog::warn("WikipediaService: Missing query.pages for '{}'", query);
return {};
}
const json::object& pages = pages_ptr->get_object();
if (pages.empty()) {
spdlog::warn("WikipediaService: No pages returned for '{}'", query);
this->extract_cache_.emplace(cache_key, "");
return {};
}
// Wikipedia returns the page under a dynamic ID key; we just want the first
// one
const json::value& page_val = pages.begin()->value();
if (!page_val.is_object()) {
spdlog::warn("WikipediaService: Unexpected page format for '{}'", query);
return {};
}
const json::object& page = page_val.get_object();
// Handle 404/Missing status
if (page.contains("missing")) {
spdlog::warn("WikipediaService: Page '{}' does not exist", query);
this->extract_cache_.emplace(cache_key, "");
return {};
}
const json::value* extract_ptr = page.if_contains("extract");
if ((extract_ptr == nullptr) || !extract_ptr->is_string()) {
spdlog::warn("WikipediaService: No extract string found for '{}'", query);
this->extract_cache_.emplace(cache_key, "");
return {};
}
// 4. Success
std::string extract(extract_ptr->as_string());
spdlog::info("WikipediaService: Fetched {} chars for '{}'", extract.size(),
query);
this->extract_cache_.insert_or_assign(cache_key, extract);
return extract;
}

View File

@@ -1,58 +0,0 @@
/**
* @file wikipedia/get_summary.cc
* @brief WikipediaService::GetLocationContext() implementation.
*/
#include <spdlog/spdlog.h>
#include <chrono>
#include <format>
#include <string>
#include <thread>
#include "services/enrichment/wikipedia_service.h"
std::string WikipediaEnrichmentService::GetLocationContext(const Location& loc) {
using namespace std::literals::chrono_literals;
if (!this->client_) {
spdlog::warn("Client is nullptr.");
return {};
}
std::string result;
// std::string region_query(loc.city);
// if (!loc.country.empty()) {
// region_query += loc.state_province,
// region_query += ", ";
// region_query += loc.country;
// }
constexpr std::string_view brewing_query = "brewing";
const std::string location_query =
std::format("{}, {}", loc.city, loc.iso3166_2);
const std::string beer_query = std::format("beer in {}", loc.country);
auto append_extract = [&result](const std::string& extract) -> void {
if (extract.empty()) {
return;
}
if (!result.empty()) {
result += "\n\n";
}
result += extract;
};
try {
append_extract(FetchExtract(brewing_query));
append_extract(FetchExtract(beer_query));
spdlog::info("Done fetching for {}. Sleeping for 10 seconds.",
location_query);
std::this_thread::sleep_for(10s);
} catch (const std::runtime_error& e) {
spdlog::debug("WikipediaService lookup failed for '{}': {}", location_query,
e.what());
}
return result;
}

View File

@@ -4,7 +4,7 @@
* construction and loads named prompt files on demand with in-process caching.
*/
#include "services/prompting/prompt_directory.h"
#include "services/prompt_directory.h"
#include <spdlog/spdlog.h>

View File

@@ -0,0 +1,23 @@
/**
* @file services/sqlite/build_database_path.cc
* @brief SqliteExportService::BuildDatabasePath() implementation.
*/
#include <filesystem>
#include <string>
#include "services/sqlite_export_service.h"
std::filesystem::path SqliteExportService::BuildDatabasePath() const {
std::filesystem::path base_filename("biergarten_seed_" + run_timestamp_utc_ +
".sqlite");
std::filesystem::path candidate = output_path_ / base_filename;
for (int suffix = 1; std::filesystem::exists(candidate); ++suffix) {
candidate = output_path_ /
std::filesystem::path("biergarten_seed_" + run_timestamp_utc_ +
"-" + std::to_string(suffix) + ".sqlite");
}
return candidate;
}

View File

@@ -5,8 +5,8 @@
#include <stdexcept>
#include "services/database/sqlite_export_service.h"
#include "services/database/sqlite_export_service_helpers.h"
#include "services/sqlite_export_service.h"
#include "services/sqlite_export_service_helpers.h"
void SqliteExportService::Finalize() {
if (db_handle_ == nullptr) {
@@ -28,4 +28,4 @@ void SqliteExportService::Finalize() {
RollbackAndCloseNoThrow();
throw;
}
}
}

View File

@@ -1,4 +1,4 @@
#include "services/database/sqlite_connection_helpers.h"
#include "services/sqlite_connection_helpers.h"
#include <stdexcept>

View File

@@ -1,4 +1,4 @@
#include "services/database/sqlite_statement_helpers.h"
#include "services/sqlite_statement_helpers.h"
#include <boost/json.hpp>
#include <cstring>
@@ -6,7 +6,7 @@
#include <memory>
#include <stdexcept>
#include "services/database/sqlite_connection_helpers.h"
#include "services/sqlite_connection_helpers.h"
namespace sqlite_export_service_internal {

View File

@@ -8,22 +8,8 @@
#include <stdexcept>
#include <string>
#include "services/database/sqlite_export_service.h"
#include "services/database/sqlite_export_service_helpers.h"
std::filesystem::path SqliteExportService::BuildDatabasePath() const {
std::filesystem::path base_filename("biergarten_seed_" + run_timestamp_utc_ +
".sqlite");
std::filesystem::path candidate = output_path_ / base_filename;
for (int suffix = 1; std::filesystem::exists(candidate); ++suffix) {
candidate = output_path_ /
std::filesystem::path("biergarten_seed_" + run_timestamp_utc_ +
"-" + std::to_string(suffix) + ".sqlite");
}
return candidate;
}
#include "services/sqlite_export_service.h"
#include "services/sqlite_export_service_helpers.h"
void SqliteExportService::InitializeSchema() const {
sqlite_export_service_internal::ExecSql(

View File

@@ -8,8 +8,8 @@
#include <stdexcept>
#include <string>
#include "services/database/sqlite_export_service.h"
#include "services/database/sqlite_export_service_helpers.h"
#include "services/sqlite_export_service.h"
#include "services/sqlite_export_service_helpers.h"
constexpr int kLocationPrecision = 17;

View File

@@ -3,7 +3,7 @@
* @brief SqliteExportService constructor and destructor implementation.
*/
#include "services/database/sqlite_export_service.h"
#include "services/sqlite_export_service.h"
#include <memory>

View File

@@ -0,0 +1,61 @@
/**
* @file wikipedia/fetch_extract.cc
* @brief WikipediaService::FetchExtract() implementation.
*/
#include <spdlog/spdlog.h>
#include <boost/json.hpp>
#include <string>
#include <string_view>
#include "services/wikipedia_service.h"
std::string WikipediaService::FetchExtract(std::string_view query) {
const std::string cache_key(query);
const auto cache_it = this->extract_cache_.find(cache_key);
if (cache_it != this->extract_cache_.end()) {
return cache_it->second;
}
const std::string encoded = this->client_->UrlEncode(cache_key);
const std::string url =
"https://en.wikipedia.org/w/api.php?action=query&titles=" + encoded +
"&prop=extracts&explaintext=1&format=json";
const std::string body = this->client_->Get(url);
boost::system::error_code parse_error;
boost::json::value doc = boost::json::parse(body, parse_error);
if (!parse_error && doc.is_object()) {
try {
auto& pages = doc.at("query").at("pages").get_object();
if (!pages.empty()) {
auto& page = pages.begin()->value().get_object();
if (page.contains("extract") && page.at("extract").is_string()) {
const std::string_view extract_view = page.at("extract").as_string();
std::string extract(extract_view);
spdlog::debug("WikipediaService fetched {} chars for '{}'",
extract.size(), query);
this->extract_cache_.emplace(cache_key, extract);
return extract;
}
}
this->extract_cache_.emplace(cache_key, std::string{});
} catch (const std::exception& e) {
spdlog::warn(
"WikipediaService: failed to parse response structure for '{}': "
"{}",
query, e.what());
return {};
}
} else if (parse_error) {
spdlog::warn("WikipediaService: JSON parse error for '{}': {}", query,
parse_error.message());
}
return {};
}

View File

@@ -0,0 +1,47 @@
/**
* @file wikipedia/get_summary.cc
* @brief WikipediaService::GetLocationContext() implementation.
*/
#include <spdlog/spdlog.h>
#include <string>
#include "services/wikipedia_service.h"
std::string WikipediaService::GetLocationContext(const Location& loc) {
if (!client_) {
return {};
}
std::string result;
std::string region_query(loc.city);
if (!loc.country.empty()) {
region_query += ", ";
region_query += loc.country;
}
const std::string beer_query = "beer in " + loc.country;
const std::string city_beer_query = "beer in " + loc.city;
auto append_extract = [&result](const std::string& extract) -> void {
if (extract.empty()) {
return;
}
if (!result.empty()) {
result += "\n\n";
}
result += extract;
};
try {
append_extract(FetchExtract(region_query));
append_extract(FetchExtract(beer_query));
append_extract(FetchExtract(city_beer_query));
} catch (const std::runtime_error& e) {
spdlog::debug("WikipediaService lookup failed for '{}': {}", region_query,
e.what());
}
return result;
}

View File

@@ -3,10 +3,9 @@
* @brief WikipediaService constructor implementation.
*/
#include "services/enrichment/wikipedia_service.h"
#include "services/wikipedia_service.h"
#include <utility>
WikipediaEnrichmentService::WikipediaEnrichmentService(
std::unique_ptr<WebClient> client)
WikipediaService::WikipediaService(std::unique_ptr<WebClient> client)
: client_(std::move(client)) {}

View File

@@ -0,0 +1,19 @@
/**
* @file web_client/curl_global_state.cc
* @brief CurlGlobalState constructor and destructor implementation.
*/
#include <curl/curl.h>
#include <stdexcept>
#include "web_client/curl_web_client.h"
CurlGlobalState::CurlGlobalState() {
if (curl_global_init(CURL_GLOBAL_DEFAULT) != CURLE_OK) {
throw std::runtime_error(
"[CURLWebClient] Failed to initialize libcurl globally");
}
}
CurlGlobalState::~CurlGlobalState() { curl_global_cleanup(); }

View File

@@ -0,0 +1,87 @@
/**
* @file web_client/curl_web_client_get.cc
* @brief CURLWebClient::Get() implementation.
*/
#include <curl/curl.h>
#include <cstdint>
#include <limits>
#include <memory>
#include <stdexcept>
#include <string>
#include "web_client/curl_web_client.h"
using CurlHandle = std::unique_ptr<CURL, decltype(&curl_easy_cleanup)>;
static constexpr long kConnectionTimeout = 10;
static constexpr long kRequestTimeout = 30;
static constexpr long kMaxRedirects = 5;
static constexpr int32_t kOkHttpStatus = 200;
static CurlHandle CreateHandle() {
CURL* handle = curl_easy_init();
if (handle == nullptr) {
throw std::runtime_error(
"[CURLWebClient] Failed to initialize libcurl handle");
}
return {handle, &curl_easy_cleanup};
}
static void SetCommonGetOptions(CURL* curl, const std::string& url) {
curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
curl_easy_setopt(curl, CURLOPT_USERAGENT, "biergarten-pipeline/0.1.0");
curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
curl_easy_setopt(curl, CURLOPT_MAXREDIRS, kMaxRedirects);
curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT, kConnectionTimeout);
curl_easy_setopt(curl, CURLOPT_TIMEOUT, kRequestTimeout);
curl_easy_setopt(curl, CURLOPT_ACCEPT_ENCODING, "gzip");
}
// curl write callback that appends response data into a std::string
static size_t WriteCallbackString(void* contents, const size_t size,
const size_t nmemb, void* userp) {
const size_t real_size = size * nmemb;
auto* str = static_cast<std::string*>(userp);
str->append(static_cast<char*>(contents), real_size);
return real_size;
}
std::string CURLWebClient::Get(const std::string& url) {
const CurlHandle curl = CreateHandle();
std::string response_string;
SetCommonGetOptions(curl.get(), url);
curl_easy_setopt(curl.get(), CURLOPT_WRITEFUNCTION, WriteCallbackString);
curl_easy_setopt(curl.get(), CURLOPT_WRITEDATA, &response_string);
CURLcode curl_result = curl_easy_perform(curl.get());
if (curl_result != CURLE_OK) {
const auto error = std::string("[CURLWebClient] GET failed: ") +
curl_easy_strerror(curl_result);
throw std::runtime_error(error);
}
long curl_http_code = 0;
curl_easy_getinfo(curl.get(), CURLINFO_RESPONSE_CODE, &curl_http_code);
if (curl_http_code < std::numeric_limits<int32_t>::min() ||
curl_http_code > std::numeric_limits<int32_t>::max()) {
throw std::runtime_error("[CURLWebClient] Invalid HTTP status code: " +
std::to_string(curl_http_code));
}
const int32_t http_code = static_cast<int32_t>(curl_http_code);
if (http_code != kOkHttpStatus) {
const std::string error = "[CURLWebClient] HTTP error " +
std::to_string(http_code) + " for URL " + url;
throw std::runtime_error(error);
}
return response_string;
}

View File

@@ -0,0 +1,24 @@
/**
* @file web_client/curl_web_client_url_encode.cc
* @brief CURLWebClient::UrlEncode() implementation.
*/
#include <curl/curl.h>
#include <stdexcept>
#include <string>
#include "web_client/curl_web_client.h"
std::string CURLWebClient::UrlEncode(const std::string& value) {
// A NULL handle is fine for UTF-8 encoding according to libcurl docs.
char* output = curl_easy_escape(nullptr, value.c_str(), 0);
if (!output) {
throw std::runtime_error("[CURLWebClient] curl_easy_escape failed");
}
std::string result(output);
curl_free(output);
return result;
}

View File

@@ -1,68 +0,0 @@
/**
* @file web_client/http_web_client.cc
* @brief cpp-httplib implementation of WebClient.
*/
#include "web_client/http_web_client.h"
#include <httplib.h>
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include "spdlog/spdlog.h"
namespace {
constexpr time_t kConnectionTimeoutSeconds = 5;
constexpr time_t kReadTimeoutSeconds = 10;
constexpr int kSuccessMin = 200;
constexpr int kSuccessMax = 300;
const std::regex kUrlRegex(
R"(^(https?://[^/?#]+)(/[^?#]*(?:\?[^#]*)?(?:#.*)?)?)");
std::pair<std::string, std::string> SplitUrl(const std::string& url) {
std::smatch match;
if (!std::regex_match(url, match, kUrlRegex)) {
throw std::invalid_argument("[HttpWebClient] Malformed URL: " + url);
}
return {match[1].str(), match[2].matched ? match[2].str() : "/"};
}
} // namespace
std::string HttpWebClient::Get(const std::string& url) {
const auto [origin, path] = SplitUrl(url);
httplib::Client client(origin);
client.set_follow_location(true);
client.set_connection_timeout(kConnectionTimeoutSeconds);
client.set_read_timeout(kReadTimeoutSeconds);
client.set_default_headers({
{"Accept", "application/json"},
{"User-Agent", "biergarten-pipeline/1.0"}
});
const httplib::Result result = client.Get(path);
if (!result) {
throw std::runtime_error(
"[HttpWebClient] Request failed for URL: " + url +
"" + httplib::to_string(result.error()));
}
if (result->status < kSuccessMin || result->status >= kSuccessMax) {
spdlog::error("[HttpWebClient] Request failed for URL: " + url);
throw std::runtime_error(
"[HttpWebClient] HTTP " + std::to_string(result->status) +
" for URL: " + url);
}
return result->body;
}
std::string HttpWebClient::EncodeURL(const std::string& value) {
return httplib::encode_uri_component(value);
}