Update readme

2026-07-16 17:47:22 +00:00 · 2026-04-18 12:56:34 -04:00
parent 9649c993e8
commit ce56532728
1 changed files with 177 additions and 57 deletions
--- a/pipeline/README.md
+++ b/pipeline/README.md
@@ -1,71 +1,112 @@
 # Biergarten Pipeline
-Biergarten Pipeline is a C++20 command-line tool that reads a local city list, resolves contextual enrichment for each sampled city through an injected service, and generates brewery names and descriptions. The current code samples up to four locations per run, then uses either Gemma 4 or the mock generator to produce the output.
+Biergarten Pipeline is a C++20 command-line pipeline that samples city records from local JSON, enriches each city with Wikipedia context, and generates bilingual brewery names and descriptions with either a local GGUF model or a deterministic mock generator.
 ## Key Components
 - `src/main.cc` wires dependencies with Boost.DI.
 - `JsonLoader` validates the curated location input.
 - `WikipediaService` caches extracts and returns empty context when a lookup fails.
 - `LlamaGenerator` formats prompts for Gemma 4, retries malformed output, and validates JSON before accepting it.
 - `MockGenerator` emits repeatable output for demos and smoke tests.
 - Brewery payloads include English and local-language name and description fields.
 A structural overview is available in [biergarten_pipeline.puml](biergarten_pipeline.puml).
 ## How It Fits The Main App
 The parent app uses this pipeline as brewery seed data. Planned brewery discovery, reviews, follows, and map features use the records it emits.
 The planned brewery features called out in the parent README map directly to the output this pipeline produces:
 | Planned app area                 | Pipeline contribution                                                     |
 | -------------------------------- | ------------------------------------------------------------------------- |
 | Brewery discovery and management | Sampled city records, localized brewery names, and long-form descriptions |
 | Beer reviews and ratings         | Stable brewery fixtures with enough context to anchor review pages        |
 | Social follow relationships      | Repeatable brewery entities for feeds, follows, and saved lists           |
 | Geospatial brewery experiences   | Latitude, longitude, and country-level metadata                           |
 | Additional frontend routes       | Deterministic fixture data for Storybook, demos, and browser tests        |
 The pipeline stays outside the web app runtime. That leaves room for a future import job, seed step, or fixture loader.
 ## Tested Hardware & OS
-### ARM MacOS, M1 Pro
+The local model path was run on both Apple Silicon and discrete-GPU Linux hardware.
 ### ARM macOS, M1 Pro
 - **Host**: MacBook Pro 14" (2021)
 - **CPU**: Apple M1 Pro (8-core)
- **GPU**: Apple M1 Pro (14-core) [Integrated]
+- **GPU**: Apple M1 Pro (14-core) integrated GPU
 - **Memory**: 16 GB
- **Model**: Gemma 4 E4B: efficient local reasoning; released Apr 2, 2026.
+- **Model**: Gemma 4 E4B
- **Inference**: llama.cpp with Metal (MPS) support
+- **Inference**: llama.cpp with Metal support
-### x86/64 Linux, NVIDIA RTX 2000
+### x86_64 Linux, NVIDIA RTX 2000
 - **Host**: ThinkPad P1 Gen 7 (Fedora 43)
 - **CPU**: Intel Core Ultra 7 155H
 - **GPU**: NVIDIA RTX 2000 Ada Generation
 - **Memory**: 32 GB
- **Model**: Gemma 4 E4B: efficient local reasoning; released Apr 2, 2026.
+- **Model**: Gemma 4 E4B
 - **Inference**: llama.cpp with CUDA 12.x support
 ## Pipeline
-| Stage    | What happens                                                            |
+| Stage    | Implementation                                                                                                 |
-| -------- | ----------------------------------------------------------------------- |
+| -------- | -------------------------------------------------------------------------------------------------------------- |
-| Load     | Reads `locations.json` and picks up to four city/country pairs.         |
+| Load     | `JsonLoader::LoadLocations()` reads `locations.json` into typed `Location` records.                            |
-| Enrich   | Calls the injected enrichment service for each sampled city.            |
+| Sample   | `BiergartenDataGenerator::QueryCitiesWithCountries()` samples up to 50 locations per run.                      |
-| Generate | Passes the city, country, and gathered context to the active generator. |
+| Enrich   | `WikipediaService` looks up city and beer context and keeps going when a lookup fails.                         |
-| Log      | Writes the generated breweries and any warnings through `spdlog`.       |
+| Generate | `MockGenerator` or `LlamaGenerator` produces brewery names and descriptions in English and the local language. |
 | Log      | `spdlog` writes results and warnings to the console.                                                           |
-If an enrichment lookup throws, the pipeline skips that city and keeps going. If the lookup returns an empty string, the city stays in the pipeline and is still passed to the generator.
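+If enrichment throws or generation fails for a city, the pipeline skips that city and continues with the rest. A lookup that only returns empty context does not cause a skip; the city is still passed to the generator.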
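A minimal sketch of that skip-and-continue loop, assuming the enrichment and generation steps report hard failures by throwing. `City`, `Brewery`, `EnrichFn`, `GenerateFn`, and `RunPipeline` are hypothetical stand-ins for illustration, not the pipeline's real types or signatures.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the pipeline's real location and result types.
struct City {
  std::string name;
  std::string country;
};

struct Brewery {
  std::string city;
  std::string name_en;
};

// Hypothetical callable shapes; the real interfaces may differ.
using EnrichFn = std::function<std::string(const City&)>;
using GenerateFn = std::function<Brewery(const City&, const std::string&)>;

// Runs enrich + generate for each city. A city whose step throws is skipped
// and counted; the remaining cities are still processed. Empty context is
// passed through to the generator unchanged.
std::pair<std::vector<Brewery>, int> RunPipeline(
    const std::vector<City>& cities, const EnrichFn& enrich,
    const GenerateFn& generate) {
  assert(enrich && generate);
  std::vector<Brewery> results;
  int skipped = 0;
  for (const City& city : cities) {
    try {
      const std::string context = enrich(city);
      results.push_back(generate(city, context));
    } catch (const std::exception&) {
      ++skipped;  // A real pipeline would log a warning here.
    }
  }
  return {std::move(results), skipped};
}
```

Catching per city rather than around the whole loop is what keeps one bad lookup from ending the run.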
-## Core Components
+## Runtime Behaviour
-| Component               | Role                                                                   |
+- `WikipediaService` queries city, country, and beer-related Wikipedia extracts, then caches the first successful response for each query string.
-| ----------------------- | ---------------------------------------------------------------------- |
+- `GetLocationContext()` returns an empty string when the web client is unavailable or when lookup/parsing fails, so the pipeline can continue.
-| BiergartenDataGenerator | Orchestrates loading, enrichment lookup, generation, and logging.      |
+- `LlamaGenerator` validates the model output as structured JSON and retries malformed responses up to three times.
-| IEnrichmentService      | Abstraction for location-context providers.                            |
+- If the model output looks truncated, the retry path raises the token budget before trying again.
-| WikipediaService        | Default enrichment provider backed by Wikipedia and in-memory caching. |
+- `MockGenerator` uses stable hashes so the same city input produces the same brewery result every time.
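The retry path in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the `LlamaGenerator` code: `Attempt`, `GenerateFn`, `ValidateFn`, and `GenerateWithRetry` are hypothetical names, "up to three times" is read as three attempts in total, and doubling the token budget is an assumed growth rule.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

// Hypothetical result of a single generation attempt.
struct Attempt {
  std::string text;
  bool truncated = false;  // True when the output hit the token limit.
};

// Hypothetical callable shapes for the model call and the JSON check.
using GenerateFn = std::function<Attempt(int max_tokens)>;
using ValidateFn = std::function<bool(const std::string&)>;

// Calls generate() up to max_attempts times and returns the first output
// that passes is_valid(). When an invalid output was truncated, the token
// budget is doubled before the next attempt (assumed growth factor).
std::optional<std::string> GenerateWithRetry(const GenerateFn& generate,
                                             const ValidateFn& is_valid,
                                             int max_tokens,
                                             int max_attempts = 3) {
  assert(max_tokens > 0 && max_attempts > 0);
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    const Attempt result = generate(max_tokens);
    if (is_valid(result.text)) {
      return result.text;
    }
    if (result.truncated) {
      max_tokens *= 2;
    }
  }
  return std::nullopt;  // Caller decides whether to skip the city.
}
```

Returning `std::optional` keeps the "gave up" case explicit, so the caller can skip the city instead of accepting malformed output.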
-| LlamaGenerator          | Runs local GGUF inference and validates output.                        |
+
-| MockGenerator           | Produces deterministic fallback data without a model.                  |
+## Tech Stack
-| JsonLoader              | Parses the local `locations.json` file.                                |
+
-| CURLWebClient           | Handles HTTP requests to Wikipedia.                                    |
+- C++20
 - CMake 3.24+
 - Boost.JSON
 - Boost.ProgramOptions
 - Boost.DI
 - spdlog
 - libcurl
 - llama.cpp
 The build fetches Boost.DI, spdlog, and llama.cpp through CMake. The current configuration targets macOS and Linux; Metal is enabled on Apple Silicon, and CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
 ## Code Style
 The codebase uses modern C++20, RAII for ownership, `std::unique_ptr` for injected dependencies, `std::optional` for parse outcomes, `std::span` for read-only views over generated city data, and structured bindings in the pipeline loops.
 Formatting follows the Google C++ Style Guide via `.clang-format`, with a narrow column limit and two-space indentation.
 ## Build
-| Requirement          | Notes                                                                      |
+Requirements:
 | -------------------- | -------------------------------------------------------------------------- |
 | C++20 compiler       | GCC 13+ or Clang 16+ are good starting points.                             |
 | CMake                | Version 3.24 or newer.                                                     |
 | libcurl              | Required for Wikipedia requests.                                           |
 | Optional GPU tooling | CUDA on NVIDIA, HIP/ROCm on supported AMD systems, Metal on Apple Silicon. |
-Boost, Boost.DI, spdlog, and llama.cpp are fetched by CMake. On Apple Silicon, Metal is enabled automatically. On Linux, the build looks for CUDA or HIP/ROCm when the matching toolkit is present. There are no plans to support Windows.
+- C++20 compiler
 - CMake 3.24 or newer
 - libcurl
 - Boost with JSON and Program Options components installed
 ```bash
 cmake -S . -B build
 cmake --build build
 ```
 If the dependency build fails on macOS, check the repo build notes.
 ## Model
-Create a `models/` directory and download the GGUF file there before running the app.
+If you plan to run the model-backed path, create a `models/` directory and download the GGUF file there. Skip this step if you only want `--mocked`.
 ```bash
 mkdir -p models
@@ -76,31 +117,110 @@ curl -L \
 ## Run
-Run the executable from the build directory so the copied `locations.json` and `prompts/` directory are available.
+Run the executable from the `build/` directory so the copied `locations.json` and `prompts/` directory are available.
 ```bash
 ./biergarten-pipeline --mocked
 ./biergarten-pipeline --model models/google_gemma-4-E4B-it-Q6_K.gguf --temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1
 ```
 ## CLI Flags
 | Flag            | Purpose                                                                                                         |
-| --------------- | ---------------------------------------------------------------------------- |
+| --------------- | --------------------------------------------------------------------------------------------------------------- |
-| `--mocked`      | Uses the mock generator instead of a model.                                  |
+| `--mocked`      | Uses the deterministic mock generator instead of loading a model.                                               |
-| `--model, -m`   | Path to a GGUF model file, such as `models/google_gemma-4-E4B-it-Q6_K.gguf`. |
+| `--model, -m`   | Path to a GGUF model file, such as `models/google_gemma-4-E4B-it-Q6_K.gguf`. Required unless `--mocked` is set. |
-| `--temperature` | Sampling temperature. Default: `1.0`.                                        |
+| `--temperature` | Sampling temperature. Default: `1.0`. Ignored when `--mocked` is set.                                           |
-| `--top-p`       | Nucleus sampling parameter. Default: `0.95`.                                 |
+| `--top-p`       | Nucleus sampling parameter. Default: `0.95`. Ignored when `--mocked` is set.                                    |
-| `--top-k`       | Top-k sampling parameter. Default: `64`.                                     |
+| `--top-k`       | Top-k sampling parameter. Default: `64`. Ignored when `--mocked` is set.                                        |
-| `--n-ctx`       | Context window size. Default: `8192`.                                        |
+| `--n-ctx`       | Context window size. Default: `8192`. Ignored when `--mocked` is set.                                           |
-| `--seed`        | Random seed. Default: `-1`.                                                  |
+| `--seed`        | Random seed. Default: `-1`, which selects a random seed at runtime. Ignored when `--mocked` is set.             |
-| `--help, -h`    | Prints usage.                                                                |
+| `--help, -h`    | Prints the usage text and exits.                                                                                |
-`--mocked` and `--model` are mutually exclusive. If neither is set, the program exits with an error. The sampling flags only matter when a model is loaded. The enrichment step is sequential now, and empty context is allowed.
+`--mocked` and `--model` are mutually exclusive. If neither is provided, the program exits with an error before the pipeline starts.
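The flag rule above can be expressed as a small check that runs before the pipeline starts. The sketch below is independent of Boost.ProgramOptions and only illustrates the rule; `GeneratorMode`, `ModeResult`, `ResolveMode`, and the error strings are hypothetical and not taken from `src/main.cc`.

```cpp
#include <cassert>
#include <optional>
#include <string>

enum class GeneratorMode { kMocked, kModel };

struct ModeResult {
  std::optional<GeneratorMode> mode;  // Empty when the flags are invalid.
  std::string error;                  // Hypothetical message text.
};

// Applies the rule: exactly one of --mocked or --model must be given.
ModeResult ResolveMode(bool mocked,
                       const std::optional<std::string>& model_path) {
  const bool has_model = model_path.has_value() && !model_path->empty();
  if (mocked && has_model) {
    return {std::nullopt, "--mocked and --model are mutually exclusive"};
  }
  if (!mocked && !has_model) {
    return {std::nullopt, "either --mocked or --model is required"};
  }
  assert(mocked != has_model);  // Exactly one mode is selected here.
  return {mocked ? GeneratorMode::kMocked : GeneratorMode::kModel, ""};
}
```

A caller would print `error` and exit with a non-zero status when `mode` is empty, which matches the "exits with an error before the pipeline starts" behavior described above.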
-## Layout
+The post-build step copies `prompts/` into `build/prompts/`, so rebuild after changing [prompts/system.md](prompts/system.md).
-| Path             | Use                                         |
+## Generated Output
-| ---------------- | ------------------------------------------- |
+
-| `includes/`      | Public headers.                             |
+Each successfully generated city is stored as a `GeneratedBrewery` entry that pairs the source location with a `BreweryResult` payload.
 | Field               | Meaning                                    |
 | ------------------- | ------------------------------------------ |
 | `name_en`           | Brewery name in English.                   |
 | `description_en`    | Brewery description in English.            |
 | `name_local`        | Brewery name in the local language.        |
 | `description_local` | Brewery description in the local language. |
 The final log dump also includes the city, country, state or province, ISO subdivision code, latitude, and longitude for each generated entry.
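As an illustration of that shape, the sketch below pairs a location with the four generated fields and formats one log line. The field names come from the tables in this README; the member types, the `Location` layout, and `FormatLogLine` are assumptions and may not match the headers in `includes/`.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Assumed layout; only the field names are taken from the README.
struct Location {
  std::string city;
  std::string state_province;
  std::string country;
  std::string iso3166_2;
  double latitude = 0.0;
  double longitude = 0.0;
};

struct BreweryResult {
  std::string name_en;
  std::string description_en;
  std::string name_local;
  std::string description_local;
};

struct GeneratedBrewery {
  Location location;
  BreweryResult brewery;
};

// Hypothetical formatter for one entry of the final log dump.
std::string FormatLogLine(const GeneratedBrewery& entry) {
  const Location& loc = entry.location;
  assert(loc.latitude >= -90.0 && loc.latitude <= 90.0);
  std::ostringstream line;
  line << entry.brewery.name_en << " (" << entry.brewery.name_local << ") | "
       << loc.city << ", " << loc.state_province << ", " << loc.country
       << " [" << loc.iso3166_2 << "] @ " << std::fixed
       << std::setprecision(4) << loc.latitude << ", " << loc.longitude;
  return line.str();
}
```

Keeping the location and the generated payload as separate members means a consumer can read coordinates and labels without parsing the generated text.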
 ## Consumer Data Shape
 | Field                               | Why it matters to the app                         |
 | ----------------------------------- | ------------------------------------------------- |
 | `city`, `state_province`, `country` | Human-readable location labels and page headings  |
 | `iso3166_1`, `iso3166_2`            | Filtering, regional grouping, and locale matching |
 | `latitude`, `longitude`             | Map pins and nearby brewery views                 |
 | `local_languages`                   | Locale-aware copy selection                       |
 | `name_en`, `description_en`         | Default English display content                   |
 | `name_local`, `description_local`   | Local-language display content                    |
 | `region_context`                    | Richer copy for cards and detail screens          |
 ## Fixture Strategy
 - Use `--mocked` for stable fixtures and repeatable screenshots.
 - Use `--model` when you want geographically grounded content for demos.
 - Keep `locations.json` structured enough to support discovery and future filtering.
 - Rebuild after prompt changes so `build/prompts/system.md` stays aligned with the source prompt.
 - Treat the generated output as seed material for the app's brewery domain.
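One way a mocked path can stay repeatable across machines is to derive every choice from a hash whose value is fixed by the algorithm. The sketch below uses 64-bit FNV-1a for that purpose. The README only says "stable hashes", so the choice of FNV-1a, the `city|country` key format, and the `PickStable` helper are assumptions, not a description of `MockGenerator`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// 64-bit FNV-1a. Unlike std::hash, whose values are implementation-defined,
// this produces the same number for the same bytes on every platform.
constexpr std::uint64_t Fnv1a64(std::string_view text) {
  std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis.
  for (const char c : text) {
    hash ^= static_cast<unsigned char>(c);
    hash *= 1099511628211ULL;  // FNV prime.
  }
  return hash;
}

// Hypothetical helper: picks one option for a city, always the same one for
// the same (city, country) pair and the same option list.
const std::string& PickStable(const std::vector<std::string>& options,
                              std::string_view city,
                              std::string_view country) {
  assert(!options.empty());
  std::string key(city);
  key += '|';
  key += country;
  return options[Fnv1a64(key) % options.size()];
}
```

Because the result depends on the option list as well as the key, reordering or resizing the list changes the fixtures, so such lists are best treated as append-only.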
 ## Next Steps
 The current pipeline produces city-aware brewery records. The next pass should add user, beer, check-in, and rating fixtures so the app can exercise more of the brewery domain without live data.
 ### User Generation
 - Generate user profiles with stable names, bios, locale hints, and preference signals.
 - Keep the output deterministic for screenshots and Storybook runs, while still allowing larger randomized batches.
 - Include stable IDs so downstream fixtures can join on users.
 - Shape user data so it can support auth demos, follows, saved lists, and profile pages later.
 ### Beer Generation
 - Generate beer catalog entries with style, ABV, IBU, color, aroma notes, and food pairing hints.
 - Link beers back to breweries and cities so the app can build detail pages, related-content panels, and filtered views.
 - Keep brewery references and style metadata aligned with the rest of the fixture data.
 - Keep style coverage wide enough to exercise search, sort, and category filters.
 ### Check-In System
 - Produce check-in events as timestamped interactions between users and breweries.
 - Use a J-curve-like activity profile: a small set of users should account for most check-ins, while the rest appear only occasionally.
 - Add bursty behavior around weekends, travel, and brewery-heavy periods so the event stream resembles real usage.
 - Keep repeated visits to a smaller subset of breweries so popularity and recency views have something meaningful to rank.
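A possible starting point for the skewed activity profile is to weight users by rank with a Zipf-like rule and sample check-ins from those weights. This is a sketch of one option, not a planned design: `SampleCheckInCounts`, the `1 / rank^exponent` weighting, and the parameter names are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Distributes check_ins across user_count users so that low-ranked indices
// (the most active users) receive most of the events. The weight of the
// user at zero-based index i is 1 / (i + 1)^exponent.
std::vector<std::size_t> SampleCheckInCounts(std::size_t user_count,
                                             std::size_t check_ins,
                                             double exponent,
                                             std::uint32_t seed) {
  assert(user_count > 0);
  std::vector<double> weights(user_count);
  for (std::size_t i = 0; i < user_count; ++i) {
    weights[i] = 1.0 / std::pow(static_cast<double>(i + 1), exponent);
  }
  std::mt19937 rng(seed);
  std::discrete_distribution<std::size_t> pick_user(weights.begin(),
                                                    weights.end());
  std::vector<std::size_t> counts(user_count, 0);
  for (std::size_t n = 0; n < check_ins; ++n) {
    ++counts[pick_user(rng)];
  }
  return counts;
}
```

A fixed seed makes the result repeatable on a given toolchain. The standard specifies `std::mt19937` exactly but leaves the algorithm behind `std::discrete_distribution` to the implementation, so fixtures that must match byte for byte across compilers would need a hand-written sampling step.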
 ### Beer Ratings
 - Generate rating events with a strong positive skew and a long tail of lower scores.
 - Avoid uniform distributions; the data should cluster around common values while still leaving room for outliers.
 - Attach timestamps and user IDs so the app can compute aggregates, trends, and recent-activity views.
 - Keep enough history to support average ratings, rating counts, and per-style comparisons.
 ## Suggested Code Tour
 - `src/main.cc` handles argument parsing and the dependency-injection composition root.
 - `src/biergarten_data_generator/` contains the orchestration, sampling, and logging flow.
 - `src/services/wikipedia/` contains the enrichment service and its cache.
 - `src/data_generation/llama/` contains local inference, prompt loading, and output validation.
 - `src/data_generation/mock/` contains the deterministic fallback path.
 - `includes/` holds the public interfaces and data models.
 ## Repo Layout
 | Path                       | Purpose                                           |
 | -------------------------- | ------------------------------------------------- |
 | `includes/`                | Public headers and shared models.                 |
 | `src/`                     | Implementation files.                             |
-| `locations.json` | Input city list copied into the build tree. |
+| `locations.json`           | Curated city input copied into the build tree.    |
-| `prompts/`       | Prompt text used by the model path.         |
+| `prompts/`                 | System prompt text used by the model-backed path. |
 | `biergarten_pipeline.puml` | Class and composition diagram.                    |