Mirror of https://github.com/aaronpo97/the-biergarten-app.git (synced 2026-06-01 01:54:00 +00:00)
documentation updates
@@ -2,6 +2,30 @@
A C++20 command-line pipeline that samples city records from local JSON, enriches each with Wikipedia context, and generates bilingual brewery names and descriptions via a local GGUF model or a deterministic mock.

---

## Table of Contents

- [How It Fits the Main App](#how-it-fits-the-main-app)
- [Tech Stack](#tech-stack)
- [Build](#build)
- [Model](#model)
- [Run](#run)
- [Architecture](#architecture)
- [Pipeline Stages](#pipeline-stages)
- [Key Components](#key-components)
- [Runtime Behaviour](#runtime-behaviour)
- [Generated Output](#generated-output)
- [Language Generation Quality](#language-generation-quality)
- [Known Issues](#known-issues)
- [Tested Hardware](#tested-hardware)
- [Repo Layout](#repo-layout)
- [Code Tour](#code-tour)
- [Fixture Strategy](#fixture-strategy)
- [Next Steps](#next-steps)

---

## How It Fits the Main App

The pipeline is a data ingestion layer. It sits outside the web app runtime and produces seed records the app imports at startup or during a dedicated seed step.
@@ -12,69 +36,8 @@ The pipeline is a data ingestion layer. It sits outside the web app runtime and
|
|||||||
| Beer reviews and ratings | Stable brewery fixtures with enough context to anchor review pages |
|
| Beer reviews and ratings | Stable brewery fixtures with enough context to anchor review pages |
|
||||||
| Social follow relationships | Repeatable brewery entities for feeds, follows, and saved lists |
|
| Social follow relationships | Repeatable brewery entities for feeds, follows, and saved lists |
|
||||||
| Geospatial brewery experiences | Latitude, longitude, and country-level metadata |
|
| Geospatial brewery experiences | Latitude, longitude, and country-level metadata |
|
||||||
| Additional frontend routes | Deterministic fixture data for Storybook, demos, and browser tests |
|
|
||||||
|
|
||||||
## Pipeline Stages
|
---
|
||||||
|
|
||||||
```
|
|
||||||
locations.json → Enrich (Wikipedia) → Generate (LLM / Mock) → log output
|
|
||||||
```
|
|
||||||
|
|
||||||
| Stage | Implementation |
|
|
||||||
| -------- | -------------------------------------------------------------------------------------------------------------- |
|
|
||||||
| Load | `JsonLoader::LoadLocations()` reads `locations.json` into typed `Location` records. |
|
|
||||||
| Sample | `BiergartenDataGenerator::QueryCitiesWithCountries()` samples up to 50 locations per run. |
|
|
||||||
| Enrich | `WikipediaService` fetches city and beer context. Keeps going when a lookup fails. |
|
|
||||||
| Generate | `MockGenerator` or `LlamaGenerator` produces brewery names and descriptions in English and the local language. |
|
|
||||||
| Log | `spdlog` writes results and warnings to the console. |
|
|
||||||
|
|
||||||
If enrichment or generation fails for a city, that city is skipped and the pipeline continues.
|
|
||||||
|
|
||||||
## Key Components
|
|
||||||
|
|
||||||
- `src/main.cc` — argument parsing and Boost.DI composition root.
|
|
||||||
- `JsonLoader` — validates curated location input.
|
|
||||||
- `WikipediaService` — queries English and local-language extracts, caches results, returns empty context on failure.
|
|
||||||
- `LlamaGenerator` — formats prompts for Gemma 4, validates JSON output, retries malformed responses up to three times. If output looks truncated, the retry raises the token budget before trying again.
|
|
||||||
- `MockGenerator` — stable hash-based output so the same city input always produces the same brewery.
|
|
||||||
- Brewery payloads include English and local-language name and description fields.
|
|
||||||
|
|
||||||
A structural overview is in [biergarten_pipeline.puml](biergarten_pipeline.puml).
|
|
||||||
|
|
||||||
## Runtime Behaviour
|
|
||||||
|
|
||||||
`WikipediaService` queries city, country, and beer-related Wikipedia extracts in both English and the local language, then caches the first successful response per query string. Both extracts are passed into the prompt so the model can draw on local-language sources without a separate translation step.
|
|
||||||
|
|
||||||
`GetLocationContext()` returns an empty string when the web client is unavailable or when lookup/parsing fails.
|
|
||||||
|
|
||||||
`LlamaGenerator` validates model output as structured JSON. The retry path exists as a safety hatch for cases where the reasoning block consumes available token budget and compresses the JSON output space. All runs to date have produced valid output on the first pass; the path is kept for resilience.
|
|
||||||
|
|
||||||
`MockGenerator` uses stable hashes for repeatable output in demos and Storybook runs.
|
|
||||||
|
|
||||||
## Generated Output
|
|
||||||
|
|
||||||
Each successful run stores a `GeneratedBrewery` pair with the source location and a `BreweryResult` payload.
|
|
||||||
|
|
||||||
| Field | Meaning |
|
|
||||||
| ------------------- | ------------------------------------------ |
|
|
||||||
| `name_en` | Brewery name in English. |
|
|
||||||
| `description_en` | Brewery description in English. |
|
|
||||||
| `name_local` | Brewery name in the local language. |
|
|
||||||
| `description_local` | Brewery description in the local language. |
|
|
||||||
|
|
||||||
The log dump also includes city, country, state or province, ISO subdivision code, latitude, and longitude for each entry.
|
|
||||||
|
|
||||||
## Next Steps
|
|
||||||
|
|
||||||
| Field | Why it matters |
|
|
||||||
| ----------------------------------- | ------------------------------------------------ |
|
|
||||||
| `city`, `state_province`, `country` | Human-readable location labels and page headings |
|
|
||||||
| `iso3166_1`, `iso3166_2` | Filtering, regional grouping, locale matching |
|
|
||||||
| `latitude`, `longitude` | Map pins and nearby brewery views |
|
|
||||||
| `local_languages` | Locale-aware copy selection |
|
|
||||||
| `name_en`, `description_en` | Default English display content |
|
|
||||||
| `name_local`, `description_local` | Local-language display content |
|
|
||||||
| `region_context` | Richer copy for cards and detail pages |
|
|
||||||
|
|
||||||
## Tech Stack
|
## Tech Stack
|
||||||
|
|
||||||
@@ -87,33 +50,9 @@ The log dump also includes city, country, state or province, ISO subdivision cod
|
|||||||
|
|
||||||
The build fetches Boost.DI, spdlog, and llama.cpp via CMake. Metal is enabled on Apple Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
|
The build fetches Boost.DI, spdlog, and llama.cpp via CMake. Metal is enabled on Apple Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present.
|
||||||
|
|
||||||
## Code Style
|
> **Code Style:** Modern C++20 throughout — RAII for ownership, `std::unique_ptr` for injected dependencies, `std::optional` for parse outcomes, `std::span` for read-only views over generated city data, structured bindings in pipeline loops. Formatting follows the Google C++ Style Guide via `.clang-format` with a narrow column limit and two-space indentation.
|
||||||
|
|
||||||
Modern C++20 throughout: RAII for ownership, `std::unique_ptr` for injected dependencies, `std::optional` for parse outcomes, `std::span` for read-only views over generated city data, structured bindings in pipeline loops. Formatting follows the Google C++ Style Guide via `.clang-format` with a narrow column limit and two-space indentation.
|
---
|
||||||
|
|
||||||
## Tested Hardware
|
|
||||||
|
|
||||||
### ARM macOS — M1 Pro
|
|
||||||
|
|
||||||
| | |
|
|
||||||
| --------- | --------------------------------- |
|
|
||||||
| Host | MacBook Pro 14" (2021) |
|
|
||||||
| CPU | Apple M1 Pro (8-core) |
|
|
||||||
| GPU | Apple M1 Pro (14-core integrated) |
|
|
||||||
| Memory | 16 GB |
|
|
||||||
| Model | Gemma 4 E4B |
|
|
||||||
| Inference | llama.cpp with Metal |
|
|
||||||
|
|
||||||
### x86_64 Linux — NVIDIA RTX 2000
|
|
||||||
|
|
||||||
| | |
|
|
||||||
| --------- | ------------------------------ |
|
|
||||||
| Host | ThinkPad P1 Gen 7 (Fedora 43) |
|
|
||||||
| CPU | Intel Core Ultra 7 155H |
|
|
||||||
| GPU | NVIDIA RTX 2000 Ada Generation |
|
|
||||||
| Memory | 32 GB |
|
|
||||||
| Model | Gemma 4 E4B |
|
|
||||||
| Inference | llama.cpp with CUDA 12.x |
|
|
||||||
|
|
||||||
## Build
|
## Build
|
||||||
|
|
||||||
@@ -124,9 +63,11 @@ cmake -S . -B build
|
|||||||
cmake --build build
|
cmake --build build
|
||||||
```
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Model
|
## Model
|
||||||
|
|
||||||
Skip this step if you only need `--mocked`.
|
> Skip this step if you only need `--mocked`.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
mkdir -p models
|
mkdir -p models
|
||||||
@@ -135,6 +76,8 @@ curl -L \
|
|||||||
https://huggingface.co/bartowski/google_gemma-4-E4B-it-GGUF/resolve/main/google_gemma-4-E4B-it-Q6_K.gguf?download=true
|
https://huggingface.co/bartowski/google_gemma-4-E4B-it-GGUF/resolve/main/google_gemma-4-E4B-it-Q6_K.gguf?download=true
|
||||||
```
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Run
|
## Run
|
||||||
|
|
||||||
Run from `build/` so the copied `locations.json` and `prompts/` are available.
|
Run from `build/` so the copied `locations.json` and `prompts/` are available.
|
||||||
@@ -144,7 +87,7 @@ Run from `build/` so the copied `locations.json` and `prompts/` are available.
|
|||||||
./biergarten-pipeline --model models/google_gemma-4-E4B-it-Q6_K.gguf --temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1
|
./biergarten-pipeline --model models/google_gemma-4-E4B-it-Q6_K.gguf --temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1
|
||||||
```
|
```
|
||||||
|
|
||||||
## CLI Flags
|
### CLI Flags
|
||||||
|
|
||||||
| Flag | Purpose |
|
| Flag | Purpose |
|
||||||
| --------------- | ------------------------------------------------------- |
|
| --------------- | ------------------------------------------------------- |
|
||||||
@@ -161,14 +104,65 @@ Run from `build/` so the copied `locations.json` and `prompts/` are available.
|
|||||||
|
|
||||||
The post-build step copies `prompts/` into `build/prompts/`. Rebuild after editing [prompts/system.md](prompts/system.md).
|
The post-build step copies `prompts/` into `build/prompts/`. Rebuild after editing [prompts/system.md](prompts/system.md).
|
||||||
|
|
||||||
## Fixture Strategy
|
---
|
||||||
|
|
||||||
- `--mocked` for stable fixtures, repeatable screenshots, and Storybook runs.
|
## Architecture
|
||||||
- `--model` when geographically grounded content matters for demos.
|
|
||||||
- Keep `locations.json` structured enough to support discovery and future filtering.
|
|
||||||
- Treat SQLite output as seed material for the app's brewery domain, not production data.
|
|
||||||
|
|
||||||
## Consumer Data Shape
|
### Pipeline Stages
|
||||||
|
|
||||||
|
| Stage | Implementation |
|
||||||
|
| -------- | -------------------------------------------------------------------------------------------------------------- |
|
||||||
|
| Load | `JsonLoader::LoadLocations()` reads `locations.json` into typed `Location` records. |
|
||||||
|
| Sample | `BiergartenDataGenerator::QueryCitiesWithCountries()` samples up to 50 locations per run. |
|
||||||
|
| Enrich | `WikipediaService` fetches city and beer context. Keeps going when a lookup fails. |
|
||||||
|
| Generate | `MockGenerator` or `LlamaGenerator` produces brewery names and descriptions in English and the local language. |
|
||||||
|
| Log | `spdlog` writes results and warnings to the console. |
|
||||||
|
|
||||||
|
If enrichment or generation fails for a city, that city is skipped and the pipeline continues.
|
||||||
|
|
||||||
|
### Key Components
|
||||||
|
|
||||||
|
- `src/main.cc` — argument parsing and Boost.DI composition root.
|
||||||
|
- `JsonLoader` — validates curated location input.
|
||||||
|
- `WikipediaService` — queries English and local-language extracts, caches results, returns empty context on failure.
|
||||||
|
- `LlamaGenerator` — formats prompts for Gemma 4, validates JSON output, retries malformed responses up to three times. If output looks truncated, the retry raises the token budget before trying again.
|
||||||
|
- `MockGenerator` — stable hash-based output so the same city input always produces the same brewery.
|
||||||
|
- Brewery payloads include English and local-language name and description fields.
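
The retry behaviour described for `LlamaGenerator` can be sketched as below. The three-attempt limit, the `"incomplete JSON"` error string, the 700-token bump, and the final `std::runtime_error` come from the activity diagram in this commit; everything else (`ValidationResult`, `InferFn`, `ValidateFn`, `GenerateWithRetry`, and the wording appended to the prompt) is a hypothetical shape chosen so the loop can be shown without llama.cpp.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical validation outcome; `error` is only meaningful when !ok.
struct ValidationResult {
  bool ok = false;
  std::string error;
};

// Hypothetical callables standing in for model inference and JSON validation.
using InferFn = std::function<std::string(const std::string& prompt, int max_tokens)>;
using ValidateFn = std::function<ValidationResult(const std::string& raw)>;

// Runs inference up to three times. A truncated response raises the token
// budget; every rejection is fed back into the prompt for the next attempt.
std::string GenerateWithRetry(const InferFn& infer, const ValidateFn& validate,
                              std::string user_prompt, int max_tokens) {
  assert(max_tokens > 0);
  constexpr int kMaxAttempts = 3;
  constexpr int kTokenBump = 700;  // value shown in the activity diagram
  for (int attempt = 0; attempt < kMaxAttempts; ++attempt) {
    const std::string raw = infer(user_prompt, max_tokens);
    const ValidationResult result = validate(raw);
    if (result.ok) return raw;
    if (result.error == "incomplete JSON") max_tokens += kTokenBump;
    user_prompt += "\nPrevious attempt was rejected: " + result.error;
  }
  throw std::runtime_error("brewery generation failed after retries");
}
```

Injecting the inference and validation steps as callables is only a convenience for the sketch; it lets the truncation path be exercised with a fake model that returns a cut-off response on the first call.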
|
||||||
|
|
||||||
|
### Runtime Behaviour
|
||||||
|
|
||||||
|
`WikipediaService` queries city, country, and beer-related Wikipedia extracts in both English and the local language, then caches the first successful response per query string. Both extracts are passed into the prompt so the model can draw on local-language sources without a separate translation step.
|
||||||
|
|
||||||
|
`GetLocationContext()` returns an empty string when the web client is unavailable or when lookup/parsing fails.
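
A minimal sketch of the cache-and-degrade behaviour described above, assuming a per-query cache keyed by the query string. `CachedExtractFetcher` and `FetchFn` are hypothetical names for illustration, not the real `WikipediaService` interface; the fetch callable stands in for the HTTP client.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical wrapper: caches the first successful extract per query string
// and returns an empty string when no client is set or the lookup fails.
class CachedExtractFetcher {
 public:
  using FetchFn = std::function<std::optional<std::string>(const std::string&)>;

  explicit CachedExtractFetcher(FetchFn fetch) : fetch_(std::move(fetch)) {}

  std::string Get(const std::string& query) {
    if (const auto it = cache_.find(query); it != cache_.end()) {
      return it->second;  // served from cache, no network call
    }
    if (!fetch_) return {};  // web client unavailable
    std::optional<std::string> extract = fetch_(query);
    if (!extract) return {};  // failed lookups are not cached
    return cache_.emplace(query, std::move(*extract)).first->second;
  }

 private:
  FetchFn fetch_;
  std::unordered_map<std::string, std::string> cache_;
};
```

Leaving failures out of the cache means a later run within the same process can retry the query; whether the real service does this is not stated in the README.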
|
||||||
|
|
||||||
|
`LlamaGenerator` validates model output as structured JSON. The retry path is a safety net for cases where the reasoning block consumes most of the token budget and leaves too little room for the JSON output. All runs to date have produced valid output on the first pass; the path is kept for resilience.
|
||||||
|
|
||||||
|
`MockGenerator` uses stable hashes for repeatable output in demos and Storybook runs.
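
One way to get this kind of repeatable output is to hash the location with a fixed, platform-independent function and use the hash to index into word lists. The sketch below uses FNV-1a for that purpose; the README does not say which hash the real `MockGenerator` uses, and the word lists and `MockBreweryName` helper here are invented for illustration.

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <string_view>

// 64-bit FNV-1a. Unlike std::hash, the result is the same on every platform
// and standard library, which is what makes the fixtures repeatable.
constexpr std::uint64_t Fnv1a64(std::string_view text) {
  std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
  for (const char c : text) {
    hash ^= static_cast<std::uint8_t>(c);
    hash *= 1099511628211ULL;  // FNV prime
  }
  return hash;
}

// Hypothetical word lists; the real generator has its own tables.
constexpr std::array<std::string_view, 4> kAdjectives = {"Golden", "Old", "Copper", "Hidden"};
constexpr std::array<std::string_view, 4> kNouns = {"Barrel", "Kettle", "Cellar", "Hop Yard"};

// Builds a brewery name that depends only on the city and country strings.
std::string MockBreweryName(std::string_view city, std::string_view country) {
  std::string key(city);
  key += '|';
  key += country;
  const std::uint64_t hash = Fnv1a64(key);
  std::string name(kAdjectives[hash % kAdjectives.size()]);
  name += ' ';
  name += kNouns[(hash / kAdjectives.size()) % kNouns.size()];
  name += " Brewing";
  return name;
}
```

Because nothing in the function reads a clock or a random source, calling it twice with the same city and country yields the same name.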
|
||||||
|
|
||||||
|
### Process Flow — Activity Diagram
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
### Architectural Overview — Class Diagram
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Generated Output
|
||||||
|
|
||||||
|
Each successful run stores a `GeneratedBrewery` pair with the source location and a `BreweryResult` payload.
|
||||||
|
|
||||||
|
| Field | Meaning |
|
||||||
|
| ------------------- | ------------------------------------------ |
|
||||||
|
| `name_en` | Brewery name in English. |
|
||||||
|
| `description_en` | Brewery description in English. |
|
||||||
|
| `name_local` | Brewery name in the local language. |
|
||||||
|
| `description_local` | Brewery description in the local language. |
|
||||||
|
|
||||||
|
The log dump also includes city, country, state or province, ISO subdivision code, latitude, and longitude for each entry.
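
For orientation, the record described above might be laid out roughly like this. The struct names `GeneratedBrewery` and `BreweryResult` and the four payload fields come from this README; the `Location` field names follow the Consumer Data Shape table, while the member types, the `brewery` and `location` member names, and the `Summarize` helper are assumptions.

```cpp
#include <string>

// Assumed layout; field names follow the tables in this README.
struct Location {
  std::string city;
  std::string state_province;
  std::string country;
  std::string iso3166_2;
  double latitude = 0.0;
  double longitude = 0.0;
};

struct BreweryResult {
  std::string name_en;
  std::string description_en;
  std::string name_local;
  std::string description_local;
};

// Pairs the source location with the generated payload.
struct GeneratedBrewery {
  Location location;
  BreweryResult brewery;
};

// Hypothetical one-line summary, e.g. for a log line.
std::string Summarize(const GeneratedBrewery& entry) {
  return entry.brewery.name_en + " (" + entry.location.city + ", " +
         entry.location.country + ")";
}
```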
|
||||||
|
|
||||||
|
### Consumer Data Shape
|
||||||
|
|
||||||
| Field | Why it matters |
|
| Field | Why it matters |
|
||||||
| ----------------------------------- | ------------------------------------------------ |
|
| ----------------------------------- | ------------------------------------------------ |
|
||||||
@@ -180,64 +174,11 @@ The post-build step copies `prompts/` into `build/prompts/`. Rebuild after editi
|
|||||||
| `name_local`, `description_local` | Local-language display content |
|
| `name_local`, `description_local` | Local-language display content |
|
||||||
| `region_context` | Richer copy for cards and detail pages |
|
| `region_context` | Richer copy for cards and detail pages |
|
||||||
|
|
||||||
## Process Flow
|
---
|
||||||
|
|
||||||

|
## Language Generation Quality
|
||||||
|
|
||||||
## Next Steps
|
The generation pipeline passes each location's local language codes to the model, which returns `description_local` translated into that language.
|
||||||
|
|
||||||
The pipeline currently produces city-aware brewery records. The next passes add SQLite output and additional fixture types so the app can exercise the full brewery domain without live data.
|
|
||||||
|
|
||||||
### Testing [Very High Importance]
|
|
||||||
|
|
||||||
- Unit test JSON validation and retry logic against malformed, truncated, and empty model outputs.
|
|
||||||
- Integration test the enrichment pipeline with missing context, short context, and fake context inputs.
|
|
||||||
- Adversarial context tests: feed plausible but geographically incorrect Wikipedia extracts and verify the model does not silently blend them with training data.
|
|
||||||
- Verify bilingual enrichment behaviour when only an English extract is available versus when both extracts are present.
|
|
||||||
- Confirm the retry path is reachable when the reasoning block consumes available token budget.
|
|
||||||
|
|
||||||
### SQLite Output [Highest Importance]
|
|
||||||
|
|
||||||
Write generated records to a SQLite database for downstream OLTP seeding. Normalized schema with foreign keys between locations and breweries. Output replaces the current log-only result so the pipeline functions as a proper ingestion layer.
|
|
||||||
|
|
||||||
### Beer Generation
|
|
||||||
|
|
||||||
Generate catalog entries with style, ABV, IBU, color, aroma notes, and food pairing hints. Link beers back to breweries and cities. Keep style coverage wide enough to exercise search, sort, and category filters.
|
|
||||||
|
|
||||||
### User Generation
|
|
||||||
|
|
||||||
Generate user profiles with stable names, bios, locale hints, and preference signals. Include stable IDs for downstream fixture joins. Keep output deterministic for screenshots while allowing larger randomized batches.
|
|
||||||
|
|
||||||
### Check-In System
|
|
||||||
|
|
||||||
Produce timestamped check-in events between users and breweries. Use a J-curve activity profile — a small set of users accounts for most check-ins, the rest appear occasionally. Add bursty behaviour around weekends and travel periods.
|
|
||||||
|
|
||||||
### Beer Ratings
|
|
||||||
|
|
||||||
Generate rating events with a strong positive skew and a long tail of lower scores. Avoid uniform distributions. Attach timestamps and user IDs so the app can compute averages, trends, and per-style comparisons.
|
|
||||||
|
|
||||||
## Code Tour
|
|
||||||
|
|
||||||
- `src/main.cc` — argument parsing and DI composition root.
|
|
||||||
- `src/biergarten_data_generator/` — orchestration, sampling, logging.
|
|
||||||
- `src/services/wikipedia/` — enrichment service and cache.
|
|
||||||
- `src/data_generation/llama/` — local inference, prompt loading, output validation.
|
|
||||||
- `src/data_generation/mock/` — deterministic fallback.
|
|
||||||
- `includes/` — public interfaces and data models.
|
|
||||||
|
|
||||||
## Repo Layout
|
|
||||||
|
|
||||||
| Path | Purpose |
|
|
||||||
| ---------------- | ---------------------------------------------- |
|
|
||||||
| `includes/` | Public headers and shared models. |
|
|
||||||
| `src/` | Implementation files. |
|
|
||||||
| `locations.json` | Curated city input copied into the build tree. |
|
|
||||||
| `prompts/` | System prompt used by the model-backed path. |
|
|
||||||
| `diagrams/` | Architecture and pipeline diagrams. |
|
|
||||||
|
|
||||||
### Language Generation Quality
|
|
||||||
|
|
||||||
The generation pipeline passes local language codes to the model to retrieve a translated description_local.
|
|
||||||
|
|
||||||
Output quality is reliable for high-resource languages such as French, though it may struggle with regional variants and idiomatic phrasing. This can be seen with these data points:
|
Output quality is reliable for high-resource languages such as French, though it may struggle with regional variants and idiomatic phrasing. This can be seen with these data points:
|
||||||
|
|
||||||
@@ -296,9 +237,7 @@ Output quality is reliable for high-resource languages such as French, though it
|
|||||||
]
|
]
|
||||||
```
|
```
|
||||||
|
|
||||||
#### Output:
|
Output sample: [./out-sample/french-cities.log.example](out-sample/french-cities.log.example)
|
||||||
|
|
||||||
seen in [./out-sample/french-cities.log.example](out-sample/french-cities.log.example)
|
|
||||||
|
|
||||||
### Known Issues
|
### Known Issues
|
||||||
|
|
||||||
@@ -306,8 +245,99 @@ seen in [./out-sample/french-cities.log.example](out-sample/french-cities.log.ex
|
|||||||
|
|
||||||
For languages such as Welsh (Wales), Maori (Aotearoa/New Zealand), or Sicilian (Sicily, Italy), the model can generate text that looks syntactically plausible but is semantically incoherent. This comes from limited training-data coverage rather than prompt engineering.
|
For languages such as Welsh (Wales), Maori (Aotearoa/New Zealand), or Sicilian (Sicily, Italy), the model can generate text that looks syntactically plausible but is semantically incoherent. This comes from limited training-data coverage rather than prompt engineering.
|
||||||
|
|
||||||
##### Proposed Mitigations
|
#### Proposed Mitigations
|
||||||
|
|
||||||
- Prevention via allowlist: introduce a high-resource language allowlist. If a location's code is unlisted, skip description_local generation and fall back to English.
|
- **Prevention via allowlist:** introduce a high-resource language allowlist. If a location's code is unlisted, skip `description_local` generation and fall back to English.
|
||||||
- Upstream sanitization: strip known low-resource language codes from the locations.json payload before generation.
|
- **Upstream sanitization:** strip known low-resource language codes from the `locations.json` payload before generation.
|
||||||
- Downstream flagging: add a description_local_confidence column to the SQLite schema so downstream applications can filter or flag potentially hallucinated text by language tier.
|
- **Downstream flagging:** add a `description_local_confidence` column to the SQLite schema so downstream applications can filter or flag potentially hallucinated text by language tier.
|
||||||
|
|
||||||
---

## Tested Hardware

### ARM macOS — M1 Pro

|           |                                   |
| --------- | --------------------------------- |
| Host      | MacBook Pro 14" (2021)            |
| CPU       | Apple M1 Pro (8-core)             |
| GPU       | Apple M1 Pro (14-core integrated) |
| Memory    | 16 GB                             |
| Model     | Gemma 4 E4B                       |
| Inference | llama.cpp with Metal              |

### x86_64 Linux — NVIDIA RTX 2000

|           |                                |
| --------- | ------------------------------ |
| Host      | ThinkPad P1 Gen 7 (Fedora 43)  |
| CPU       | Intel Core Ultra 7 155H        |
| GPU       | NVIDIA RTX 2000 Ada Generation |
| Memory    | 32 GB                          |
| Model     | Gemma 4 E4B                    |
| Inference | llama.cpp with CUDA 12.x       |

---

## Repo Layout

| Path             | Purpose                                        |
| ---------------- | ---------------------------------------------- |
| `includes/`      | Public headers and shared models.              |
| `src/`           | Implementation files.                          |
| `locations.json` | Curated city input copied into the build tree. |
| `prompts/`       | System prompt used by the model-backed path.   |
| `diagrams/`      | Architecture and pipeline diagrams.            |

---

## Code Tour

- `src/main.cc` — argument parsing and DI composition root.
- `src/biergarten_data_generator/` — orchestration, sampling, logging.
- `src/services/wikipedia/` — enrichment service and cache.
- `src/data_generation/llama/` — local inference, prompt loading, output validation.
- `src/data_generation/mock/` — deterministic fallback.

---

## Fixture Strategy

- `--mocked` for stable fixtures, repeatable screenshots, and Storybook runs.
- `--model` when geographically grounded content matters for demos.
- Keep `locations.json` structured enough to support discovery and future filtering.
- Treat SQLite output as seed material for the app's brewery domain, not production data.

---

|
## Next Steps
|
||||||
|
|
||||||
|
The pipeline currently produces city-aware brewery records. The next passes add SQLite output and additional fixture types so the app can exercise the full brewery domain without live data.
|
||||||
|
|
||||||
|
### SQLite Output _(Highest Importance)_
|
||||||
|
|
||||||
|
Write generated records to a SQLite database for downstream OLTP seeding, using a normalized schema with foreign keys between locations and breweries. This output replaces the current log-only result so the pipeline functions as a proper ingestion layer.
|
||||||
|
|
||||||
|
### Testing _(Very High Importance)_
|
||||||
|
|
||||||
|
- Unit test JSON validation and retry logic against malformed, truncated, and empty model outputs.
|
||||||
|
- Integration test the enrichment pipeline with missing context, short context, and fake context inputs.
|
||||||
|
- Adversarial context tests: feed plausible but geographically incorrect Wikipedia extracts and verify the model does not silently blend them with training data.
|
||||||
|
- Verify bilingual enrichment behaviour when only an English extract is available versus when both extracts are present.
|
||||||
|
- Confirm the retry path is reachable when the reasoning block consumes available token budget.
|
||||||
|
|
||||||
|
### Beer Generation
|
||||||
|
|
||||||
|
Generate catalog entries with style, ABV, IBU, color, aroma notes, and food pairing hints. Link beers back to breweries and cities. Keep style coverage wide enough to exercise search, sort, and category filters.
|
||||||
|
|
||||||
|
### User Generation
|
||||||
|
|
||||||
|
Generate user profiles with stable names, bios, locale hints, and preference signals. Include stable IDs for downstream fixture joins. Keep output deterministic for screenshots while allowing larger randomized batches.
|
||||||
|
|
||||||
|
### Check-In System
|
||||||
|
|
||||||
|
Produce timestamped check-in events between users and breweries. Use a J-curve activity profile — a small set of users accounts for most check-ins, the rest appear occasionally. Add bursty behaviour around weekends and travel periods.
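
One possible way to get the skewed activity profile described above is to draw the acting user for each event from a Zipf-like distribution. This is a sketch of that idea only: the 1/(rank+1) weighting, `ZipfWeights`, and `SampleCheckInUsers` are assumptions, and timestamps and weekend bursts are left out.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Zipf-like weights: the user at 0-based rank r gets weight 1 / (r + 1),
// so a few low-rank users receive most of the probability mass.
std::vector<double> ZipfWeights(std::size_t user_count) {
  std::vector<double> weights;
  weights.reserve(user_count);
  for (std::size_t rank = 0; rank < user_count; ++rank) {
    weights.push_back(1.0 / static_cast<double>(rank + 1));
  }
  return weights;
}

// Returns one user index per check-in event. Low indices dominate.
std::vector<std::size_t> SampleCheckInUsers(std::size_t user_count,
                                            std::size_t event_count,
                                            std::mt19937& rng) {
  std::vector<std::size_t> users;
  if (user_count == 0) return users;  // nobody to assign events to
  const std::vector<double> weights = ZipfWeights(user_count);
  std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
  users.reserve(event_count);
  for (std::size_t i = 0; i < event_count; ++i) {
    users.push_back(pick(rng));
  }
  return users;
}
```

Seeding the engine with a fixed value keeps a given build's fixtures repeatable, although different standard library implementations may map the same seed to different draws.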
|
||||||
|
|
||||||
|
### Beer Ratings
|
||||||
|
|
||||||
|
Generate rating events with a strong positive skew and a long tail of lower scores. Avoid uniform distributions. Attach timestamps and user IDs so the app can compute averages, trends, and per-style comparisons.
|
||||||
|
|||||||
@@ -1,89 +1,128 @@
|
|||||||
@startuml
|
@startuml
|
||||||
skinparam style strictuml
|
skinparam style strictuml
|
||||||
skinparam ActivityBackgroundColor #FEFECE
|
skinparam defaultFontName "DM Sans"
|
||||||
skinparam ActivityBorderColor #A80036
|
skinparam defaultFontSize 14
|
||||||
|
skinparam titleFontName "Volkhov"
|
||||||
|
skinparam titleFontSize 20
|
||||||
|
skinparam backgroundColor #FAFCF9
|
||||||
|
skinparam defaultFontColor #28342A
|
||||||
|
skinparam titleFontColor #28342A
|
||||||
|
skinparam ArrowColor #628A5B
|
||||||
|
skinparam NoteBackgroundColor #EAF0E8
|
||||||
|
skinparam NoteBorderColor #547461
|
||||||
|
skinparam ActivityBackgroundColor #FAFCF9
|
||||||
|
skinparam ActivityBorderColor #547461
|
||||||
|
skinparam ActivityDiamondBackgroundColor #FAFCF9
|
||||||
|
skinparam ActivityDiamondBorderColor #628A5B
|
||||||
|
skinparam ActivityBarColor #628A5B
|
||||||
|
skinparam SwimlaneBorderColor transparent
|
||||||
|
skinparam SwimlaneBorderThickness 0
|
||||||
|
|
||||||
title Biergarten Pipeline - Activity Diagram (Swimlanes)
|
title The Biergarten Data Pipeline
|
||||||
|
|
||||||
|Orchestrator|
|
|#F2F6F0|main.cc|
|
||||||
start
|
start
|
||||||
:Parse Command-Line Arguments;
|
:ParseArguments(argc, argv);
|
||||||
note right
|
note right
|
||||||
Determines mode (mocked vs model)
|
Validates --mocked, --model,
|
||||||
and LLM sampling parameters.
|
--temperature, --top-p, etc.
|
||||||
end note
|
end note
|
||||||
|
|
||||||
if (Are arguments valid?) then (no)
|
if (Are arguments valid?) then (no)
|
||||||
:Log Error & Display Usage;
|
:spdlog::error usage info;
|
||||||
stop
|
stop
|
||||||
else (yes)
|
else (yes)
|
||||||
endif
|
endif
|
||||||
|
|
||||||
:Initialize Global States;
|
:Init CurlGlobalState & LlamaBackendState;
|
||||||
:Construct Dependency Injector (Boost.DI);
|
:di::make_injector(...);
|
||||||
:Instantiate BiergartenDataGenerator;
|
note right
|
||||||
|
Binds CURLWebClient, WikipediaService,
|
||||||
|
Gemma4JinjaPromptFormatter, and
|
||||||
|
either MockGenerator or LlamaGenerator
|
||||||
|
end note
|
||||||
|
:injector.create<BiergartenDataGenerator>();
|
||||||
|
:BiergartenDataGenerator::Run();
|
||||||
|
|
||||||
|DataLoader|
|
|#EAF0E8|BiergartenDataGenerator|
|
||||||
|
:QueryCitiesWithCountries();
|
||||||
|
|
||||||
|
|#E2EBDC|JsonLoader|
|
||||||
:JsonLoader::LoadLocations("locations.json");
|
:JsonLoader::LoadLocations("locations.json");
|
||||||
:Sample up to 50 Locations;
|
:std::ranges::sample(all_locations, 50);
|
||||||
note right: Randomly samples from loaded array
|
|
||||||
|
|
||||||
|Enrichment|
|
|#EAF0E8|BiergartenDataGenerator|
|
||||||
while (For each sampled Location?) is (Remaining locations)
|
while (For each sampled Location?) is (Remaining cities)
|
||||||
:GetLocationContext(Location);
|
|#DCE8D8|WikipediaService|
|
||||||
:Fetch extract for Region (City, Country);
|
:GetLocationContext(loc);
|
||||||
:Fetch extract for "beer in <Country>";
|
:FetchExtract("City, Country");
|
||||||
:Fetch extract for "beer in <City>";
|
:FetchExtract("beer in Country");
|
||||||
:Store EnrichedCity (Location + Context);
|
:FetchExtract("beer in City");
|
+note right: Backed by CURLWebClient::Get

+|#EAF0E8|BiergartenDataGenerator|
+if (Lookup failed?) then (yes)
+:spdlog::warn "context lookup failed";
+else (no)
+:Store EnrichedCity{Location, region_context};
+endif
 endwhile (Done)

-|Generator|
+:GenerateBreweries(enriched_cities);
-while (For each EnrichedCity?) is (Remaining enriched cities)
+|#E5EDE1|DataGenerator|
+while (For each EnrichedCity?) is (Remaining cities)
 if (Generator Mode) then (MockGenerator)
-:Calculate Deterministic Hash;
+:DeterministicHash(location);
-:Select Adjective, Noun, and Description;
+:Select from kBreweryAdjectives, kBreweryNouns,\nkBreweryDescriptions;
-:Build BreweryResult;
+:Format BreweryResult;
-:Store GeneratedBrewery into results;

 else (LlamaGenerator)
-:Prepare System and User Prompts;
+:PrepareRegionContext(region_context);
-:Attempt Counter = 1;
+:LoadBrewerySystemPrompt("prompts/system.md");
+:Format user_prompt;
+:Attempt = 0;
 repeat
-:Run Model Inference (llama.cpp);
+:Infer(system_prompt, user_prompt, max_tokens, kBreweryJsonGrammar);
-note right: Applies Gemma 4 Jinja formatting\nand GBNF JSON Grammar
+note right
+Uses Gemma4JinjaPromptFormatter,
-:Validate JSON Output (ValidateBreweryJson);
+llama_tokenize, and llama_sampler_sample
+end note
+:ValidateBreweryJson(raw, brewery);

 if (Is JSON Valid?) then (yes)
-:Parse into BreweryResult;
 break
 else (no)
 if (Error == "incomplete JSON") then (yes)
-:Increase max_tokens threshold;
+:max_tokens += 700;
+endif
+:Update user_prompt with validation error;
+:Attempt++;
+endif

+repeat while (Attempt < 3?) is (yes)

+if (Still Invalid?) then (yes)
+:throw std::runtime_error;
 else (no)
+:Return BreweryResult;
 endif
-:Append Error details to Prompt for LLM correction;
-:Increment Attempt Counter;
 endif

-repeat while (Attempt <= 3?) is (yes)
+|#EAF0E8|BiergartenDataGenerator|
+if (Exception thrown?) then (yes)
-if (Still Invalid after 3 attempts?) then (yes)
+:spdlog::warn "brewery generation failed";
-|Orchestrator|
-:Log Warning;
-|Generator|
-:Skip City;
 else (no)
-:Store GeneratedBrewery into results;
+:Store GeneratedBrewery;
 endif
-endif
+|#E5EDE1|DataGenerator|

 endwhile (Done)

-|Orchestrator|
+|#EAF0E8|BiergartenDataGenerator|
 :LogResults();
-note right: Dumps generated JSON fields to spdlog
+note right: spdlog::info dump of generated JSON fields
-:Exit Pipeline Successfully (0);
+|#F2F6F0|main.cc|
+:Return 0;
 stop

 @enduml
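The retry branch in the updated activity diagram (`Attempt = 0`, `Infer`, `ValidateBreweryJson`, `max_tokens += 700` on `"incomplete JSON"`, at most three attempts, then `throw std::runtime_error`) can be sketched in C++ roughly as follows. This is a minimal illustration, not the repository's code: `GenerateWithRetry`, `InferFn`, `ValidateFn`, the fields of `BreweryResult`, and the way the validation error is appended to the prompt are all assumptions made for the example.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the pipeline's BreweryResult; the real fields may differ.
struct BreweryResult {
  std::string name;
  std::string description;
};

// Hypothetical callback types standing in for Infer and ValidateBreweryJson.
using InferFn =
    std::function<std::string(const std::string& user_prompt, int max_tokens)>;
// Returns an empty string on success, otherwise a validation error message.
using ValidateFn =
    std::function<std::string(const std::string& raw, BreweryResult& out)>;

BreweryResult GenerateWithRetry(const InferFn& infer, const ValidateFn& validate,
                                std::string user_prompt, int max_tokens,
                                int max_attempts = 3) {
  assert(max_attempts > 0);
  BreweryResult brewery;
  std::string last_error;
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    const std::string raw = infer(user_prompt, max_tokens);
    last_error = validate(raw, brewery);
    if (last_error.empty()) {
      return brewery;  // Valid JSON: leave the loop with the parsed result.
    }
    if (last_error == "incomplete JSON") {
      max_tokens += 700;  // Output was truncated, so allow a longer completion.
    }
    // Assumed feedback format: append the error so the next attempt can correct it.
    user_prompt += "\nPrevious attempt was rejected: " + last_error;
  }
  // Still invalid after max_attempts: the caller logs a warning and skips the city.
  throw std::runtime_error("brewery generation failed: " + last_error);
}
```

In the diagram, the exception is caught in the `BiergartenDataGenerator` lane, which logs `"brewery generation failed"` and moves on to the next city instead of aborting the run.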
File diff suppressed because one or more lines are too long
112
pipeline/diagrams/class-diagram.puml
Normal file
@@ -0,0 +1,112 @@
+@startuml
+skinparam style strictuml
+skinparam defaultFontName "DM Sans"
+skinparam defaultFontSize 14
+skinparam titleFontName "Volkhov"
+skinparam titleFontSize 20
+skinparam backgroundColor #FAFCF9
+skinparam defaultFontColor #28342A
+skinparam titleFontColor #28342A
+skinparam ArrowColor #628A5B
+
+skinparam class {
+  BackgroundColor #FAFCF9
+  HeaderBackgroundColor #EAF0E8
+  BorderColor #547461
+  ArrowColor #628A5B
+  FontColor #28342A
+}
+
+skinparam note {
+  BackgroundColor #EAF0E8
+  BorderColor #547461
+  FontColor #28342A
+}
+
+title The Biergarten Data Pipeline - Class Diagram
+
+class BiergartenDataGenerator {
+  - context_service_ : std::unique_ptr<IEnrichmentService>
+  - generator_ : std::unique_ptr<DataGenerator>
+  - generated_breweries_ : std::vector<GeneratedBrewery>
+  + Run() : bool
+  - QueryCitiesWithCountries() : std::vector<Location>
+  - GenerateBreweries(cities : std::span<const EnrichedCity>) : void
+  - LogResults() : void
+}
+
+interface IEnrichmentService <<interface>> {
+  + GetLocationContext(loc : const Location&) : std::string
+}
+
+class WikipediaService {
+  - client_ : std::unique_ptr<WebClient>
+  - extract_cache_ : std::unordered_map<std::string, std::string>
+  + GetLocationContext(loc : const Location&) : std::string
+  - FetchExtract(query : std::string_view) : std::string
+}
+
+interface WebClient <<interface>> {
+  + Get(url : const std::string&) : std::string
+  + UrlEncode(value : const std::string&) : std::string
+}
+
+class CURLWebClient {
+  + Get(url : const std::string&) : std::string
+  + UrlEncode(value : const std::string&) : std::string
+}
+
+interface DataGenerator <<interface>> {
+  + GenerateBrewery(location : const Location&, region_context : const std::string&) : BreweryResult
+  + GenerateUser(locale : const std::string&) : UserResult
+}
+
+class MockGenerator {
+  + GenerateBrewery(...) : BreweryResult
+  + GenerateUser(...) : UserResult
+  - DeterministicHash(location : const Location&) : size_t
+}
+
+class LlamaGenerator {
+  - model_ : ModelHandle
+  - context_ : ContextHandle
+  - prompt_formatter_ : std::unique_ptr<IPromptFormatter>
+  - rng_ : std::mt19937
+  + GenerateBrewery(...) : BreweryResult
+  + GenerateUser(...) : UserResult
+  - Load(model_path : const std::string&) : void
+  - Infer(...) : std::string
+  - InferFormatted(...) : std::string
+  - LoadBrewerySystemPrompt(...) : std::string
+}
+
+interface IPromptFormatter <<interface>> {
+  + Format(system_prompt : std::string_view, user_prompt : std::string_view) : std::string
+}
+
+class Gemma4JinjaPromptFormatter {
+  + Format(system_prompt : std::string_view, user_prompt : std::string_view) : std::string
+}
+
+class JsonLoader {
+  + {static} LoadLocations(filepath : const std::filesystem::path&) : std::vector<Location>
+}
+
+' Structural Relationships / Dependency Injection
+BiergartenDataGenerator *-- IEnrichmentService : owns
+BiergartenDataGenerator *-- DataGenerator : owns
+
+IEnrichmentService <|.. WikipediaService : implements
+WikipediaService *-- WebClient : owns
+
+WebClient <|.. CURLWebClient : implements
+
+DataGenerator <|.. MockGenerator : implements
+DataGenerator <|.. LlamaGenerator : implements
+
+LlamaGenerator *-- IPromptFormatter : uses
+
+IPromptFormatter <|.. Gemma4JinjaPromptFormatter : implements
+
+BiergartenDataGenerator ..> JsonLoader : uses
+@enduml
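To make the `DataGenerator` / `MockGenerator` relationship in the class diagram concrete, here is a small C++ sketch of how a deterministic mock could sit behind the interface. It is an illustration under stated assumptions, not the repository's implementation: the fields of `Location` and `BreweryResult`, the contents of the `kBrewery*` tables, and the choice of FNV-1a for `DeterministicHash` are all hypothetical, and `GenerateUser` is omitted for brevity.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <string_view>

// Hypothetical shapes; the real Location and BreweryResult may carry more fields.
struct Location {
  std::string city;
  std::string country;
};
struct BreweryResult {
  std::string name;
  std::string description;
};

// Placeholder word lists standing in for the pipeline's kBrewery* tables.
inline constexpr std::array<std::string_view, 3> kBreweryAdjectives = {
    "Golden", "Copper", "Old"};
inline constexpr std::array<std::string_view, 3> kBreweryNouns = {
    "Kettle", "Barrel", "Cellar"};
inline constexpr std::array<std::string_view, 2> kBreweryDescriptions = {
    "A small neighbourhood brewery.", "A family-run taproom."};

class DataGenerator {
 public:
  virtual ~DataGenerator() = default;
  virtual BreweryResult GenerateBrewery(const Location& location,
                                        const std::string& region_context) = 0;
};

class MockGenerator final : public DataGenerator {
 public:
  BreweryResult GenerateBrewery(const Location& location,
                                const std::string& /*region_context*/) override {
    const std::size_t hash = DeterministicHash(location);
    // Use different "digits" of the hash so the three picks vary independently.
    const std::size_t a = kBreweryAdjectives.size();
    const std::size_t n = kBreweryNouns.size();
    const std::string_view adjective = kBreweryAdjectives[hash % a];
    const std::string_view noun = kBreweryNouns[(hash / a) % n];
    const std::string_view description =
        kBreweryDescriptions[(hash / (a * n)) % kBreweryDescriptions.size()];
    return BreweryResult{std::string(adjective) + " " + std::string(noun),
                         std::string(description)};
  }

 private:
  // FNV-1a (64-bit) is assumed here because it gives the same value on every
  // run and platform, which std::hash does not guarantee.
  static std::size_t DeterministicHash(const Location& location) {
    std::uint64_t hash = 14695981039346656037ULL;
    const std::string key = location.city + "|" + location.country;
    for (const unsigned char byte : key) {
      hash ^= byte;
      hash *= 1099511628211ULL;
    }
    return static_cast<std::size_t>(hash);
  }
};
```

Because the orchestrator holds a `std::unique_ptr<DataGenerator>`, swapping `MockGenerator` for `LlamaGenerator` only changes which object is constructed, and the same location always yields the same mock brewery.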
1
pipeline/diagrams/class-diagram.svg
Normal file
File diff suppressed because one or more lines are too long