diff --git a/pipeline/ETHICS-AND-KNOWN-ISSUES.md b/pipeline/ETHICS-AND-KNOWN-ISSUES.md new file mode 100644 index 0000000..f218e73 --- /dev/null +++ b/pipeline/ETHICS-AND-KNOWN-ISSUES.md @@ -0,0 +1,324 @@ +# Ethics, Bias, and Known Issues + +This document covers the ethical context of the Biergarten Pipeline's output, +the model's biases, and known issues including hallucinated brewing science and +low-resource language failures. + +> Note that all testing was done using `google_gemma-4-E4B-it-Q6_K.gguf`. + +## Table of Contents + +- [What This Dataset Is](#what-this-dataset-is) +- [What This Dataset Is Not](#what-this-dataset-is-not) +- [Model Bias and Language Quality](#model-bias-and-language-quality) +- [Western and Eurocentric Lens](#western-and-eurocentric-lens) +- [Wikipedia Enrichment](#wikipedia-enrichment) +- [The "Avoid AI Phrases" Prompt Instruction](#the-avoid-ai-phrases-prompt-instruction) +- [Known Issues](#known-issues) + - [Hallucinated Brewing Techniques](#hallucinated-brewing-techniques) + - [Low-Resource Language Hallucination](#low-resource-language-hallucination) + +--- + +## What This Dataset Is + +This is AI-generated fixture data for a proof-of-concept version of The +Biergarten App. Anyone who interacts with an application seeded from this +pipeline must be told upfront that the content is AI-generated. + +--- + +## What This Dataset Is Not + +The pipeline is not intended to produce accurate brewing science, faithful +cultural representation, or reliable local-language text. Hallucinations such as +invented fermentation techniques, or incoherent local-language prose, are +expected, observed, and partially documented in [Known Issues](#known-issues) +below. + +Human control sits at the context layer (i.e. prompt design, Wikipedia +enrichment). Statistical output shapes in future pipeline stages (check-in +distributions, rating skews, activity profiles) will be handled the same way. 
+ +**Treat this data as an exercise in prompt engineering and model behaviour, not +as a source of truth for brewing techniques or cultural representation.** + +**Natural language processing, although a powerful tool for data analysis and +generation, should be treated with scrutiny. Human language is not just data +points to be analyzed; it carries deep cultural and human meaning that +artificial intelligence is incapable of grasping.** + +--- + +## Model Bias and Language Quality + +The underlying model's training biases surface within this pipeline. +Output quality tracks with how well a language is represented in the training +corpus: standard French (`fr-FR`) produces coherent text; regional variants like +`fr-CD` and `fr-CI` are noticeably weaker; low-resource languages like Welsh, +Māori, and Sicilian produce output that is syntactically plausible but often +semantically broken. + +This is a property of the training distribution, not something that can be +mitigated through prompt design. This is a well-documented characteristic of +large language models trained predominantly on English-language +material.[^llm-bias] + +Mitigations are documented in +[Known Issues: Low-Resource Language Hallucination](#low-resource-language-hallucination). + +### Western and Eurocentric Lens + +The model's training data skews heavily Western and North American. When +generating brewery descriptions for Kinshasa, Abidjan, or Osaka, for example, it +defaults to framing and cultural reference points drawn from that perspective +rather than from the lived context of those cities. Wikipedia enrichment grounds +some generation in city-specific material, but it does not eliminate the skew. + +**Output should be read with an understanding of this bias.** + +--- + +## Wikipedia Enrichment + +City and beer context is fetched from the Wikipedia API. 
Wikipedia text is +co-licensed under the **Creative Commons Attribution-ShareAlike 4.0 +International License (CC BY-SA 4.0)** and the **GNU Free Documentation License +(GFDL)**.[^wp-license] + +Wikipedia's own accuracy limitations and editorial biases can propagate into +generated descriptions. + +--- + +## The "Avoid AI Phrases" Prompt Instruction + +The system prompt instructs the model to avoid common AI-generated phrasing +patterns. This is a prompt engineering experiment: + +> How far can a model be pushed against its own stylistic defaults? + +This is not an attempt to disguise the content as human-written. All downstream +consumers are informed of the AI-generated origin before engagement. + +--- + +## Known Issues + +### Hallucinated Brewing Techniques + +When forced by the system prompt to generate a "highly specialized technical +brewing detail," the model frequently hallucinates fermentation science and +brewing chemistry. While the resulting text sounds confident, it reads as +nonsensical to a reader with brewing or scientific expertise. + +Small-parameter models such as Gemma 4 E4B cannot reliably satisfy prompts +calling for specialist brewing detail. This is consistent with the **CHOKE** +failure mode (Certain Hallucinations Overriding Known Evidence) [^llm-choke] +where a specific-sounding prompt causes a model to produce confident, coherent, +and factually wrong output. + +#### Example: Osaka, produced using Gemma 4 E4B + +```log +[2026-04-21 15:04:40.258] [info] 11. city="Osaka" country="Japan" state="Osaka" iso3166_2=JP-27 lat=34.6937 lon=135.5023 +[2026-04-21 15:04:40.258] [info] brewery_name_en="Kani-no-Kuni Brewing" +[2026-04-21 15:04:40.258] [info] brewery_description_en="The humid scent of takoyaki and savory grilled squid always settles over the neighborhood early in the morning, mingling strangely with the metallic tang of spent grain outside our drying shed. 
We decided to build this place here because Osaka is a city that never pauses, a place built around the constant exchange of goods and tastes, and that is what we want our beer to reflect. Many people here are accustomed to the dependable, clean taste of the major pale lagers, but we are interested in exploring the deeper, more textured expressions of fermentation. Our base malts are specifically crafted by incorporating strains of local rice, like Yamada Nishiki, which we malt and use not for sake, but for its complex, buttery flavor profile during the mash. This combination, followed by a slow, anaerobic aging process, gives our beers a profound, satisfying depth that stands apart from the hurried mass-market style. The heavy, deep red brick of the original warehouse wall has absorbed decades of Kansai humidity and seasonal rains, creating a patina that tells the exact story of this district's tireless movement. We chose this structure not for its charm, but for its resilience and the sheer density of the local history held within its mortar. Our goal is simply to serve a drink worthy of this powerful trading city. If you are looking for a quiet spot away from the main thoroughfare, look for us just off the side street near Shinsekai." 
+[2026-04-21 15:04:40.258] [info] brewery_name_local="カニの国ブルワリー" +[2026-04-21 15:04:40.258] [info] brewery_description_local="早朝の、たこ焼きや香ばしいイカ焼きの湿った匂いは、いつも乾燥小屋の外にある使用済み麦芽の金属的な匂いと奇妙に混ざり合って近隣に漂います。私たちはこの場所に店を構えることを決めたのです。なぜなら、大阪は決して止まることのない都市であり、商品と味が絶え間なく交換されることで築かれた場所だからです。地元の多くの方々は、信頼できる大規模な淡麗ラガーの味が習慣になっていますが、私たちは発酵の、より深く、より複雑な表現を探求することに関心があります。私たちのベースモルトは、山田錦のような地元の米の品種を意図的に組み込んで作られています。この米を酒ではなく、麦芽として、仕込みの最中にその複雑でバターのような風味を引き出すために使用しています。この組み合わせを、ゆっくりとした嫌気的な熟成プロセスに続けることで、私たちのビールは、慌ただしい市場のスタイルとは一線を画す、深みのある、満足感のある複雑さを持っています。オリジナルの倉庫の重く深紅のレンガ壁は、関西特有の湿気と季節の雨を何十年も吸収し、この地区の絶え間ない動きの正確な物語を語るような古色を帯びています。私たちはこの構造物を、その魅力のためではなく、その回復力とモルタルに込められた地域の歴史の密度ゆえに選びました。私たちの目標は、ただこの力強い交易都市に値する飲み物を提供することだけです。もしメインの通りから離れた静かな場所をお探しなら、新世界近くの脇道にある私たちを探してください。" +``` + +A review of the English text above reveals several inaccuracies in its brewing +claims, listed below. No comments could be made on the local-language version +due to my own lack of proficiency in Japanese. + +#### 1. "Buttery flavours" framed as a desirable malt-derived flavour + +**Incorrect.** + +Buttery flavour in beer is characteristic of diacetyl, a fermentation byproduct +of yeast metabolism, not a malt-derived compound.[^diacetyl-source] Diacetyl +produces a buttery or butterscotch off-flavour and is carefully managed in many +beer styles, in particular lighter beers, through a process called a +_diacetyl rest_. In this process, fermentation temperature is briefly raised to +allow yeast to reabsorb the compound before packaging.[^diacetyl-rest] + +The Oxford Companion to Beer claims that, while low levels are tolerable in some +ales and stouts, diacetyl is considered undesirable at any perceptible +concentration when it results from bacterial contamination or stressed +fermentation.[^oxford-beer] + +#### 2. 
Yamada Nishiki sake rice described as a self-saccharifying base malt + +**Incorrect.** + +Yamada Nishiki (_山田錦_) is a short-grain Japanese rice bred specifically for +sake production.[^yn-wiki] Its value lies in its large starchy core +(_shinpaku_), low protein content, and amenability to _koji_ mold penetration +during saccharification.[^yn-sakestreet] Sake brewing does not use the grain's +own enzymatic activity for saccharification; it relies on _Aspergillus oryzae_ +(koji mold) grown on a portion of the steamed rice to convert starches to +fermentable sugars.[^yn-sakeonline] + +#### 3. "Anaerobic aging" presented as a differentiating technique + +**Misleading.** + +Anaerobic conditions during packaging and aging are not a differentiating +technique. They are the standard baseline for all commercial +beer production. Breweries exclude oxygen as a top priority for packaging and +shelf stability; published research in _Microbiology Spectrum_ confirms that +packaged beer constitutes an anaerobic environment by definition.[^anaerobic] +Professional packaging lines use CO_2 purges and closed transfers specifically +to maintain this state.[^packaging] Framing anaerobic aging as a distinctive +practice is misleading and suggests hallucinated output. + +### Low-Resource Language Hallucination + +The generation pipeline passes local language codes to the model to retrieve a +translated `description_local`. Output quality is reliable for high-resource +languages such as French, though it may struggle with regional variants and +idiomatic phrasing. 
+ +```json +[ + { + "city": "Kinshasa", + "state_province": "Kinshasa", + "iso3166_2": "CD-KN", + "country": "Democratic Republic of the Congo", + "iso3166_1": "CD", + "latitude": -4.4419, + "longitude": 15.2663, + "local_languages": ["fr-CD", "ln"] + }, + { + "city": "Paris", + "state_province": "Île-de-France", + "iso3166_2": "FR-IDF", + "country": "France", + "iso3166_1": "FR", + "latitude": 48.8566, + "longitude": 2.3522, + "local_languages": ["fr-FR"] + }, + { + "city": "Abidjan", + "state_province": "Abidjan", + "iso3166_2": "CI-AB", + "country": "Ivory Coast", + "iso3166_1": "CI", + "latitude": 5.36, + "longitude": -4.0083, + "local_languages": ["fr-CI"] + }, + { + "city": "Montreal", + "state_province": "Quebec", + "iso3166_2": "CA-QC", + "country": "Canada", + "iso3166_1": "CA", + "latitude": 45.5017, + "longitude": -73.5673, + "local_languages": ["fr-CA"] + }, + { + "city": "Brussels", + "state_province": "Brussels-Capital Region", + "iso3166_2": "BE-BRU", + "country": "Belgium", + "iso3166_1": "BE", + "latitude": 50.8503, + "longitude": 4.3517, + "local_languages": ["fr-BE", "nl-BE"] + } +] +``` + +When this dataset is fed into the pipeline, the model often reasons that a local variant of French is needed but then defaults to standardized French, devoid of cultural or linguistic nuance. + +For languages such as Welsh (Wales), Māori (Aotearoa/New Zealand), or Sicilian +(Sicily, Italy), the model can generate text that looks syntactically plausible +but is semantically incoherent. This comes from limited training-data coverage +rather than prompt engineering. + +Output sample: +[./out-sample/french-cities.example](out-sample/french-cities.example) + +#### Proposed Mitigations + +- **Prevention via allowlist:** introduce a high-resource language allowlist. If + a location's code is unlisted, skip `description_local` generation and fall + back to English. 
+- **Upstream sanitization:** strip known low-resource language codes from the + `locations.json` payload before generation. +- **Downstream flagging:** add a `description_local_confidence` column to the + SQLite schema so downstream applications can filter or flag potentially + hallucinated text by language tier. + +--- + +## Footnotes + +[^llm-choke]: CHOKE (Certain Hallucinations Overriding Known Evidence) is a hallucination failure mode defined by Simhi et al. (2025), in which a model that can consistently answer a question correctly produces a confident, wrong response when the prompt is trivially perturbed. Source: Trust Me, I'm Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer — Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, Yonatan Belinkov. + +[^llm-bias]: + e.g., Blasi et al. (2022), "Systematic Inequalities in Language Technology + Performance across the World's Languages," _ACL Anthology_. The pattern is + consistent with models trained predominantly on English-language web + corpora. + +[^wp-license]: + Source: + [Wikipedia:FAQ/Copyright](https://en.wikipedia.org/wiki/Wikipedia:FAQ/Copyright). + +[^cc-sa]: + Creative Commons CC BY-SA 4.0 deed: "If you remix, transform, or build upon + the material, you must distribute your contributions under the same license + as the original." Source: + [creativecommons.org/licenses/by-sa/4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en). + +[^diacetyl-source]: + White Labs confirms that diacetyl is a yeast-derived fermentation byproduct: specifically, a compound produced during amino acid metabolism that leaks out of the yeast cell and oxidises into its characteristic buttery off-flavour. It is generally considered undesirable at any perceived level in most styles, though low levels are tolerated in some English ales and European lagers. + Source: + [whitelabs.com — Compound Spotlight: Diacetyl](https://www.whitelabs.com/news-update-detail?id=54). 
+ +[^diacetyl-rest]: + Brewing Science Institute: diacetyl "is produced during the fermentation + process, primarily as a byproduct of yeast metabolism… generally considered + a flaw in most beer styles." Source: + [brewingscience.com — Diacetyl: Understanding Its Role as an Off-Flavor in Beer](https://brewingscience.com/diacetyl-understanding-its-role-as-an-off-flavor-in-beer/). + +[^oxford-beer]: + Oxford Companion to Beer via _Beer & Brewing_: "At low to moderate levels, + diacetyl can be perceived as a positive flavor characteristic in some ales + and stouts" but "particularly unwelcome in lager-style beers." Source: + [beerandbrewing.com — diacetyl](https://www.beerandbrewing.com/dictionary/48TDqQibPi). + +[^yn-wiki]: + Wikipedia: "Yamada Nishiki (山田錦) is a short-grain Japanese rice famous + for its use in high-quality sake." Source: + [en.wikipedia.org/wiki/Yamada_Nishiki](https://en.wikipedia.org/wiki/Yamada_Nishiki). + +[^yn-sakestreet]: + Sake Street: Yamadanishiki's large _shinpaku_ allows koji mold to penetrate + to the centre of the rice grain, making it "particularly suitable for + producing good koji." Source: + [sakestreet.com — What is Yamadanishiki?](https://sakestreet.com/en/media/what-is-yamadanishiki). + +[^yn-sakeonline]: + Sake Online: "Steamed rice is added to make koji (rice malt) and yeast + starter, which promotes alcohol fermentation." Source: + [sakeonline.com.au — Types of Sake Rice: Yamada Nishiki](https://sakeonline.com.au/blogs/news/types-of-sake-rice-yamada-nishiki-and-its-characteristics). + +[^anaerobic]: + Pai et al. (2022): "Breweries have recognized oxygen exclusion as a top + priority for the proper packaging and aging of beer… packaged beer is an + anaerobic environment." _Microbiology Spectrum._ Source: + [journals.asm.org](https://journals.asm.org/doi/10.1128/spectrum.02656-22). 
+ +[^packaging]: + Beer Production Processes (oboe.com): Professional packaging lines use + double CO_2 pre-evacuation cycles and closed transfers "so the beer moves in + a completely anaerobic environment." Source: + [oboe.com — Flavor Quality Control](https://oboe.com/learn/beer-production-processes-308lmf/flavor-quality-control-4). diff --git a/pipeline/README.md b/pipeline/README.md index 103d9dd..ee23d72 100644 --- a/pipeline/README.md +++ b/pipeline/README.md @@ -1,34 +1,42 @@ # Biergarten Pipeline -A C++20 command-line pipeline that samples city records from local JSON, enriches each with Wikipedia context, and generates bilingual brewery names and descriptions via a local GGUF model or a deterministic mock. +A C++20 command-line pipeline that samples city records from local JSON, +enriches each with Wikipedia context, and generates bilingual brewery names and +descriptions via a local GGUF model or a deterministic mock. + +> **This pipeline produces AI-generated data.** It is not a source of truth for +> brewing techniques, cultural representation, or local-language accuracy. See +> [ETHICS-AND-KNOWN-ISSUES.md](ETHICS-AND-KNOWN-ISSUES.md) for full +> documentation of limitations, hallucination patterns, and bias. 
--- ## Table of Contents - [How It Fits The Main App](#how-it-fits-the-main-app) -- [Tech Stack](#tech-stack) -- [Build](#build) -- [Model](#model) -- [Run](#run) +- [Quick Start](#quick-start) + - [Build](#build) + - [Model](#model) + - [Run](#run) - [Architecture](#architecture) - [Pipeline Stages](#pipeline-stages) - [Key Components](#key-components) - [Runtime Behaviour](#runtime-behaviour) - [Generated Output](#generated-output) -- [Language Generation Quality](#language-generation-quality) - - [Known Issues](#known-issues) +- [Tech Stack](#tech-stack) - [Tested Hardware](#tested-hardware) +- [Fixture Strategy](#fixture-strategy) - [Repo Layout](#repo-layout) - [Code Tour](#code-tour) -- [Fixture Strategy](#fixture-strategy) - [Next Steps](#next-steps) --- ## How It Fits The Main App -The pipeline is a data ingestion layer. It sits outside the web app runtime and produces seed records the app imports at startup or during a dedicated seed step. +The pipeline is a data ingestion layer. It sits outside the web app runtime and +produces seed records the app imports at startup or during a dedicated seed +step. | Planned app area | Pipeline contribution | | -------------------------------- | ------------------------------------------------------------------ | @@ -39,35 +47,20 @@ The pipeline is a data ingestion layer. It sits outside the web app runtime and --- -## Tech Stack +## Quick Start -- C++20 -- CMake 3.24+ -- Boost.JSON, Boost.ProgramOptions, Boost.DI -- spdlog -- libcurl -- SQLite amalgamation fetched and compiled via CMake FetchContent -- llama.cpp +### Build -The build fetches Boost.DI, spdlog, llama.cpp, and SQLite via CMake. Metal is enabled on Apple Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit is present. 
- -> **Code Style:** Modern C++20 throughout - RAII for ownership, `std::unique_ptr` for injected dependencies, `std::optional` for parse outcomes, `std::span` for read-only views over generated city data, structured bindings in pipeline loops. Formatting follows the Google C++ Style Guide via `.clang-format` with a narrow column limit and two-space indentation. - ---- - -## Build - -Requirements: C++20 compiler, CMake 3.24+, libcurl, Boost (JSON and ProgramOptions). -SQLite is fetched from the upstream amalgamation, so no system SQLite package is required. +Requirements: C++20 compiler, CMake 3.24+, libcurl, Boost (JSON and +ProgramOptions). SQLite is fetched from the upstream amalgamation, so no system +SQLite package is required. ```bash cmake -S . -B build cmake --build build ``` ---- - -## Model +### Model > Skip this step if you only need `--mocked`. @@ -78,18 +71,18 @@ curl -L \ https://huggingface.co/bartowski/google_gemma-4-E4B-it-GGUF/resolve/main/google_gemma-4-E4B-it-Q6_K.gguf?download=true ``` ---- +### Run -## Run - -Run from `build/` so the copied `locations.json` and `prompts/` are available. Each run also writes a fresh dated SQLite file such as `biergarten_seed_2026-04-19T15-30-45.123456Z.sqlite` into the working directory. +Run from `build/` so the copied `locations.json` and `prompts/` are available. +Each run also writes a fresh dated SQLite file such as +`biergarten_seed_2026-04-19T15-30-45.123456Z.sqlite` into the working directory. ```bash ./biergarten-pipeline --mocked ./biergarten-pipeline --model models/google_gemma-4-E4B-it-Q6_K.gguf --temperature 1.0 --top-p 0.95 --top-k 64 --n-ctx 8192 --seed -1 ``` -### CLI Flags +#### CLI Flags | Flag | Purpose | | --------------- | ------------------------------------------------------- | @@ -102,9 +95,12 @@ Run from `build/` so the copied `locations.json` and `prompts/` are available. E | `--seed` | Random seed. Default: `-1` (random at runtime). | | `--help, -h` | Print usage and exit. 
| -`--mocked` and `--model` are mutually exclusive. Omitting both exits with an error before the pipeline starts. Sampling flags are ignored when `--mocked` is set. +`--mocked` and `--model` are mutually exclusive. Omitting both exits with an +error before the pipeline starts. Sampling flags are ignored when `--mocked` is +set. -The post-build step copies `prompts/` into `build/prompts/`. Rebuild after editing `prompts/system.md`. +The post-build step copies `prompts/` into `build/prompts/`. Rebuild after +editing `prompts/system.md`. --- @@ -121,41 +117,58 @@ The post-build step copies `prompts/` into `build/prompts/`. Rebuild after editi | Store | `SqliteExportService` writes each successful brewery into a fresh dated `.sqlite` database with normalized location and brewery tables. | | Log | `spdlog` writes results and warnings to the console. | -If enrichment or generation fails for a city, that city is skipped and the pipeline continues. +If enrichment or generation fails for a city, that city is skipped and the +pipeline continues. ### Key Components -- `src/main.cc` - argument parsing and Boost.DI composition root. -- `JsonLoader` - validates curated location input. -- `WikipediaService` - queries Wikipedia extracts, caches results, returns empty context on failure. -- `LlamaGenerator` - formats prompts for Gemma 4, validates JSON output, retries malformed responses up to three times. If output looks truncated, the retry raises the token budget before trying again. -- `MockGenerator` - stable hash-based output so the same city input always produces the same brewery. -- `SqliteExportService` - creates a dated SQLite file per run and persists each successful brewery into normalized tables. -- Brewery payloads include English and local-language name and description fields. +- `src/main.cc` — argument parsing and Boost.DI composition root. +- `JsonLoader` — validates curated location input. 
+- `WikipediaService` — queries Wikipedia extracts, caches results, returns empty + context on failure. +- `LlamaGenerator` — formats prompts for Gemma 4, validates JSON output, retries + malformed responses up to three times. If output looks truncated, the retry + raises the token budget before trying again. +- `MockGenerator` — stable hash-based output so the same city input always + produces the same brewery. +- `SqliteExportService` — creates a dated SQLite file per run and persists each + successful brewery into normalized tables. +- Brewery payloads include English and local-language name and description + fields. ### Runtime Behaviour -`WikipediaService` queries city, country, and beer-related Wikipedia extracts using its configured lookup, then caches the first successful response per query string. The fetched extract text is included in the prompt as context for generation. +`WikipediaService` queries city, country, and beer-related Wikipedia extracts +using its configured lookup, then caches the first successful response per query +string. The fetched extract text is included in the prompt as context for +generation. -`GetLocationContext()` returns an empty string when the web client is unavailable or when lookup/parsing fails. +`GetLocationContext()` returns an empty string when the web client is +unavailable or when lookup/parsing fails. -`LlamaGenerator` validates model output as structured JSON. The retry path exists as a safety hatch for cases where the reasoning block consumes available token budget and compresses the JSON output space. All runs to date have produced valid output on the first pass; the path is kept for resilience. +`LlamaGenerator` validates model output as structured JSON. The retry path +exists as a safety hatch for cases where the reasoning block consumes available +token budget and compresses the JSON output space. All runs to date have +produced valid output on the first pass; the path is kept for resilience. 
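The retry behaviour described for `LlamaGenerator` could be sketched roughly as follows. This is not the generator's actual code: `Attempt`, `GenerateWithRetry`, the two callables, the doubling of the token budget, and the reading of "up to three times" as three total attempts are all assumptions made for illustration.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical result of one generation attempt; not the pipeline's real type.
struct Attempt {
  std::string text;
  bool truncated = false;  // assumed signal that the token budget ran out
};

// Calls `generate(max_tokens)` up to `max_attempts` times and returns the
// first output accepted by `is_valid_json`. When a rejected attempt looks
// truncated, the budget is doubled before the next try (doubling is an
// assumption; the source only says the budget is raised).
std::optional<std::string> GenerateWithRetry(
    const std::function<Attempt(int)>& generate,
    const std::function<bool(const std::string&)>& is_valid_json,
    int max_tokens, int max_attempts = 3) {
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    Attempt result = generate(max_tokens);
    if (is_valid_json(result.text)) {
      return result.text;
    }
    if (result.truncated) {
      max_tokens *= 2;
    }
  }
  return std::nullopt;  // caller would skip this city and continue
}
```

Returning `std::optional` mirrors the document's note that parse outcomes use `std::optional`, and lets the caller skip a city when every attempt fails.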
-`MockGenerator` uses stable hashes for repeatable output in demos and Storybook runs. +`MockGenerator` uses stable hashes for repeatable output in demos and Storybook +runs. ### Process Flow - Activity Diagram -![An activity diagram](./diagrams/activity-diagram.svg) +![An activity diagram](./diagrams/current/output/activity.svg) ### Architectural Overview - Class Diagram -![A class diagram](./diagrams/class-diagram.svg) +![A class diagram](./diagrams/current/output/class.svg) --- ## Generated Output -Each successful run stores a `GeneratedBrewery` pair with the source location and a `BreweryResult` payload. The same generated records are also written to a fresh SQLite export file named with the current UTC timestamp. +Each successful run stores a `GeneratedBrewery` pair with the source location +and a `BreweryResult` payload. The same generated records are also written to a +fresh SQLite export file named with the current UTC timestamp. | Field | Meaning | | ------------------- | ------------------------------------------ | @@ -164,7 +177,8 @@ Each successful run stores a `GeneratedBrewery` pair with the source location an | `name_local` | Brewery name in the local language. | | `description_local` | Brewery description in the local language. | -The log dump also includes city, country, state or province, ISO subdivision code, latitude, and longitude for each entry. +The log dump also includes city, country, state or province, ISO subdivision +code, latitude, and longitude for each entry. ### Consumer Data Shape @@ -180,80 +194,25 @@ The log dump also includes city, country, state or province, ISO subdivision cod --- -## Language Generation Quality +## Tech Stack -The generation pipeline passes local language codes to the model to retrieve a translated `description_local`. 
+- C++20 +- CMake 3.24+ +- Boost.JSON, Boost.ProgramOptions, Boost.DI +- spdlog +- libcurl +- SQLite amalgamation fetched and compiled via CMake FetchContent +- llama.cpp -Output quality is reliable for high-resource languages such as French, though it may struggle with regional variants and idiomatic phrasing. This can be seen with these data points: +The build fetches Boost.DI, spdlog, llama.cpp, and SQLite via CMake. Metal is +enabled on Apple Silicon; CUDA or HIP/ROCm is detected on Linux when the toolkit +is present. -```json -[ - { - "city": "Kinshasa", - "state_province": "Kinshasa", - "iso3166_2": "CD-KN", - "country": "Democratic Republic of the Congo", - "iso3166_1": "CD", - "latitude": -4.4419, - "longitude": 15.2663, - "local_languages": ["fr-CD", "ln"] - }, - { - "city": "Paris", - "state_province": "Île-de-France", - "iso3166_2": "FR-IDF", - "country": "France", - "iso3166_1": "FR", - "latitude": 48.8566, - "longitude": 2.3522, - "local_languages": ["fr-FR"] - }, - { - "city": "Abidjan", - "state_province": "Abidjan", - "iso3166_2": "CI-AB", - "country": "Ivory Coast", - "iso3166_1": "CI", - "latitude": 5.36, - "longitude": -4.0083, - "local_languages": ["fr-CI"] - }, - { - "city": "Montreal", - "state_province": "Quebec", - "iso3166_2": "CA-QC", - "country": "Canada", - "iso3166_1": "CA", - "latitude": 45.5017, - "longitude": -73.5673, - "local_languages": ["fr-CA"] - }, - { - "city": "Brussels", - "state_province": "Brussels-Capital Region", - "iso3166_2": "BE-BRU", - "country": "Belgium", - "iso3166_1": "BE", - "latitude": 50.8503, - "longitude": 4.3517, - "local_languages": ["fr-BE", "nl-BE"] - } -] -``` - -Output sample: [./out-sample/french-cities.example](out-sample/french-cities.example) - -### Known Issues - -#### Low-Resource Language Hallucination - -For languages such as Welsh (Wales), Maori (Aotearoa/New Zealand), or Sicilian (Sicily, Italy), the model can generate text that looks syntactically plausible but is semantically incoherent. 
This comes from limited training-data coverage rather than prompt engineering. - -#### Proposed Mitigations - -- **Prevention via allowlist:** introduce a high-resource language allowlist. If a location's code is unlisted, skip `description_local` generation and fall back to English. -- **Upstream sanitization:** strip known low-resource language codes from the `locations.json` payload before generation. -- **Downstream flagging:** add a `description_local_confidence` column to the SQLite schema so downstream applications can filter or flag potentially hallucinated text by language tier. +> **Code Style:** Modern C++20 throughout — RAII for ownership, +> `std::unique_ptr` for injected dependencies, `std::optional` for parse +> outcomes, `std::span` for read-only views over generated city data, structured +> bindings in pipeline loops. Formatting follows the Google C++ Style Guide via +> `.clang-format` with a narrow column limit and two-space indentation. --- @@ -283,62 +242,83 @@ For languages such as Welsh (Wales), Maori (Aotearoa/New Zealand), or Sicilian ( --- +## Fixture Strategy + +- `--mocked` for stable fixtures, repeatable screenshots, and Storybook runs. +- `--model` when geographically grounded content matters for demos. +- Keep `locations.json` structured enough to support discovery and future + filtering. +- Treat SQLite output as seed material for the app's brewery domain, not + production data. + +--- + ## Repo Layout -| Path | Purpose | -| ---------------- | ---------------------------------------------- | -| `includes/` | Public headers and shared models. | -| `src/` | Implementation files. | -| `locations.json` | Curated city input copied into the build tree. | -| `prompts/` | System prompt used by the model-backed path. | -| `diagrams/` | Architecture and pipeline diagrams. | +| Path | Purpose | +| ---------------------------- | -------------------------------------------------- | +| `includes/` | Public headers and shared models. 
| +| `src/` | Implementation files. | +| `locations.json` | Curated city input copied into the build tree. | +| `prompts/` | System prompt used by the model-backed path. | +| `diagrams/` | Architecture and pipeline diagrams. | +| `ETHICS-AND-KNOWN-ISSUES.md` | Ethics, bias, hallucination analysis, mitigations. | --- ## Code Tour -- `src/main.cc` - argument parsing and DI composition root. -- `src/biergarten_data_generator/` - orchestration, sampling, logging, and export. -- `src/services/wikipedia/` - enrichment service and cache. -- `src/services/sqlite/` - SQLite export implementation. -- `src/data_generation/llama/` - local inference, prompt loading, output validation. -- `src/data_generation/mock/` - deterministic fallback. - ---- - -## Fixture Strategy - -- `--mocked` for stable fixtures, repeatable screenshots, and Storybook runs. -- `--model` when geographically grounded content matters for demos. -- Keep `locations.json` structured enough to support discovery and future filtering. -- Treat SQLite output as seed material for the app's brewery domain, not production data. +- `src/main.cc` — argument parsing and DI composition root. +- `src/biergarten_data_generator/` — orchestration, sampling, logging, and + export. +- `src/services/wikipedia/` — enrichment service and cache. +- `src/services/sqlite/` — SQLite export implementation. +- `src/data_generation/llama/` — local inference, prompt loading, output + validation. +- `src/data_generation/mock/` — deterministic fallback. --- ## Next Steps -The pipeline currently produces city-aware brewery records and dated SQLite exports. The next passes add additional fixture types so the app can exercise the full brewery domain without live data. +The pipeline currently produces city-aware brewery records and dated SQLite +exports. The next passes add additional fixture types so the app can exercise +the full brewery domain without live data. 
-### Testing _(Very High Importance)_ +### Testing — Very High Priority -- Unit test JSON validation and retry logic against malformed, truncated, and empty model outputs. -- Integration test the enrichment pipeline with missing context, short context, and fake context inputs. -- Adversarial context tests: feed plausible but geographically incorrect Wikipedia extracts and verify the model does not silently blend them with training data. -- Verify bilingual enrichment behaviour when only an English extract is available versus when both extracts are present. -- Confirm the retry path is reachable when the reasoning block consumes available token budget. +- Unit test JSON validation and retry logic against malformed, truncated, and + empty model outputs. +- Integration test the enrichment pipeline with missing context, short context, + and fake context inputs. +- Adversarial context tests: feed plausible but geographically incorrect + Wikipedia extracts and verify the model does not silently blend them with + training data. +- Verify bilingual enrichment behaviour when only an English extract is + available versus when both extracts are present. +- Confirm the retry path is reachable when the reasoning block consumes + available token budget. ### Beer Generation -Generate catalog entries with style, ABV, IBU, color, aroma notes, and food pairing hints. Link beers back to breweries and cities. Keep style coverage wide enough to exercise search, sort, and category filters. +Generate catalog entries with style, ABV, IBU, color, aroma notes, and food +pairing hints. Link beers back to breweries and cities. Keep style coverage wide +enough to exercise search, sort, and category filters. ### User Generation -Generate user profiles with stable names, bios, locale hints, and preference signals. Include stable IDs for downstream fixture joins. Keep output deterministic for screenshots while allowing larger randomized batches. 
+Generate user profiles with stable names, bios, locale hints, and preference +signals. Include stable IDs for downstream fixture joins. Keep output +deterministic for screenshots while allowing larger randomized batches. ### Check-In System -Produce timestamped check-in events between users and breweries. Use a J-curve activity profile - a small set of users accounts for most check-ins, the rest appear occasionally. Add bursty behaviour around weekends and travel periods. +Produce timestamped check-in events between users and breweries. Use a J-curve +activity profile — a small set of users accounts for most check-ins, the rest +appear occasionally. Add bursty behaviour around weekends and travel periods. ### Beer Ratings -Generate rating events with a strong positive skew and a long tail of lower scores. Avoid uniform distributions. Attach timestamps and user IDs so the app can compute averages, trends, and per-style comparisons. +Generate rating events with a strong positive skew and a long tail of lower +scores. Avoid uniform distributions. Attach timestamps and user IDs so the app +can compute averages, trends, and per-style comparisons. 
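The Beer Ratings item above asks for ratings that cluster near the top of the scale with a long tail of lower scores, rather than a uniform spread. A minimal sketch of one way to get that shape, assuming integer scores from 1 to 5 and a fixed weight table: `kRatingWeights`, `SampleRatings`, and `MeanRating` are hypothetical names for illustration and do not appear in this diff, and the weights themselves are made-up numbers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical weights for scores 1..5: most of the mass sits on 4 and 5,
// with a thin tail of low scores. Illustrative values, not pipeline config.
inline constexpr std::array<double, 5> kRatingWeights = {1.0, 2.0, 5.0, 12.0,
                                                         10.0};

// Draws `count` integer ratings in [1, 5] from the weighted distribution.
// A caller-supplied seed makes a run repeatable for a given standard
// library implementation.
std::vector<int> SampleRatings(std::size_t count, unsigned seed) {
  std::mt19937 rng(seed);
  std::discrete_distribution<int> score_index(kRatingWeights.begin(),
                                              kRatingWeights.end());
  std::vector<int> ratings;
  ratings.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    const int score = score_index(rng) + 1;  // index 0..4 -> score 1..5
    assert(score >= 1 && score <= 5);
    ratings.push_back(score);
  }
  return ratings;
}

// Arithmetic mean of the sampled ratings; returns 0.0 for an empty batch.
double MeanRating(const std::vector<int>& ratings) {
  if (ratings.empty()) return 0.0;
  double sum = 0.0;
  for (const int r : ratings) sum += r;
  return sum / static_cast<double>(ratings.size());
}
```

With these weights the expected mean is 118/30, roughly 3.9, so a call such as `SampleRatings(10000, 42u)` should produce a batch whose `MeanRating` lands well above the midpoint of 3. A weight table keeps the skew explicit and easy to tune per persona later; a continuous distribution rounded to the scale would be an equally reasonable choice.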
diff --git a/pipeline/diagrams/activity-diagram.svg b/pipeline/diagrams/activity-diagram.svg deleted file mode 100644 index 0ec4f31..0000000 --- a/pipeline/diagrams/activity-diagram.svg +++ /dev/null @@ -1 +0,0 @@ -The Biergarten Data PipelineThe Biergarten Data PipelineValidatesmocked,model,temperature,top-p, etc.ParseArguments(argc, argv)spdlog::error usage infonoAre arguments valid?yesInit CurlGlobalState & LlamaBackendStateBinds CURLWebClient, WikipediaService,Gemma4JinjaPromptFormatter, andeither MockGenerator or LlamaGeneratordi::make_injector(...)injector.create<BiergartenDataGenerator>()BiergartenDataGenerator::Run()Return 0QueryCitiesWithCountries()Lookup failed?yesnospdlog::warn "context lookup failed"Store EnrichedCity{Location, region_context}Remaining citiesFor each sampled Location?DoneGenerateBreweries(enriched_cities)Exception thrown?yesnospdlog::warn "brewery generation failed"Store GeneratedBreweryspdlog::info dump of generated JSON fieldsLogResults()JsonLoader::LoadLocations("locations.json")std::ranges::sample(all_locations, 50)GetLocationContext(loc)FetchExtract("City, Country")FetchExtract("beer in Country")Backed by CURLWebClient::GetFetchExtract("beer in City")Generator ModeMockGeneratorLlamaGeneratorDeterministicHash(location)Select from kBreweryAdjectives, kBreweryNouns,kBreweryDescriptionsFormat BreweryResultPrepareRegionContext(region_context)LoadBrewerySystemPrompt("prompts/system.md")Format user_promptAttempt = 0Uses Gemma4JinjaPromptFormatter,llama_tokenize, and llama_sampler_sampleInfer(system_prompt, user_prompt, max_tokens, kBreweryJsonGrammar)ValidateBreweryJson(raw, brewery)Is JSON Valid?yesnomax_tokens += 700yesError == "incomplete JSON"Update user_prompt with validation errorAttempt++Attempt < 3?yesStill Invalid?yesnothrow std::runtime_errorReturn BreweryResultRemaining citiesFor each EnrichedCity?Donemain.ccBiergartenDataGeneratorJsonLoaderWikipediaServiceDataGenerator \ No newline at end of file diff --git 
a/pipeline/diagrams/biergarten-weizen-theme.puml b/pipeline/diagrams/biergarten-weizen-theme.puml new file mode 100644 index 0000000..b31305d --- /dev/null +++ b/pipeline/diagrams/biergarten-weizen-theme.puml @@ -0,0 +1,34 @@ +skinparam shadowing false +skinparam backgroundColor #FCFCF7 +skinparam defaultFontName "DM Sans" +skinparam defaultFontColor #14180C +skinparam titleFontName "Volkhov" +skinparam titleFontColor #14180C +skinparam ArrowColor #656F33 +skinparam NoteBackgroundColor #DBEEDD +skinparam NoteFontColor #14180C +skinparam NoteBorderColor #4A5837 +skinparam SwimlaneBorderColor #4A5837 +skinparam SwimlaneBorderThickness 1 +skinparam activityStartColor #EBECE3 +skinparam activityEndColor #4A5837 +skinparam activityStopColor #4A5837 +skinparam ActivityBackgroundColor #EBECE3 +skinparam ActivityBorderColor #4A5837 +skinparam ActivityDiamondBackgroundColor #CBD2B5 +skinparam ActivityDiamondBorderColor #4A5837 +skinparam packageStyle rectangle +skinparam packageBackgroundColor #F1F3EA +skinparam packageBorderColor #4A5837 +skinparam packageFontColor #14180C +skinparam classBackgroundColor #EBECE3 +skinparam classBorderColor #4A5837 +skinparam classFontColor #14180C +skinparam classAttributeFontColor #3F4724 +skinparam classStereotypeFontColor #4A5837 +skinparam interfaceBackgroundColor #DBEEDD +skinparam interfaceBorderColor #4A5837 +skinparam interfaceFontColor #14180C +skinparam enumBackgroundColor #E4E6D8 +skinparam enumBorderColor #4A5837 +skinparam enumFontColor #14180C diff --git a/pipeline/diagrams/class-diagram.svg b/pipeline/diagrams/class-diagram.svg deleted file mode 100644 index b7cb713..0000000 --- a/pipeline/diagrams/class-diagram.svg +++ /dev/null @@ -1 +0,0 @@ -The Biergarten Data Pipeline - Class DiagramThe Biergarten Data Pipeline - Class DiagramBiergartenDataGeneratorcontext_service_ : std::unique_ptr<IEnrichmentService>generator_ : std::unique_ptr<DataGenerator>generated_breweries_ : std::vector<GeneratedBrewery>Run() : 
boolQueryCitiesWithCountries() : std::vector<Location>GenerateBreweries(cities : std::span<const EnrichedCity>) : voidLogResults() : void«interface»IEnrichmentServiceGetLocationContext(loc : const Location&) : std::stringWikipediaServiceclient_ : std::unique_ptr<WebClient>extract_cache_ : std::unordered_map<std::string, std::string>GetLocationContext(loc : const Location&) : std::stringFetchExtract(query : std::string_view) : std::string«interface»WebClientGet(url : const std::string&) : std::stringUrlEncode(value : const std::string&) : std::stringCURLWebClientGet(url : const std::string&) : std::stringUrlEncode(value : const std::string&) : std::string«interface»DataGeneratorGenerateBrewery(location : const Location&, region_context : const std::string&) : BreweryResultGenerateUser(locale : const std::string&) : UserResultMockGeneratorGenerateBrewery(...) : BreweryResultGenerateUser(...) : UserResultDeterministicHash(location : const Location&) : size_tLlamaGeneratormodel_ : ModelHandlecontext_ : ContextHandleprompt_formatter_ : std::unique_ptr<IPromptFormatter>rng_ : std::mt19937GenerateBrewery(...) : BreweryResultGenerateUser(...) : UserResultLoad(model_path : const std::string&) : voidInfer(...) : std::stringInferFormatted(...) : std::stringLoadBrewerySystemPrompt(...) 
: std::string«interface»IPromptFormatterFormat(system_prompt : std::string_view, user_prompt : std::string_view) : std::stringGemma4JinjaPromptFormatterFormat(system_prompt : std::string_view, user_prompt : std::string_view) : std::stringJsonLoaderLoadLocations(filepath : const std::filesystem::path&) : std::vector<Location>ownsownsimplementsownsimplementsimplementsimplementsusesimplementsuses \ No newline at end of file diff --git a/pipeline/diagrams/activity-diagram.puml b/pipeline/diagrams/current/activity.puml similarity index 100% rename from pipeline/diagrams/activity-diagram.puml rename to pipeline/diagrams/current/activity.puml diff --git a/pipeline/diagrams/class-diagram.puml b/pipeline/diagrams/current/class.puml similarity index 100% rename from pipeline/diagrams/class-diagram.puml rename to pipeline/diagrams/current/class.puml diff --git a/pipeline/diagrams/current/output/activity.svg b/pipeline/diagrams/current/output/activity.svg new file mode 100644 index 0000000..9b63f89 --- /dev/null +++ b/pipeline/diagrams/current/output/activity.svg @@ -0,0 +1 @@ +The Biergarten Data Pipeline (Streaming Architecture)The Biergarten Data Pipeline (Streaming Architecture)ParseArguments(argc, argv)spdlog::error usage infonoAre arguments valid?yesInit CurlGlobalState & LlamaBackendStatedi::make_injector(...)injector.create<std::unique_ptr<BiergartenDataGenerator>>()BiergartenDataGenerator::Run()Return 0Initialize SQLite exportQueryCitiesWithCountries()Store EnrichedCity{Location, region_context}Remaining citiesFor each sampled Location?DoneGenerateBreweries(enriched_cities)Generation successful?yesnoData loss is prevented per-record.The pipeline continues running.spdlog::warn "Failed to stream record to SQLite export"spdlog::warn "Generation failed, skipping..."GetUtcTimestamp() from SystemDateTimeProviderBuilds a fresh biergarten_seed_<UTC datetime>.sqlite filenameAppends a numeric suffix if the timestamp already existsOpens DB ConnectionExecutes Schema DDLBegins 
TransactionInitialize()ProcessRecord(GeneratedBrewery)Location in cache?yesnoReuse location_idInsert Location & Cache IDInsert Brewery (FK: location_id)yesException caught during insert?noCommits TransactionCloses Database ConnectionFinalize()JsonLoader::LoadLocations("locations.json")std::ranges::sample(all_locations, 50)GetLocationContext(loc)FetchExtracts(City, Country, Beer)Generator ModeMockGeneratorLlamaGeneratorDeterministicHash & FormatPrepareRegionContextLoadBrewerySystemPrompt("prompts/system.md")Infer(system_prompt, user_prompt, max_tokens, kBreweryJsonGrammar)ValidateBreweryJson(raw, brewery)Is JSON Valid?yesnoAttempt++Attempt < 3?yesRemaining citiesFor each EnrichedCity?Donemain.ccBiergartenDataGeneratorSqliteExportServiceJsonLoaderWikipediaServiceDataGenerator \ No newline at end of file diff --git a/pipeline/diagrams/current/output/class.svg b/pipeline/diagrams/current/output/class.svg new file mode 100644 index 0000000..a1426e2 --- /dev/null +++ b/pipeline/diagrams/current/output/class.svg @@ -0,0 +1 @@ +The Biergarten Data Pipeline - Class DiagramThe Biergarten Data Pipeline - Class DiagramBiergartenDataGeneratorcontext_service_ : std::unique_ptr<IEnrichmentService>generator_ : std::unique_ptr<DataGenerator>exporter_ : std::unique_ptr<IExportService>generated_breweries_ : std::vector<GeneratedBrewery>Run() : boolQueryCitiesWithCountries() : std::vector<Location>GenerateBreweries(cities : std::span<const EnrichedCity>) : voidLogResults() : void«interface»IEnrichmentServiceGetLocationContext(loc : const Location&) : std::stringWikipediaServiceclient_ : std::unique_ptr<WebClient>extract_cache_ : std::unordered_map<std::string, std::string>GetLocationContext(loc : const Location&) : std::stringFetchExtract(query : std::string_view) : std::string«interface»WebClientGet(url : const std::string&) : std::stringUrlEncode(value : const std::string&) : std::stringCURLWebClientGet(url : const std::string&) : std::stringUrlEncode(value : const std::string&) : 
std::string«interface»DataGeneratorGenerateBrewery(location : const Location&, region_context : const std::string&) : BreweryResultGenerateUser(locale : const std::string&) : UserResultMockGeneratorGenerateBrewery(...) : BreweryResultGenerateUser(...) : UserResultDeterministicHash(location : const Location&) : size_tLlamaGeneratormodel_ : ModelHandlecontext_ : ContextHandleprompt_formatter_ : std::unique_ptr<IPromptFormatter>rng_ : std::mt19937GenerateBrewery(...) : BreweryResultGenerateUser(...) : UserResultLoad(model_path : const std::string&) : voidInfer(...) : std::stringInferFormatted(...) : std::stringLoadBrewerySystemPrompt(...) : std::string«interface»IPromptFormatterFormat(system_prompt : std::string_view, user_prompt : std::string_view) : std::stringGemma4JinjaPromptFormatterFormat(system_prompt : std::string_view, user_prompt : std::string_view) : std::stringJsonLoaderLoadLocations(filepath : const std::filesystem::path&) : std::vector<Location>«interface»IExportServiceInitialize() : voidProcessRecord(brewery : const GeneratedBrewery&) : voidFinalize() : voidSqliteExportServicedate_time_provider_ : std::unique_ptr<IDateTimeProvider>run_timestamp_utc_ : std::stringdatabase_path_ : std::filesystem::pathdb_handle_ : sqlite3*insert_location_stmt_ : sqlite3_stmt*insert_brewery_stmt_ : sqlite3_stmt*transaction_open_ : boollocation_cache_ : std::unordered_map<std::string, sqlite3_int64>Initialize() : voidProcessRecord(brewery : const GeneratedBrewery&) : voidFinalize() : voidInitializeSchema() : void«interface»IDateTimeProviderGetUtcTimestamp() : std::stringSystemDateTimeProviderGetUtcTimestamp() : std::stringownsownsownsimplementsownsimplementsimplementsimplementsusesimplementsusesimplementsownsimplements \ No newline at end of file diff --git a/pipeline/diagrams/future-activity-diagram.puml b/pipeline/diagrams/future-activity-diagram.puml deleted file mode 100644 index d66fe99..0000000 --- a/pipeline/diagrams/future-activity-diagram.puml +++ /dev/null @@ 
-1,262 +0,0 @@ -@startuml biergarten_activity -skinparam defaultFontName "DM Sans" -skinparam defaultFontSize 13 -skinparam titleFontName "Volkhov" -skinparam titleFontSize 20 -skinparam backgroundColor #FCFCF7 -skinparam defaultFontColor #14180C -skinparam titleFontColor #14180C -skinparam ArrowColor #656F33 -skinparam activityStartColor #EBECE3 -skinparam activityEndColor #4A5837 -skinparam activityStopColor #4A5837 -skinparam ActivityBackgroundColor #EBECE3 -skinparam ActivityBorderColor #4A5837 -skinparam ActivityDiamondBackgroundColor #CBD2B5 -skinparam ActivityDiamondBorderColor #4A5837 -skinparam NoteBackgroundColor #DBEEDD -skinparam NoteFontColor #14180C -skinparam NoteBorderColor #4A5837 -skinparam SwimlaneBorderColor #4A5837 -skinparam SwimlaneBorderThickness 1 -skinparam monochrome reverse - - -title The Biergarten Data Pipeline — Activity Diagram - -|Main| -start -:ParseArguments(argc, argv); -if (Invalid args?) then (yes) - :spdlog::error; - stop -else (no) -endif -:Init CurlGlobalState & LlamaBackendState; -:Build DI injector; - - -:Initialize SqliteExportService; -note right - Opens SQLite connection. - Begins a single transaction - covering all five fixture types. -end note - -:Create BoundedChannel log_ch; -:Spawn Log Worker thread; -note right - Log worker drains log_ch for the - entire pipeline lifetime. - All workers emit LogEntry structs - via PipelineLogger — never spdlog directly. -end note - -:BiergartenPipelineOrchestrator::Run(); -|BiergartenPipelineOrchestrator::Run()| -:JsonLoader::LoadLocations("locations.json"); -:JsonLoader::LoadBeerStyles("beer-styles.json"); -:JsonLoader::LoadPersonas("personas.json"); -:JsonLoader::LoadNamesByCountry("names-by-country.json"); - -:EnrichmentService::PreWarmBeerStyleCache(beer_styles); -note right - Beer styles do not need location context. - Wikipedia summaries for the entire palette are - fetched and cached globally at startup. 
-end note - -:EnrichmentService::PreWarmPersonaCache(personas); -note right - Persona descriptions do not need location context. - All persona lookups are resolved and cached - globally at startup. -end note - - -' ═══════════════════════════════════════════ -' PHASE 0 — USER GENERATION -' ═══════════════════════════════════════════ -|Orchestrator| -:RunUserPhase(sampled_locations); -:Create BoundedChannels\n(loc_ch, llm_ch, exp_ch); - -fork - |Orchestrator| - :Loop: Send Locations → loc_ch; - :Close loc_ch; -fork again - |LLM Worker| - while (loc_ch has items?) is (yes) - :Receive Location; - - :IPersonaSelectionStrategy::SelectPersona(\n personas_palette_); - note right - Guaranteed cache hit from startup. - Returns a Persona struct carrying - style_affinities, abv_range, - ibu_preference, checkin_weight. - end note - - :NamesByCountry::SampleName(\n location.iso3166_1); - note right - Deterministic lookup — no LLM involved. - Name selected from pre-keyed table - and passed into the generation prompt. - end note - - :GenerateUser(location, persona, sampled_name)\nvia DataGenerator; - note right - LLM receives: Location fields + persona - description + sampled name. Generates - bio and preference signals grounded - in locale and persona. - end note - - :PipelineLogger::Log(Info, UserGeneration,\n city, user_id, "llm"); - :Send GeneratedUser → llm_ch; - endwhile (no) - :Close llm_ch; -fork again - |SQLite Worker| - while (llm_ch has items?) 
is (yes) - :Receive GeneratedUser; - :ProcessUser(user) → sqlite3_int64; - :PipelineLogger::Log(Info, UserGeneration,\n city, user_id, "sqlite"); - :Append → user_pool_; - endwhile (no) -end fork - -|Orchestrator| -:Join LLM Worker, SQLite Worker; - -' ═══════════════════════════════════════════ -' PHASE 1 — BREWERY & BEER GENERATION -' ═══════════════════════════════════════════ -:RunBreweryAndBeerPhase(sampled_locations); -:Create BoundedChannels\n(loc_ch, llm_ch, exp_ch); - -fork - |Orchestrator| - :Loop: Send Locations → loc_ch; - :Close loc_ch; -fork again - |Enrichment Workers (xN)| - while (loc_ch has items?) is (yes) - :Receive Location; - :GetLocationContext(location,\nBreweryContextStrategy); - :PipelineLogger::Log(Info,\n BreweryAndBeerGeneration,\n city, nullopt, "enrichment"); - :Send EnrichedCity → llm_ch; - endwhile (no) - |Orchestrator| - :Join Enrichment Workers; - :Close llm_ch; -fork again - |LLM Worker| - while (llm_ch has items?) is (yes) - :Receive EnrichedCity; - - :GenerateBrewery(location, context)\nvia DataGenerator; - - :IBeerSelectionStrategy::SelectStyles(\n brewery, beer_style_palette_); - - while (For each selected BeerStyle?) is (remaining) - :GetStyleContextFromCache(style); - note right - Guaranteed cache hit from startup. - end note - :GenerateBeer(brewery, style_context)\nvia DataGenerator; - :Attach GeneratedBeer to Brewery bundle; - endwhile (done) - - :PipelineLogger::Log(Info,\n BreweryAndBeerGeneration,\n city, brewery_id, "llm"); - :Send BreweryWithBeers Bundle → exp_ch; - endwhile (no) - :Close exp_ch; -fork again - |SQLite Worker| - while (exp_ch has items?) is (yes) - :Receive BreweryWithBeers Bundle; - :ProcessBrewery(brewery) → brewery_id; - :Append → brewery_pool_; - - while (For each beer in bundle?) 
is (remaining) - :Set beer.brewery_id = brewery_id; - :ProcessBeer(beer) → sqlite3_int64; - :Append → beer_pool_; - endwhile (done) - - :PipelineLogger::Log(Info,\n BreweryAndBeerGeneration,\n city, brewery_id, "sqlite"); - endwhile (no) -end fork - -|Orchestrator| -:Join LLM Worker, SQLite Worker; -note right - Both brewery_pool_ and beer_pool_ - are now completely populated. -end note - -' ═══════════════════════════════════════════ -' PHASE 2 — CHECKIN GENERATION -' ═══════════════════════════════════════════ -:RunCheckinPhase(); -:ICheckinDistributionStrategy::\nAssignActivityWeights(user_pool_); -note right - Weights seeded from each user's - persona.checkin_weight. J-curve profile - emerges from persona distribution. -end note - -while (For each GeneratedUser in user_pool_?) is (remaining) - :CheckinsForUser(user, brewery_pool_.size()); - while (For each checkin index?) is (remaining) - :TimestampFor(user, index); - :Select brewery from brewery_pool_; - :GenerateCheckin(user, brewery, timestamp)\nvia DataGenerator; - :ProcessCheckin(checkin) → sqlite3_int64; - :PipelineLogger::Log(Info, CheckinGeneration,\n nullopt, checkin_id, "sqlite"); - :Append → checkin_pool_; - endwhile (done) -endwhile (done) - -' ═══════════════════════════════════════════ -' PHASE 3 — RATING GENERATION -' ═══════════════════════════════════════════ -:RunRatingPhase(); -note right - Beer selection biased by - user.persona.style_affinities and abv_range. - Rating skew modulated per persona. -end note - -while (For each GeneratedCheckin in checkin_pool_?) is (remaining) - :Match brewery_id → select beer from beer_pool_\n(same brewery_id, biased by persona affinities); - if (Beer exists for brewery?) 
then (yes) - :GenerateRating(user, beer, checkin_id)\nvia DataGenerator; - :ProcessRating(rating); - :PipelineLogger::Log(Info, RatingGeneration,\n nullopt, rating_id, "sqlite"); - else (no) - :PipelineLogger::Log(Warn, RatingGeneration,\n nullopt, brewery_id, "sqlite"); - :Skip — brewery has no beers; - endif -endwhile (done) - -' ═══════════════════════════════════════════ -' TEARDOWN -' ═══════════════════════════════════════════ -|Main| -:Finalize SqliteExportService; -note right - COMMIT covers all five fixture types. -end note -:Close log_ch; -:Join Log Worker; -note right - Drain guarantees no LogEntry is - dropped at shutdown. -end note -:spdlog::info "Pipeline complete in X ms"; -stop - -@enduml diff --git a/pipeline/diagrams/future_possible_activity.svg b/pipeline/diagrams/future_possible_activity.svg deleted file mode 100644 index 7676296..0000000 --- a/pipeline/diagrams/future_possible_activity.svg +++ /dev/null @@ -1 +0,0 @@ -The Biergarten Data Pipeline — Activity DiagramThe Biergarten Data Pipeline — Activity DiagramParseArguments(argc, argv)spdlog::erroryesInvalid args?noInit CurlGlobalState & LlamaBackendStateBuild DI injectorJsonLoader::LoadLocations("locations.json")JsonLoader::LoadBeerStyles("beer-styles.json")NEW: Beer styles do not need location context.Wikipedia summaries for the entire palette arefetched and cached globally at startup.EnrichmentService::PreWarmBeerStyleCache(beer_styles)Opens SQLite connection.Begins a single transactioncovering all five fixture types.Initialize SqliteExportServiceBiergartenPipelineOrchestrator::Run()COMMIT covers all five fixture types.Finalize SqliteExportServicespdlog::info "Pipeline complete in X ms"RunUserPhase(sampled_locations)Create BoundedChannels(user_llm_ch, user_exp_ch)Loop: Send Locations → user_llm_chClose user_llm_chJoin LLM Worker, SQLite WorkerRunBreweryAndBeerPhase(sampled_locations)Create BoundedChannels(loc_ch, llm_ch, exp_ch)Loop: Send Locations → loc_chClose loc_chJoin Enrichment 
WorkersClose llm_chBoth brewery_pool_ and beer_pool_are now completely populated.Join LLM Worker, SQLite WorkerRunCheckinPhase()ICheckinDistributionStrategy::AssignActivityWeights(user_pool_)CheckinsForUser(user, brewery_pool_.size())TimestampFor(user, index)Select brewery from brewery_pool_GenerateCheckin(user, brewery, timestamp)via DataGeneratorProcessCheckin(checkin) → sqlite3_int64Append → checkin_pool_remainingFor each checkin index?doneremainingFor each GeneratedUser in user_pool_?doneRunRatingPhase()Match brewery_id → select beerfrom beer_pool_ (same brewery_id)Beer exists for brewery?yesnoGenerateRating(user, beer, checkin_id)via DataGeneratorProcessRating(rating)Skip — brewery has no beersremainingFor each GeneratedCheckin in checkin_pool_?doneReceive LocationGenerateUser(location)via DataGeneratorSend GeneratedUser → user_exp_chyesuser_llm_ch has items?noClose user_exp_chReceive EnrichedCityGenerateBrewery(location, context)via DataGeneratorIBeerSelectionStrategy::SelectStyles(brewery, beer_style_palette_)Guaranteed cache hit from startup.GetStyleContextFromCache(style)GenerateBeer(brewery, style_context)via DataGeneratorAttach GeneratedBeer to Brewery bundleremainingFor each selected BeerStyle?doneThe next generation of a brewery isentirely dependent on the currentbrewery and its beers completing.Send BreweryWithBeers Bundle → exp_chyesllm_ch has items?noClose exp_chReceive GeneratedUserProcessUser(user) → sqlite3_int64Append → user_pool_yesuser_exp_ch has items?noReceive BreweryWithBeers BundleProcessBrewery(brewery) → brewery_idAppend → brewery_pool_Set beer.brewery_id = brewery_idProcessBeer(beer) → sqlite3_int64Append → beer_pool_remainingFor each beer in bundle?doneyesexp_ch has items?noReceive LocationGetLocationContext(location,BreweryContextStrategy)Send EnrichedCity → llm_chyesloc_ch has items?noMainOrchestratorLLM WorkerSQLite WorkerEnrichment Workers (xN) \ No newline at end of file diff --git 
a/pipeline/diagrams/future_possible_architecture.svg b/pipeline/diagrams/future_possible_architecture.svg deleted file mode 100644 index bf052f3..0000000 --- a/pipeline/diagrams/future_possible_architecture.svg +++ /dev/null @@ -1 +0,0 @@ -The Biergarten Data Pipeline — ArchitectureThe Biergarten Data Pipeline — ArchitectureDomain: Value Objects & ContractsDomain PolicyInfrastructure: EnrichmentInfrastructure: GenerationInfrastructure: Pipeline ChannelInfrastructure: ExportOrchestrationLocationcity : std::stringstate_province : std::stringiso3166_2 : std::stringcountry : std::stringiso3166_1 : std::stringlocal_languages : std::vector<std::string>latitude : doublelongitude : doubleLocationContexttext : std::stringcompleteness : Completenesschar_count : size_t«enum» CompletenessFullPartialAbsentEnrichedCitylocation : Locationcontext : LocationContextBeerStylename : std::stringdescription : std::stringmin_abv : floatmax_abv : floatmin_ibu : intmax_ibu : intLoaded once at startup frombeer-styles.json via JsonLoader.Passed as std::span<const BeerStyle>to IBeerSelectionStrategy.Generator receives the selectedstyle as a parameter — it neverreads the palette directly.BreweryResultname_en : std::stringdescription_en : std::stringname_local : std::stringdescription_local : std::stringBeerResultname_en : std::stringdescription_en : std::stringname_local : std::stringdescription_local : std::stringstyle : std::stringabv : floatibu : intUserResultusername : std::stringbio : std::stringactivity_weight : floatactivity_weight assigned byICheckinDistributionStrategyafter the full user pool iscommitted. 
Drives J-curvecheckin volume per user.CheckinResultchecked_in_at : std::stringnote : std::stringRatingResultscore : floatnote : std::stringGeneratedBrewerybrewery_id : sqlite3_int64location : Locationbrewery : BreweryResultcontext_completeness : LocationContext::Completenessgenerated_at : std::stringGeneratedBeerbeer_id : sqlite3_int64brewery_id : sqlite3_int64location : Locationbeer : BeerResultgenerated_at : std::stringGeneratedUseruser_id : sqlite3_int64location : Locationuser : UserResultgenerated_at : std::stringuser_id populated after SQLiteinsert. Live FK carried in poolfor checkin and rating references.GeneratedCheckincheckin_id : sqlite3_int64user_id : sqlite3_int64brewery_id : sqlite3_int64checkin : CheckinResultgenerated_at : std::stringGeneratedRatinguser_id : sqlite3_int64beer_id : sqlite3_int64checkin_id : sqlite3_int64rating : RatingResultgenerated_at : std::string«interface»IContextStrategyQueriesFor(loc : const Location&) : std::vector<std::string>MaxContextChars() : size_tBreweryContextStrategyQueriesFor(loc : const Location&) : std::vector<std::string>MaxContextChars() : size_tBeerContextStrategyQueriesFor(loc : const Location&) : std::vector<std::string>MaxContextChars() : size_t«interface»ISamplingStrategySample(locations : const std::vector<Location>&) : std::vector<Location>UniformSamplingStrategysample_size_ : size_tSample(locations : const std::vector<Location>&) : std::vector<Location>«interface»IBeerSelectionStrategySelectStyles(brewery : const GeneratedBrewery&,palette : std::span<const BeerStyle>) : std::vector<BeerStyle>Decides how many beers a brewerygets and which styles are selected.Count distribution and stylededuplication logic live here,not in the orchestrator or generator.RandomBeerSelectionStrategyrng_ : std::mt19937min_beers_ : size_tmax_beers_ : size_tSelectStyles(brewery : const GeneratedBrewery&,palette : std::span<const BeerStyle>) : std::vector<BeerStyle>Draws a random count in [min, max].Samples without replacement 
[SVG text residue omitted: flattened labels from the rendered class diagram (strategy interfaces, enrichment, generation, channel, export, orchestrator, and loader classes)]
\ No newline at end of file
diff --git a/pipeline/diagrams/planned/activity.puml b/pipeline/diagrams/planned/activity.puml
new file mode 100644
index 0000000..6f92560
--- /dev/null
+++ b/pipeline/diagrams/planned/activity.puml
@@ -0,0 +1,360 @@
+@startuml biergarten_activity
+!include ../biergarten-weizen-theme.puml
+skinparam defaultFontSize 13
+skinparam titleFontSize 20
+
+title The Biergarten Data Pipeline — Activity Diagram
+
+|Main|
+start
+:ParseArguments(argc, argv);
+if (Invalid args?) then (yes)
+  :spdlog::error;
+  stop
+else (no)
+endif
+:Init CurlGlobalState & LlamaBackendState;
+:Build DI injector;
+
+:Initialize SqliteExportService;
+note right
+  Opens SQLite connection.
+  (Transactions are now managed
+  per-phase via batching).
+end note
+
+:Create BoundedChannel log_ch;
+:Spawn Log Worker thread;
+note right
+  Log worker drains log_ch for the
+  entire pipeline lifetime.
+  All workers emit LogEntry structs
+  via PipelineLogger -- never spdlog directly.
+end note
+
+:BiergartenPipelineOrchestrator::Run();
+|BiergartenPipelineOrchestrator::Run()|
+
+fork
+  :JsonLoader::LoadBeerStyles("beer-styles.json");
+  :EnrichmentService::PreWarmBeerStyleCache(beer_styles);
+fork again
+  :JsonLoader::LoadLocations("locations.json");
+  :EnrichmentService::PreWarmLocationCache(sampled_locations);
+end fork
+fork
+  :JsonLoader::LoadNamesByCountry("names-by-country.json");
+fork again
+  :JsonLoader::LoadPersonas("personas.json");
+end fork
+
+' ═══════════════════════════════════════════
+' PHASE 0 — USER GENERATION
+' ═══════════════════════════════════════════
+|Orchestrator|
+:RunUserPhase(sampled_locations);
+:Create BoundedChannels\n(loc_ch, exp_ch);
+
+fork
+  |Orchestrator|
+  :Loop: Send Locations -> loc_ch;
+  :Close loc_ch;
+  note right
+    Producer closes loc_ch.
+    LLM Worker while loop
+    terminates on empty + closed.
+  end note
+fork again
+  |LLM Worker|
+  while (loc_ch has items?) is (yes)
+    :Receive Location;
+
+    :GetLocationContextFromCache(location);
+    note right
+      Guaranteed cache hit from startup.
+    end note
+
+    :IPersonaSelectionStrategy::SelectPersona(\n personas_palette_);
+    note right
+      Guaranteed cache hit from startup.
+      Returns a Persona struct carrying
+      style_affinities, abv_range,
+      ibu_preference, checkin_weight.
+    end note
+
+    :NamesByCountry::SampleName(\n location.iso3166_1);
+    note right
+      Deterministic lookup -- no LLM involved.
+      Name selected from pre-keyed table
+      and passed into the generation prompt.
+    end note
+
+    :GenerateUser(enriched_city, persona, sampled_name)\nvia DataGenerator;
+    note right
+      LLM receives: EnrichedCity context + persona
+      description + sampled name. Generates
+      bio and preference signals grounded
+      in locale and persona.
+    end note
+
+    :PipelineLogger::Log(Info, UserGeneration,\n city, user_id, "llm");
+    :Send GeneratedUser -> exp_ch;
+  endwhile (no)
+  :Close exp_ch;
+  note right
+    Producer closes exp_ch.
+    SQLite Worker while loop
+    terminates on empty + closed.
+  end note
+fork again
+  |SQLite Worker|
+  :BEGIN TRANSACTION;
+  while (exp_ch has items?) is (yes)
+    :Receive GeneratedUser;
+    :ProcessUser(user);
+    :PipelineLogger::Log(Info, UserGeneration,\n city, user_id, "sqlite");
+    :Append -> user_pool_;
+    if (Batch size reached?) then (yes)
+      :COMMIT & BEGIN;
+    else (no)
+    endif
+  endwhile (no)
+  :COMMIT (Final);
+end fork
+
+|Orchestrator|
+:Join LLM Worker, SQLite Worker;
+
+' ═══════════════════════════════════════════
+' PHASE 1a — BREWERY GENERATION
+' ═══════════════════════════════════════════
+:RunBreweryPhase(sampled_locations);
+:Create BoundedChannels\n(loc_ch, exp_ch);
+
+fork
+  |Orchestrator|
+  :Loop: Sample User from user_pool_
+  and pair with Location;
+  :Send BreweryTask(Location, User) -> loc_ch;
+  :Close loc_ch;
+fork again
+  |LLM Worker|
+  while (loc_ch has items?) is (yes)
+    :Receive BreweryTask(Location, User);
+
+    :GetLocationContextFromCache(task.location);
+    note right
+      Guaranteed cache hit from startup.
+    end note
+
+    :GenerateBrewery(enriched_city, context, task.user)\nvia DataGenerator;
+    note right
+      KV cache stays warm.
+      Brewery is linked to the sampled owner_user_id.
+    end note
+    :PipelineLogger::Log(Info,\n BreweryGeneration,\n city, brewery_id, "llm");
+    :Send GeneratedBrewery -> exp_ch;
+  endwhile (no)
+  :Close exp_ch;
+fork again
+  |SQLite Worker|
+  :BEGIN TRANSACTION;
+  while (exp_ch has items?) is (yes)
+    :Receive GeneratedBrewery;
+    :ProcessBrewery(brewery);
+    :PipelineLogger::Log(Info,\n BreweryGeneration,\n city, brewery_id, "sqlite");
+    :Append -> brewery_pool_;
+    if (Batch size reached?) then (yes)
+      :COMMIT & BEGIN;
+    else (no)
+    endif
+  endwhile (no)
+  :COMMIT (Final);
+end fork
+
+|Orchestrator|
+:Join LLM Worker, SQLite Worker;
+note right
+  brewery_pool_ is now fully populated.
+  Phase 1b may begin.
+end note
+
+' ═══════════════════════════════════════════
+' PHASE 1b — BEER GENERATION
+' ═══════════════════════════════════════════
+:RunBeerPhase();
+:Create BoundedChannels\n(brew_ch, exp_ch);
+
+fork
+  |Orchestrator|
+  :Loop: Send Breweries -> brew_ch;
+  :Close brew_ch;
+fork again
+  |LLM Worker|
+  while (brew_ch has items?) is (yes)
+    :Receive GeneratedBrewery;
+    :IBeerSelectionStrategy::SelectStyles(\n brewery, beer_style_palette_);
+
+    while (For each selected BeerStyle?) is (remaining)
+      :GetStyleContextFromCache(style);
+      note right
+        Guaranteed cache hit from startup.
+        KV cache stays warm across all
+        beer generations -- system prompt
+        does not change within this phase.
+      end note
+      :GenerateBeer(brewery, style_context)\nvia DataGenerator;
+      :Attach GeneratedBeer to bundle;
+    endwhile (done)
+
+    :PipelineLogger::Log(Info,\n BeerGeneration,\n city, brewery_id, "llm");
+    :Send BeersBundle -> exp_ch;
+  endwhile (no)
+  :Close exp_ch;
+fork again
+  |SQLite Worker|
+  :BEGIN TRANSACTION;
+  while (exp_ch has items?) is (yes)
+    :Receive BeersBundle;
+    while (For each beer in bundle?) is (remaining)
+      :Set beer.brewery_id from bundle;
+      :ProcessBeer(beer);
+      :Append -> beer_pool_;
+    endwhile (done)
+    :PipelineLogger::Log(Info,\n BeerGeneration,\n city, brewery_id, "sqlite");
+    if (Batch size reached?) then (yes)
+      :COMMIT & BEGIN;
+    else (no)
+    endif
+  endwhile (no)
+  :COMMIT (Final);
+end fork
+
+|Orchestrator|
+:Join LLM Worker, SQLite Worker;
+note right
+  Both brewery_pool_ and beer_pool_
+  are now completely populated.
+  Checkin and Follow phases may
+  now run in parallel.
+end note
+
+' ═══════════════════════════════════════════
+' PHASE 2 — CHECKIN + FOLLOW GENERATION
+' (parallel — both depend only on user_pool_
+' and brewery_pool_ being fully populated)
+' ═══════════════════════════════════════════
+fork
+  |Orchestrator|
+  :RunCheckinPhase();
+  :ICheckinDistributionStrategy::\nAssignActivityWeights(user_pool_);
+  note right
+    Weights seeded from each user's
+    persona.checkin_weight. J-curve profile
+    emerges from persona distribution.
+  end note
+
+  :BEGIN TRANSACTION;
+  while (For each GeneratedUser in user_pool_?) is (remaining)
+    :CheckinsForUser(user, brewery_pool_.size());
+    while (For each checkin index?) is (remaining)
+      :TimestampFor(user, index);
+      :Select brewery from brewery_pool_;
+      :GenerateCheckin(user, brewery, timestamp)\nvia DataGenerator;
+      :ProcessCheckin(checkin);
+      :PipelineLogger::Log(Info, CheckinGeneration,\n nullopt, checkin_id, "sqlite");
+      :Append -> checkin_pool_;
+      if (Batch size reached?) then (yes)
+        :COMMIT & BEGIN;
+      else (no)
+      endif
+    endwhile (done)
+  endwhile (done)
+  :COMMIT (Final);
+
+fork again
+  |Orchestrator|
+  :RunFollowPhase();
+  :IFollowGenerationStrategy::\nAssignFollowWeights(user_pool_);
+  note right
+    For RandomFollowStrategy, weights
+    are uniform. For ActivityWeightedFollowStrategy,
+    weights derived from user.activity_weight
+    so high-activity users attract more followers.
+  end note
+
+  :BEGIN TRANSACTION;
+  :IFollowGenerationStrategy::\nGenerateFollows(user_pool_);
+  note right
+    Self-follow constraint (follower_id != followed_id)
+    enforced here and at the DB schema level.
+  end note
+  while (For each GeneratedFollow?) is (remaining)
+    :ProcessFollow(follow);
+    :PipelineLogger::Log(Info, FollowGeneration,\n nullopt, follower_id, "sqlite");
+    :Append -> follow_pool_;
+    if (Batch size reached?) then (yes)
+      :COMMIT & BEGIN;
+    else (no)
+    endif
+  endwhile (done)
+  :COMMIT (Final);
+
+end fork
+
+|Orchestrator|
+:Join CheckinPhase, FollowPhase;
+note right
+  checkin_pool_ and follow_pool_
+  are now fully populated.
+  Rating phase may begin.
+end note
+
+' ═══════════════════════════════════════════
+' PHASE 3 — RATING GENERATION
+' ═══════════════════════════════════════════
+:RunRatingPhase();
+note right
+  Beer selection biased by
+  user.persona.style_affinities and abv_range.
+  Rating skew modulated per persona.
+end note
+
+:BEGIN TRANSACTION;
+while (For each GeneratedCheckin in checkin_pool_?) is (remaining)
+  :Match brewery_id, select beer from beer_pool_\n(same brewery_id, biased by persona affinities);
+  if (Beer exists for brewery?) then (yes)
+    :GenerateRating(user, beer, checkin_id)\nvia DataGenerator;
+    :ProcessRating(rating);
+    :PipelineLogger::Log(Info, RatingGeneration,\n nullopt, rating_id, "sqlite");
+    if (Batch size reached?) then (yes)
+      :COMMIT & BEGIN;
+    else (no)
+    endif
+  else (no)
+    :PipelineLogger::Log(Warn, RatingGeneration,\n nullopt, brewery_id, "sqlite");
+    :Skip -- brewery has no beers;
+  endif
+endwhile (done)
+:COMMIT (Final);
+
+' ═══════════════════════════════════════════
+' TEARDOWN
+' ═══════════════════════════════════════════
+|Orchestrator|
+:Finalize SqliteExportService;
+note right
+  Safely closes the DB connection.
+end note
+:Close log_ch;
+
+|Main|
+:spdlog::info "Pipeline complete in X ms";
+:Join Log Worker;
+note right
+  Drain guarantees no LogEntry is
+  dropped at shutdown.
+end note
+stop
+
+@enduml
diff --git a/pipeline/diagrams/future-class-diagram.puml b/pipeline/diagrams/planned/class.puml
similarity index 71%
rename from pipeline/diagrams/future-class-diagram.puml
rename to pipeline/diagrams/planned/class.puml
index 1716c82..3de4ae5 100644
--- a/pipeline/diagrams/future-class-diagram.puml
+++ b/pipeline/diagrams/planned/class.puml
@@ -1,51 +1,14 @@
-@startuml future_possible_architecture
+@startuml
 
 ' ==========================================
 ' CONFIGURATION & STYLING
 ' ==========================================
-left to right direction
-skinparam linetype ortho
+!include ../biergarten-weizen-theme.puml
+skinparam classAttributeFontSize 9
+skinparam defaultFontSize 25
+skinparam titleFontSize 30
 
-' --- Typography ---
-skinparam defaultFontName "DM Sans"
-skinparam defaultFontSize 14
-skinparam titleFontName "Volkhov"
-skinparam titleFontSize 20
-
-' --- Global Colors ---
-skinparam backgroundColor #FCFCF7
-skinparam defaultFontColor #14180C
-skinparam titleFontColor #14180C
-skinparam ArrowColor #656F33
-
-skinparam class {
-  BackgroundColor #EBECE3
-  HeaderBackgroundColor #CBD2B5
-  BorderColor #4A5837
-  ArrowColor #656F33
-  FontColor #14180C
-}
-
-skinparam package {
-  BackgroundColor #DBEEDD
-  BorderColor #4A5837
-  FontColor #14180C
-}
-
-skinparam note {
-  BackgroundColor #DBEEDD
-  BorderColor #4A5837
-  FontColor #14180C
-}
-
-skinparam monochrome reverse
-
-title The Biergarten Data Pipeline — Planned Architecture
-
-' ==========================================
-' DOMAIN MODELS
-' ==========================================
-package "Domain Models" {
+package "Domain: Models" {
 
   class Location {
     + city : std::string
@@ -62,8 +25,9 @@ package "Domain Models" {
     + text : std::string
     + completeness : Completeness
     + char_count : size_t
-    --
-    <> Completeness
+  }
+
+  enum Completeness {
     Full
     Partial
     Absent
@@ -116,46 +80,69 @@ package "Domain Models" {
     + note : std::string
   }
 
+  class GenerationMetadata {
+    + generation_id : uint64_t
+    + generated_time : std::string
+    + context_provided : bool
+    + generated_with : std::string
+  }
+
   class GeneratedBrewery {
-    + brewery_id : sqlite3_int64
+    + brewery_id : uint64_t
     + location : Location
     + brewery : BreweryResult
     + context_completeness : LocationContext::Completeness
-    + generated_at : std::string
+    + metadata : GenerationMetadata
   }
 
   class GeneratedBeer {
-    + beer_id : sqlite3_int64
-    + brewery_id : sqlite3_int64
+    + beer_id : uint64_t
+    + brewery_id : uint64_t
     + location : Location
     + style : BeerStyle
     + beer : BeerResult
-    + generated_at : std::string
+    + metadata : GenerationMetadata
   }
 
   class GeneratedUser {
-    + user_id : sqlite3_int64
+    + user_id : uint64_t
     + location : Location
     + user : UserResult
-    + generated_at : std::string
+    + metadata : GenerationMetadata
   }
 
   class GeneratedCheckin {
-    + checkin_id : sqlite3_int64
-    + user_id : sqlite3_int64
-    + brewery_id : sqlite3_int64
+    + checkin_id : uint64_t
+    + user_id : uint64_t
+    + brewery_id : uint64_t
     + checkin : CheckinResult
-    + generated_at : std::string
+    + metadata : GenerationMetadata
   }
 
   class GeneratedRating {
-    + user_id : sqlite3_int64
-    + beer_id : sqlite3_int64
-    + checkin_id : sqlite3_int64
+    + user_id : uint64_t
+    + beer_id : uint64_t
+    + checkin_id : uint64_t
     + rating : RatingResult
-    + generated_at : std::string
+    + metadata : GenerationMetadata
   }
 
+  class GeneratedFollow {
+    + follower_id : uint64_t
+    + followed_id : uint64_t
+    + metadata : GenerationMetadata
+  }
+
+  class UserPersona {
+    + name: std::string
+    + description: std::string
+    + style_affinities: std::vector
+  }
+
+  LocationContext *-- Completeness
+}
+
+package "Domain: Application Configuration"{
   class SamplingOptions {
     + temperature : float = 1.0F
     + top_p : float = 0.95F
@@ -184,70 +171,9 @@ package "Domain Models" {
   ApplicationOptions *-- GeneratorOptions
   ApplicationOptions *-- PipelineOptions
   GeneratorOptions *-- SamplingOptions
-  LocationContext *-- Completeness
 }
 
-
-' ==========================================
-' LOGGING
-' ==========================================
-package "Logging" {
-
-  enum LogLevel {
-    Debug
-    Info
-    Warn
-    Error
-  }
-
-  enum PipelinePhase {
-    Startup
-    UserGeneration
-    BreweryAndBeerGeneration
-    CheckinGeneration
-    RatingGeneration
-    Teardown
-  }
-
-  class LogEntry {
-    + timestamp : std::chrono::system_clock::time_point
-    + level : LogLevel
-    + phase : PipelinePhase
-    + message : std::string
-    + city : std::optional
-    + entity_id : std::optional
-    + worker : std::optional
-  }
-
-  interface Logger <> {
-    + Log(level, phase, message,\n city, entity_id, worker) : void
-  }
-
-  class PipelineLogger {
-    - log_ch_ : BoundedChannel&
-    + Log(level, phase, message,\n city, entity_id, worker) : void
-  }
-
-  class LogWorker {
-    - log_ch_ : BoundedChannel&
-    + Run() : void
-    - FormatTimestamp(tp) : std::string
-    - ToSpdlogLevel(level) : spdlog::level::level_enum
-    - ToString(phase) : std::string
-  }
-
-  ' --- Logging Relationships ---
-  LogEntry *-- LogLevel
-  LogEntry *-- PipelinePhase
-  PipelineLogger ..> LogEntry : emits
-  LogWorker ..> LogEntry : consumes
-}
-
-
-' ==========================================
-' DOMAIN POLICY
-' ==========================================
-package "Domain Policy" {
+package "Domain: Policy" {
 
   interface ContextStrategy <> {
     + QueriesFor(loc : const Location&) : std::vector
@@ -297,13 +223,103 @@ package "Domain Policy" {
     + TimestampFor(user : const GeneratedUser&,\n index : size_t) : std::string
   }
 
+  class RandomCheckinStrategy {
+    - rng_ : std::mt19937
+    - min_checkins_ : size_t
+    - max_checkins_ : size_t
+    + AssignActivityWeights(users : std::vector&) : void
+    + CheckinsForUser(user : const GeneratedUser&,\n brewery_count : size_t) : size_t
+    + TimestampFor(user : const GeneratedUser&,\n index : size_t) : std::string
+  }
+
+  interface FollowGenerationStrategy <> {
+    + GenerateFollows(users : const std::vector&) : std::vector
+  }
+
+  class RandomFollowStrategy {
+    - rng_ : std::mt19937
+    - min_follows_ : size_t
+    - max_follows_ : size_t
+    + GenerateFollows(users : const std::vector&) : std::vector
+  }
+
+  class ActivityWeightedFollowStrategy {
+    - rng_ : std::mt19937
+    - min_follows_ : size_t
+    - max_follows_ : size_t
+    + GenerateFollows(users : const std::vector&) : std::vector
+  }
 }
 
+package "Infrastructure: Logging" {
+  enum LogLevel {
+    Debug
+    Info
+    Warn
+    Error
+  }
 
-' ==========================================
-' ORCHESTRATION
-' ==========================================
-package "Orchestration" {
+
+  enum PipelinePhase {
+    Startup
+    UserGeneration
+    BreweryAndBeerGeneration
+    CheckinGeneration
+    RatingGeneration
+    FollowGeneration
+    Teardown
+  }
+
+  class LogEntry {
+    + timestamp : std::chrono::system_clock::time_point
+    + level : LogLevel
+    + phase : PipelinePhase
+    + message : std::string
+    + city : std::optional
+    + entity_id : std::optional
+    + worker : std::optional
+  }
+
+  interface Logger <> {
+    + Log(level, phase, message,\n city, entity_id, worker) : void
+  }
+
+  class PipelineLogger {
+    - log_ch_ : BoundedChannel&
+    + Log(level, phase, message,\n city, entity_id, worker) : void
+  }
+
+  class LogWorker {
+    - log_ch_ : BoundedChannel&
+    + Run() : void
+    - FormatTimestamp(tp) : std::string
+    - ToSpdlogLevel(level) : spdlog::level::level_enum
+    - ToString(phase) : std::string
+  }
+
+  ' --- Logging Relationships ---
+  LogEntry *-- LogLevel
+  LogEntry *-- PipelinePhase
+  PipelineLogger ..> LogEntry : emits
+  LogWorker ..> LogEntry : consumes
+}
+
+package "Infrastructure: Pipeline Channel" {
+
+  class "BoundedChannel" as BoundedChannel {
+    - queue_ : std::queue
+    - mutex_ : std::mutex
+    - not_full_ : std::condition_variable
+    - not_empty_ : std::condition_variable
+    - capacity_ : size_t
+    - closed_ : bool
+    + Send(item : T) : void
+    + Receive() : std::optional
+    + Close() : void
+  }
+
+}
+
+package "Infrastructure: Data Preloading" {
 
   interface DataPreloader <> {
     + LoadLocations(filepath : const std::filesystem::path&) : std::vector
@@ -312,38 +328,6 @@ package "Orchestration" {
     + LoadNamesByCountry(filepath : const std::filesystem::path&) : NamesByCountry
   }
 
-  class BiergartenPipelineOrchestrator {
-    - preloader_ : std::unique_ptr
-    - enrichment_service_ : std::unique_ptr
-    - generator_ : std::unique_ptr
-    - logger_ : std::unique_ptr
-    - exporter_ : std::unique_ptr
-    - brewery_context_strategy_ : std::unique_ptr
-    - sampling_strategy_ : std::unique_ptr
-    - beer_selection_strategy_ : std::unique_ptr
-    - checkin_strategy_ : std::unique_ptr
-    - beer_style_palette_ : std::vector
-    - options_ : ApplicationOptions
-    --
-    - user_pool_ : std::vector
-    - brewery_pool_ : std::vector
-    - beer_pool_ : std::vector
-    - checkin_pool_ : std::vector
-    --
-    + Run() : bool
-    - RunUserPhase(locations : const std::vector&) : void
-    - RunBreweryAndBeerPhase(locations : const std::vector&) : void
-    - RunCheckinPhase() : void
-    - RunRatingPhase() : void
-  }
-}
-
-
-' ==========================================
-' INFRASTRUCTURE: PRELOADING
-' ==========================================
-package "Infrastructure: Preloading" {
-
   class JsonLoader {
     + LoadLocations(filepath : const std::filesystem::path&) : std::vector
     + LoadBeerStyles(filepath : const std::filesystem::path&) : std::vector
@@ -353,10 +337,6 @@ package "Infrastructure: Preloading" {
 
 }
 
-
-' ==========================================
-' INFRASTRUCTURE: ENRICHMENT
-' ==========================================
 package "Infrastructure: Enrichment" {
 
   interface EnrichmentService <> {
@@ -382,18 +362,14 @@ package "Infrastructure: Enrichment" {
 
 }
 
-
-' ==========================================
-' INFRASTRUCTURE: GENERATION
-' ==========================================
-package "Infrastructure: Generation" {
+package "Infrastructure: Data Generation" {
 
   interface DataGenerator <> {
     + GenerateBrewery(location : const Location&,\n context : const LocationContext&) : BreweryResult
-    + GenerateBeer(brewery_id : sqlite3_int64,\n location : const Location&,\n context : const LocationContext&,\n style : const BeerStyle&) : BeerResult
+    + GenerateBeer(brewery_id : uint64_t,\n location : const Location&,\n context : const LocationContext&,\n style : const BeerStyle&) : BeerResult
     + GenerateUser(location : const Location&) : UserResult
     + GenerateCheckin(user : const GeneratedUser&,\n brewery : const GeneratedBrewery&,\n timestamp : const std::string&) : CheckinResult
-    + GenerateRating(user : const GeneratedUser&,\n beer : const GeneratedBeer&,\n checkin_id : sqlite3_int64) : RatingResult
+    + GenerateRating(user : const GeneratedUser&,\n beer : const GeneratedBeer&,\n checkin_id : uint64_t) : RatingResult
   }
 
   class MockGenerator {
@@ -432,39 +408,16 @@ package "Infrastructure: Generation" {
 
 }
 
-
-' ==========================================
-' INFRASTRUCTURE: PIPELINE CHANNEL
-' ==========================================
-package "Infrastructure: Pipeline Channel" {
-
-  class "BoundedChannel" as BoundedChannel {
-    - queue_ : std::queue
-    - mutex_ : std::mutex
-    - not_full_ : std::condition_variable
-    - not_empty_ : std::condition_variable
-    - capacity_ : size_t
-    - closed_ : bool
-    + Send(item : T) : void
-    + Receive() : std::optional
-    + Close() : void
-  }
-
-}
-
-
-' ==========================================
-' INFRASTRUCTURE: EXPORT
-' ==========================================
-package "Infrastructure: Export" {
+package "Infrastructure: Data Export" {
 
   interface ExportService <> {
     + Initialize() : void
-    + ProcessBrewery(brewery : const GeneratedBrewery&) : sqlite3_int64
-    + ProcessBeer(beer : const GeneratedBeer&) : sqlite3_int64
-    + ProcessUser(user : const GeneratedUser&) : sqlite3_int64
-    + ProcessCheckin(checkin : const GeneratedCheckin&) : sqlite3_int64
+    + ProcessBrewery(brewery : const GeneratedBrewery&) : uint64_t
+    + ProcessBeer(beer : const GeneratedBeer&) : uint64_t
+    + ProcessUser(user : const GeneratedUser&) : uint64_t
+    + ProcessCheckin(checkin : const GeneratedCheckin&) : uint64_t
     + ProcessRating(rating : const GeneratedRating&) : void
+    + ProcessFollow(follow : const GeneratedFollow&) : void
     + Finalize() : void
   }
 
@@ -477,15 +430,17 @@ package "Infrastructure: Export" {
     - insert_user_stmt_ : SqliteStatementHandle
     - insert_checkin_stmt_ : SqliteStatementHandle
     - insert_rating_stmt_ : SqliteStatementHandle
+    - insert_follow_stmt_ : SqliteStatementHandle
     - transaction_open_ : bool
-    - location_cache_ : std::unordered_map
-    - brewery_cache_ : std::unordered_map
+    - location_cache_ : std::unordered_map
+    - brewery_cache_ : std::unordered_map
     + Initialize() : void
-    + ProcessBrewery(brewery : const GeneratedBrewery&) : sqlite3_int64
-    + ProcessBeer(beer : const GeneratedBeer&) : sqlite3_int64
-    + ProcessUser(user : const GeneratedUser&) : sqlite3_int64
-    + ProcessCheckin(checkin : const GeneratedCheckin&) : sqlite3_int64
+    + ProcessBrewery(brewery : const GeneratedBrewery&) : uint64_t
+    + ProcessBeer(beer : const GeneratedBeer&) : uint64_t
+    + ProcessUser(user : const GeneratedUser&) : uint64_t
+    + ProcessCheckin(checkin : const GeneratedCheckin&) : uint64_t
     + ProcessRating(rating : const GeneratedRating&) : void
+    + ProcessFollow(follow : const GeneratedFollow&) : void
     + Finalize() : void
     - InitializeSchema() : void
     - PrepareStatements() : void
@@ -504,9 +459,34 @@ package "Infrastructure: Export" {
 
 }
 
-' ==========================================
-' GLOBAL RELATIONSHIPS
-' ==========================================
+
+class BiergartenPipelineOrchestrator {
+  - preloader_ : std::unique_ptr
+  - enrichment_service_ : std::unique_ptr
+  - generator_ : std::unique_ptr
+  - logger_ : std::unique_ptr
+  - exporter_ : std::unique_ptr
+  - brewery_context_strategy_ : std::unique_ptr
+  - sampling_strategy_ : std::unique_ptr
+  - beer_selection_strategy_ : std::unique_ptr
+  - checkin_strategy_ : std::unique_ptr
+  - follow_strategy_ : std::unique_ptr
+  - beer_style_palette_ : std::vector
+  - options_ : ApplicationOptions
+  --
+  - user_pool_ : std::vector
+  - brewery_pool_ : std::vector
+  - beer_pool_ : std::vector
+  - checkin_pool_ : std::vector
+  - follow_pool_ : std::vector
+  --
+  + Run() : bool
+  - RunUserPhase(locations : const std::vector&) : void
+  - RunBreweryAndBeerPhase(locations : const std::vector&) : void
+  - RunCheckinPhase() : void
+  - RunRatingPhase() : void
+  - RunFollowPhase() : void
+}
 
 ' --- Orchestration Aggregations (Services & Strategies) ---
 BiergartenPipelineOrchestrator *-- DataPreloader
@@ -514,6 +494,7 @@ BiergartenPipelineOrchestrator *-- EnrichmentService
 BiergartenPipelineOrchestrator *-- DataGenerator
 BiergartenPipelineOrchestrator *-- ExportService
 BiergartenPipelineOrchestrator *-- CheckinDistributionStrategy
+BiergartenPipelineOrchestrator *-- FollowGenerationStrategy
 BiergartenPipelineOrchestrator *-- SamplingStrategy
 BiergartenPipelineOrchestrator *-- BeerSelectionStrategy
 BiergartenPipelineOrchestrator *-- ApplicationOptions
@@ -524,6 +505,7 @@ BiergartenPipelineOrchestrator *-- "0..*" GeneratedUser : user_pool_
 BiergartenPipelineOrchestrator *-- "0..*" GeneratedBrewery : brewery_pool_
 BiergartenPipelineOrchestrator *-- "0..*" GeneratedBeer : beer_pool_
 BiergartenPipelineOrchestrator *-- "0..*" GeneratedCheckin : checkin_pool_
+BiergartenPipelineOrchestrator *-- "0..*" GeneratedFollow : follow_pool_
 
 ' --- Interfaces & Implementations ---
 DataPreloader <|.. JsonLoader
@@ -533,6 +515,9 @@ ContextStrategy <|.. BeerContextStrategy
 SamplingStrategy <|.. UniformSamplingStrategy
 BeerSelectionStrategy <|.. RandomBeerSelectionStrategy
 CheckinDistributionStrategy <|.. JCurveCheckinStrategy
+CheckinDistributionStrategy <|.. RandomCheckinStrategy
+FollowGenerationStrategy <|.. RandomFollowStrategy
+FollowGenerationStrategy <|.. ActivityWeightedFollowStrategy
 EnrichmentService <|.. WikipediaService
 WebClient <|.. CURLWebClient
 DataGenerator <|.. MockGenerator
@@ -557,12 +542,18 @@ EnrichedCity *-- Location
 EnrichedCity *-- LocationContext
 GeneratedBrewery *-- Location
 GeneratedBrewery *-- BreweryResult
+GeneratedBrewery *-- GenerationMetadata
 GeneratedBeer *-- Location
 GeneratedBeer *-- BeerStyle
 GeneratedBeer *-- BeerResult
+GeneratedBeer *-- GenerationMetadata
 GeneratedUser *-- Location
 GeneratedUser *-- UserResult
+GeneratedUser *-- GenerationMetadata
 GeneratedCheckin *-- CheckinResult
+GeneratedCheckin *-- GenerationMetadata
 GeneratedRating *-- RatingResult
+GeneratedRating *-- GenerationMetadata
+GeneratedFollow *-- GenerationMetadata
 
 @enduml
diff --git a/pipeline/diagrams/planned/output/biergarten_activity.svg b/pipeline/diagrams/planned/output/biergarten_activity.svg
new file mode 100644
index 0000000..0571a83
--- /dev/null
+++ b/pipeline/diagrams/planned/output/biergarten_activity.svg
@@ -0,0 +1 @@
+[SVG text residue omitted: flattened labels from the rendered activity diagram]
\ No newline at end of file
diff --git a/pipeline/diagrams/planned/output/class.svg b/pipeline/diagrams/planned/output/class.svg
new file mode 100644
index 0000000..559ce29
--- /dev/null
+++ b/pipeline/diagrams/planned/output/class.svg
@@ -0,0 +1 @@
+[SVG text residue omitted: flattened labels from the rendered class diagram]
: float = 0.95Ftop_k : uint32_t = 64n_ctx : uint32_t = 8192seed : int = -1GeneratorOptionsmodel_path : std::filesystem::pathuse_mocked : bool = falsesampling : SamplingOptionsPipelineOptionsoutput_path : std::filesystem::pathlog_path : std::filesystem::pathApplicationOptionsgenerator : GeneratorOptionspipeline : PipelineOptions«interface»ContextStrategyQueriesFor(loc : const Location&) : std::vector<std::string>MaxContextChars() : size_tBreweryContextStrategyQueriesFor(loc : const Location&) : std::vector<std::string>MaxContextChars() : size_tBeerContextStrategyQueriesFor(loc : const Location&) : std::vector<std::string>MaxContextChars() : size_t«interface»SamplingStrategySample(locations : const std::vector<Location>&) : std::vector<Location>UniformSamplingStrategysample_size_ : size_tSample(locations : const std::vector<Location>&) : std::vector<Location>«interface»BeerSelectionStrategySelectStyles(brewery : const GeneratedBrewery&,palette : std::span<const BeerStyle>) : std::vector<BeerStyle>RandomBeerSelectionStrategyrng_ : std::mt19937min_beers_ : size_tmax_beers_ : size_tSelectStyles(brewery : const GeneratedBrewery&,palette : std::span<const BeerStyle>) : std::vector<BeerStyle>«interface»CheckinDistributionStrategyAssignActivityWeights(users : std::vector<GeneratedUser>&) : voidCheckinsForUser(user : const GeneratedUser&,brewery_count : size_t) : size_tTimestampFor(user : const GeneratedUser&,index : size_t) : std::stringJCurveCheckinStrategyrng_ : std::mt19937AssignActivityWeights(users : std::vector<GeneratedUser>&) : voidCheckinsForUser(user : const GeneratedUser&,brewery_count : size_t) : size_tTimestampFor(user : const GeneratedUser&,index : size_t) : std::stringLogLevelDebugInfoWarnErrorPipelinePhaseStartupUserGenerationBreweryAndBeerGenerationCheckinGenerationRatingGenerationTeardownLogEntrytimestamp : std::chrono::system_clock::time_pointlevel : LogLevelphase : PipelinePhasemessage : std::stringcity : std::optional<std::string>entity_id : 
std::optional<std::string>worker : std::optional<std::string>«interface»LoggerLog(level, phase, message,city, entity_id, worker) : voidPipelineLoggerlog_ch_ : BoundedChannel<LogEntry>&Log(level, phase, message,city, entity_id, worker) : voidLogWorkerlog_ch_ : BoundedChannel<LogEntry>&Run() : voidFormatTimestamp(tp) : std::stringToSpdlogLevel(level) : spdlog::level::level_enumToString(phase) : std::stringBoundedChannelTqueue_ : std::queue<T>mutex_ : std::mutexnot_full_ : std::condition_variablenot_empty_ : std::condition_variablecapacity_ : size_tclosed_ : boolSend(item : T) : voidReceive() : std::optional<T>Close() : void«interface»DataPreloaderLoadLocations(filepath : const std::filesystem::path&) : std::vector<Location>LoadBeerStyles(filepath : const std::filesystem::path&) : std::vector<BeerStyle>LoadPersonas(filepath : const std::filesystem::path&) : std::vector<Persona>LoadNamesByCountry(filepath : const std::filesystem::path&) : NamesByCountryJsonLoaderLoadLocations(filepath : const std::filesystem::path&) : std::vector<Location>LoadBeerStyles(filepath : const std::filesystem::path&) : std::vector<BeerStyle>LoadPersonas(filepath : const std::filesystem::path&) : std::vector<Persona>LoadNamesByCountry(filepath : const std::filesystem::path&) : NamesByCountry«interface»EnrichmentServiceGetLocationContext(loc : const Location&,strategy : const ContextStrategy&) : LocationContextWikipediaServiceclient_ : std::unique_ptr<WebClient>extract_cache_ : std::unordered_map<std::string, std::string>GetLocationContext(loc : const Location&,strategy : const ContextStrategy&) : LocationContextFetchExtract(query : std::string_view) : std::string«interface»WebClientGet(url : const std::string&) : std::stringUrlEncode(value : const std::string&) : std::stringCURLWebClientGet(url : const std::string&) : std::stringUrlEncode(value : const std::string&) : std::string«interface»DataGeneratorGenerateBrewery(location : const Location&,context : const LocationContext&) : 
BreweryResultGenerateBeer(brewery_id : sqlite3_int64,location : const Location&,context : const LocationContext&,style : const BeerStyle&) : BeerResultGenerateUser(location : const Location&) : UserResultGenerateCheckin(user : const GeneratedUser&,brewery : const GeneratedBrewery&,timestamp : const std::string&) : CheckinResultGenerateRating(user : const GeneratedUser&,beer : const GeneratedBeer&,checkin_id : sqlite3_int64) : RatingResultMockGeneratorGenerateBrewery(...) : BreweryResultGenerateBeer(...) : BeerResultGenerateUser(...) : UserResultGenerateCheckin(...) : CheckinResultGenerateRating(...) : RatingResultDeterministicHash(location : const Location&) : size_tLlamaGeneratormodel_ : ModelHandlecontext_ : ContextHandleprompt_formatter_ : std::unique_ptr<PromptFormatter>rng_ : std::mt19937GenerateBrewery(...) : BreweryResultGenerateBeer(...) : BeerResultGenerateUser(...) : UserResultGenerateCheckin(...) : CheckinResultGenerateRating(...) : RatingResultLoad(opts : const GeneratorOptions&) : voidInfer(system_prompt, user_prompt,max_tokens, grammar) : std::stringValidateModelArchitecture() : void«interface»PromptFormatterFormat(system_prompt : std::string_view,user_prompt : std::string_view) : std::stringExpectedArchitecture() : std::string_viewGemma4JinjaPromptFormatterFormat(...) 
: std::stringExpectedArchitecture() : std::string_view«interface»ExportServiceInitialize() : voidProcessBrewery(brewery : const GeneratedBrewery&) : sqlite3_int64ProcessBeer(beer : const GeneratedBeer&) : sqlite3_int64ProcessUser(user : const GeneratedUser&) : sqlite3_int64ProcessCheckin(checkin : const GeneratedCheckin&) : sqlite3_int64ProcessRating(rating : const GeneratedRating&) : voidFinalize() : voidSqliteExportServicedate_time_provider_ : std::unique_ptr<DateTimeProvider>db_handle_ : SqliteDatabaseHandleinsert_location_stmt_ : SqliteStatementHandleinsert_brewery_stmt_ : SqliteStatementHandleinsert_beer_stmt_ : SqliteStatementHandleinsert_user_stmt_ : SqliteStatementHandleinsert_checkin_stmt_ : SqliteStatementHandleinsert_rating_stmt_ : SqliteStatementHandletransaction_open_ : boollocation_cache_ : std::unordered_map<std::string, sqlite3_int64>brewery_cache_ : std::unordered_map<std::string, sqlite3_int64>Initialize() : voidProcessBrewery(brewery : const GeneratedBrewery&) : sqlite3_int64ProcessBeer(beer : const GeneratedBeer&) : sqlite3_int64ProcessUser(user : const GeneratedUser&) : sqlite3_int64ProcessCheckin(checkin : const GeneratedCheckin&) : sqlite3_int64ProcessRating(rating : const GeneratedRating&) : voidFinalize() : voidInitializeSchema() : voidPrepareStatements() : voidRollbackAndCloseNoThrow() : voidFinalizeStatements() : void«interface»DateTimeProviderGetUtcTimestamp() : std::stringSystemDateTimeProviderGetUtcTimestamp() : std::stringBiergartenPipelineOrchestratorpreloader_ : std::unique_ptr<DataPreloader>enrichment_service_ : std::unique_ptr<EnrichmentService>generator_ : std::unique_ptr<DataGenerator>logger_ : std::unique_ptr<Logger>exporter_ : std::unique_ptr<ExportService>brewery_context_strategy_ : std::unique_ptr<ContextStrategy>sampling_strategy_ : std::unique_ptr<SamplingStrategy>beer_selection_strategy_ : std::unique_ptr<BeerSelectionStrategy>checkin_strategy_ : std::unique_ptr<CheckinDistributionStrategy>beer_style_palette_ : 
std::vector<BeerStyle>options_ : ApplicationOptionsuser_pool_ : std::vector<GeneratedUser>brewery_pool_ : std::vector<GeneratedBrewery>beer_pool_ : std::vector<GeneratedBeer>checkin_pool_ : std::vector<GeneratedCheckin>Run() : boolRunUserPhase(locations : const std::vector<Location>&) : voidRunBreweryAndBeerPhase(locations : const std::vector<Location>&) : voidRunCheckinPhase() : voidRunRatingPhase() : voidemitsconsumesuser_pool_0..*brewery_pool_0..*beer_pool_0..*checkin_pool_0..*logs todrains from \ No newline at end of file