update readme

2026-07-16 17:47:22 +00:00 · 2026-04-08 11:27:37 -04:00
parent b1ac3a6068
commit 3c7e74e3c1
1 changed files with 56 additions and 378 deletions
--- a/pipeline/README.md
+++ b/pipeline/README.md
@@ -1,406 +1,84 @@
 # Biergarten Pipeline
-A high-performance C++23 data pipeline for fetching, parsing, and storing geographic data (countries, states, cities) with brewery metadata generation capabilities. The system supports both mock and LLM-based (llama.cpp) generation modes.
+A C++23 tool for processing geographic data and generating brewery metadata. It uses a local city manifest, parallel Wikipedia enrichment via `std::async`, and local LLM inference via llama.cpp.
 ## Overview
-The pipeline orchestrates **four key stages**:
+The pipeline runs in four stages:
-1. **Download** - Fetches `countries+states+cities.json` from a pinned GitHub commit with optional local filesystem caching
+- **Query**: Loads and samples from a local `locations.json` manifest.
-2. **Parse** - Streams JSON using Boost.JSON's `basic_parser` to extract country/state/city records without loading the entire file into memory
+- **Enrich**: Fetches regional and cultural context from Wikipedia in parallel using `std::async`.
-3. **Store** - Inserts records into a file-based SQLite database with all operations performed sequentially in a single thread
+- **Generate**: Creates authentic brewery names and descriptions using a local GGUF model or a deterministic mock.
-4. **Generate** - Produces brewery metadata or user profiles (mock implementation; supports future LLM integration via llama.cpp)
+- **Log**: Outputs results and metadata summaries via spdlog.
-## System Architecture
+## Implementation Details
-### Data Sources and Formats
+### Concurrency
- **Hierarchical Structure**: Countries array → states per country → cities per state
+- **Async Enrichment**: Wikipedia API lookups are parallelized using `std::async`. Each city is processed in its own thread to hide network latency.
- **Data Fields**:
+- **RAII**: Resource management for libcurl handles and llama.cpp weights is handled via constructors/destructors to ensure clean teardown.
  - `id` (integer)
  - `name` (string)
  - `iso2` / `iso3` (ISO country/state codes)
  - `latitude` / `longitude` (geographic coordinates)
 - **Source**: [dr5hn/countries-states-cities-database](https://github.com/dr5hn/countries-states-cities-database) on GitHub
 - **Output**: Structured SQLite file-based database (`biergarten-pipeline.db`) + structured logging via spdlog
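The async enrichment described under Concurrency can be pictured with a minimal sketch. It assumes one `std::async` task per city launched with `std::launch::async`. `FetchSummary` and `EnrichAll` are hypothetical names, and the lookup is stubbed out rather than calling the Wikipedia API.

```cpp
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for a Wikipedia lookup; a real pipeline would
// issue an HTTP request here.
std::string FetchSummary(const std::string& city) {
  return "summary for " + city;
}

// Launches one task per city and collects the results in input order.
std::vector<std::string> EnrichAll(const std::vector<std::string>& cities) {
  std::vector<std::future<std::string>> pending;
  pending.reserve(cities.size());
  for (const auto& city : cities) {
    // std::launch::async forces a separate thread; the default policy may
    // defer execution until get() is called.
    pending.push_back(std::async(std::launch::async, FetchSummary, city));
  }
  std::vector<std::string> summaries;
  summaries.reserve(pending.size());
  for (auto& task : pending) {
    summaries.push_back(task.get());  // blocks until that task finishes
  }
  return summaries;
}
```

Collecting the futures first and calling `get()` afterwards is what lets the lookups overlap; calling `get()` inside the launch loop would serialize them.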
-### Concurrency Model
+### LLM Logic
-The pipeline currently operates **single-threaded** with sequential stage execution:
+- **Retries**: Includes a 3-attempt loop with automated error correction. If the model returns invalid JSON, the specific error is fed back into the next prompt.
 - **Context Injection**: Wikipedia summaries are injected into the LLM system prompt to ensure descriptions are grounded in actual regional beer culture.
 - **Sampling**: Temperature, top-p, and seeds are configurable via the CLI.
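The retry behavior described above can be sketched as a small loop. This is an illustration under stated assumptions, not the project's actual code: `GenerateWithRetries`, the `Generate` and `Validate` callback types, and the wording of the feedback message are all hypothetical.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical callbacks: `Generate` maps a prompt to raw model output, and
// `Validate` returns an error message, or std::nullopt when output is valid.
using Generate = std::function<std::string(const std::string&)>;
using Validate = std::function<std::optional<std::string>(const std::string&)>;

// Tries up to max_attempts times; on failure, appends the validation error
// to the prompt so the next attempt can correct it.
std::optional<std::string> GenerateWithRetries(const std::string& base_prompt,
                                               const Generate& generate,
                                               const Validate& validate,
                                               int max_attempts = 3) {
  std::string prompt = base_prompt;
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    const std::string output = generate(prompt);
    const std::optional<std::string> error = validate(output);
    if (!error) {
      return output;
    }
    prompt = base_prompt + "\nYour previous reply was rejected: " + *error +
             "\nReturn valid JSON only.";
  }
  return std::nullopt;  // every attempt failed validation
}
```

Rebuilding the prompt from `base_prompt` each time keeps only the most recent error in context, which avoids growing the prompt across attempts.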
-1. **Download Phase**: Main thread blocks while downloading the source JSON file (if not in cache)
+## Hardware & GPU Config
 2. **Parse & Store Phase**: Main thread performs streaming JSON parse with immediate SQLite inserts
-**Thread Safety**: While single-threaded, the `SqliteDatabase` component is **mutex-protected** using `std::mutex` (`dbMutex`) for all database operations. This design enables safe future parallelization without code modifications.
+### Test Machine
 - **Host**: ThinkPad P1 Gen 7 (Fedora 43)
 - **CPU**: Intel Core Ultra 7 155H
 - **GPU**: NVIDIA RTX 2000 Ada Generation
 - **Memory**: 32GB
 - **Model**: Qwen3-8B-Q6-K
 - **Inference**: llama.cpp with CUDA 12.x support
 ### GPU Build Flags
 ```bash
 cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89 ..
 cmake --build . --config Release
 ```
 ## Core Components
-| Component                     | Purpose                                                                                         | Thread Safety                                | Dependencies                                  |
+| Component               | Function                                                          |
-| ----------------------------- | ----------------------------------------------------------------------------------------------- | -------------------------------------------- | --------------------------------------------- |
+| ----------------------- | ----------------------------------------------------------------- |
-| **BiergartenDataGenerator**   | Orchestrates pipeline execution; manages lifecycle of downloader, parser, and generator         | Single-threaded coordinator                  | ApplicationOptions, WebClient, SqliteDatabase |
+| BiergartenDataGenerator | Orchestrates the sampling, enrichment, and generation stages.     |
-| **DataDownloader**            | HTTP fetch with curl; optional filesystem cache; ETag support and retries                       | Blocking I/O; safe for startup               | IWebClient, filesystem                        |
+| WikipediaService        | Fetches and caches summaries for cities and regional beer styles. |
-| **StreamingJsonParser**       | Extends `boost::json::basic_parser`; emits country/state/city via callbacks; tracks parse depth | Single-threaded parse; callbacks thread-safe | Boost.JSON                                    |
+| LlamaGenerator          | Handles local GGUF inference and output validation.               |
-| **JsonLoader**                | Wraps parser; dispatches callbacks for country/state/city; manages WorkQueue lifecycle          | Produces to WorkQueue; safe callbacks        | StreamingJsonParser, SqliteDatabase           |
+| JsonLoader              | Parses the local `locations.json` file into internal structures.  |
-| **SqliteDatabase**            | Manages schema initialization; insert/query methods for geographic data                         | Mutex-guarded all operations                 | SQLite3                                       |
+| CURLWebClient           | libcurl wrapper for parallel Wikipedia API requests.              |
 | **IDataGenerator** (Abstract) | Interface for brewery/user metadata generation                                                  | Stateless virtual methods                    | N/A                                           |
 | **LlamaGenerator**            | LLM-based generation via llama.cpp; configurable sampling (temperature, top-p, seed)            | Manages llama_model* and llama_context*      | llama.cpp, BreweryResult, UserResult          |
 | **MockGenerator**             | Deterministic mock generation using seeded randomization                                        | Stateless; thread-safe                       | N/A                                           |
 | **CURLWebClient**             | HTTP client adapter; URL encoding; file downloads                                               | cURL library bindings                        | libcurl                                       |
 | **WikipediaService**          | (Planned) Wikipedia data lookups for enrichment                                                 | N/A                                          | IWebClient                                    |
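As a rough illustration of the deterministic mock listed in the table, the sketch below seeds a PRNG from the city name so that the same input always yields the same output. `HashName`, `MockBreweryName`, and the word lists are hypothetical and not taken from the project.

```cpp
#include <array>
#include <cstdint>
#include <random>
#include <string>
#include <string_view>

// FNV-1a hash: yields the same seed for the same city on every platform.
std::uint32_t HashName(std::string_view text) {
  std::uint32_t hash = 2166136261u;
  for (unsigned char ch : text) {
    hash ^= ch;
    hash *= 16777619u;
  }
  return hash;
}

// Hypothetical mock: picks name parts from fixed lists using a seeded engine.
std::string MockBreweryName(std::string_view city) {
  static constexpr std::array<std::string_view, 3> kPrefixes = {
      "Old", "Golden", "Iron"};
  static constexpr std::array<std::string_view, 3> kSuffixes = {
      "Brewing Co.", "Brauhaus", "Ale Works"};
  std::mt19937 rng(HashName(city));
  // Plain modulo instead of std::uniform_int_distribution, whose output
  // is implementation-defined.
  const auto prefix = kPrefixes[rng() % kPrefixes.size()];
  const auto suffix = kSuffixes[rng() % kSuffixes.size()];
  return std::string(prefix) + " " + std::string(city) + " " +
         std::string(suffix);
}
```

Because the seed comes from the input rather than from the clock, repeated runs produce identical names, which is useful for tests that run without a model.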
-## Database Schema
+## CLI Options
-SQLite file-based database with **three core tables** and **indexes for fast lookups**:
+```
-
+./biergarten-pipeline --model ./path/to/model.gguf [options]
 ### Countries
 ```sql
 CREATE TABLE countries (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  iso2 TEXT,
  iso3 TEXT
 );
 CREATE INDEX idx_countries_iso2 ON countries(iso2);
 ```
-### States
+| Flag            | Description                                     |
 | --------------- | ----------------------------------------------- |
 | `--mocked`      | Use deterministic mock data instead of an LLM.  |
 | `--model`, `-m` | Path to the GGUF file.                          |
 | `--temperature` | Model temperature (0.0 - 1.0).                  |
 | `--n-ctx`       | Context window size (default: 8192).            |
 | `--cache-dir`   | Directory containing the `locations.json` file. |
-```sql
+## Building
 CREATE TABLE states (
  id INTEGER PRIMARY KEY,
  country_id INTEGER NOT NULL,
  name TEXT NOT NULL,
  iso2 TEXT,
  FOREIGN KEY (country_id) REFERENCES countries(id)
 );
 CREATE INDEX idx_states_country ON states(country_id);
 ```
-### Cities
+### Requirements
-```sql
+- C++23 compiler (GCC 13+ / Clang 16+)
-CREATE TABLE cities (
+- CMake 3.20+
-  id INTEGER PRIMARY KEY,
+- Boost (JSON, Program_options), libcurl
-  state_id INTEGER NOT NULL,
+- CUDA Toolkit 12.x (optional, for GPU builds)
  country_id INTEGER NOT NULL,
  name TEXT NOT NULL,
  latitude REAL,
  longitude REAL,
  FOREIGN KEY (state_id) REFERENCES states(id),
  FOREIGN KEY (country_id) REFERENCES countries(id)
 );
 CREATE INDEX idx_cities_state ON cities(state_id);
 CREATE INDEX idx_cities_country ON cities(country_id);
 ```
-## Architecture Diagram
+### Steps
 ```plantuml
@startuml biergarten-pipeline
 !theme plain
 skinparam monochrome true
 skinparam classBackgroundColor #FFFFFF
 skinparam classBorderColor #000000
 package "Application Layer" {
  class BiergartenDataGenerator {
    - options: ApplicationOptions
    - webClient: IWebClient
    - database: SqliteDatabase
    - generator: IDataGenerator
    --
    + Run() : int
  }
 }
 package "Data Acquisition" {
  class DataDownloader {
    - webClient: IWebClient
    --
    + Download(url: string, filePath: string)
    + DownloadWithCache(url: string, cachePath: string)
  }
  interface IWebClient {
    + DownloadToFile(url: string, filePath: string)
    + Get(url: string) : string
    + UrlEncode(value: string) : string
  }
  class CURLWebClient {
    - globalState: CurlGlobalState
    --
    + DownloadToFile(url: string, filePath: string)
    + Get(url: string) : string
    + UrlEncode(value: string) : string
  }
 }
 package "JSON Processing" {
  class StreamingJsonParser {
    - depth: int
    --
    + on_object_begin()
    + on_object_end()
    + on_array_begin()
    + on_array_end()
    + on_key(str: string)
    + on_string(str: string)
    + on_number(value: int)
  }
  class JsonLoader {
    --
    + LoadWorldCities(jsonPath: string, db: SqliteDatabase)
  }
 }
 package "Data Storage" {
  class SqliteDatabase {
    - db: sqlite3*
    - dbMutex: std::mutex
    --
    + Initialize(dbPath: string)
    + InsertCountry(id: int, name: string, iso2: string, iso3: string)
    + InsertState(id: int, countryId: int, name: string, iso2: string)
    + InsertCity(id: int, stateId: int, countryId: int, name: string, lat: double, lon: double)
    + QueryCountries(limit: int) : vector<Country>
    + QueryStates(limit: int) : vector<State>
    + QueryCities() : vector<City>
    + BeginTransaction()
    + CommitTransaction()
    # InitializeSchema()
  }
  struct Country {
    id: int
    name: string
    iso2: string
    iso3: string
  }
  struct State {
    id: int
    name: string
    iso2: string
    countryId: int
  }
  struct City {
    id: int
    name: string
    countryId: int
  }
 }
 package "Data Generation" {
  interface IDataGenerator {
    + load(modelPath: string)
    + generateBrewery(cityName: string, countryName: string, regionContext: string) : BreweryResult
    + generateUser(locale: string) : UserResult
  }
  class LlamaGenerator {
    - model: llama_model*
    - context: llama_context*
    - sampling_temperature: float
    - sampling_top_p: float
    - sampling_seed: uint32_t
    --
    + load(modelPath: string)
    + generateBrewery(...) : BreweryResult
    + generateUser(locale: string) : UserResult
    + setSamplingOptions(temperature: float, topP: float, seed: int)
    # infer(prompt: string) : string
  }
  class MockGenerator {
    --
    + load(modelPath: string)
    + generateBrewery(...) : BreweryResult
    + generateUser(locale: string) : UserResult
  }
  struct BreweryResult {
    name: string
    description: string
  }
  struct UserResult {
    username: string
    bio: string
  }
 }
 package "Enrichment (Planned)" {
  class WikipediaService {
    - webClient: IWebClient
    --
    + SearchCity(cityName: string, countryName: string) : string
  }
 }
 ' Relationships
 BiergartenDataGenerator --> DataDownloader
 BiergartenDataGenerator --> JsonLoader
 BiergartenDataGenerator --> SqliteDatabase
 BiergartenDataGenerator --> IDataGenerator
 DataDownloader --> IWebClient
 CURLWebClient ..|> IWebClient
 JsonLoader --> StreamingJsonParser
 JsonLoader --> SqliteDatabase
 LlamaGenerator ..|> IDataGenerator
 MockGenerator ..|> IDataGenerator
 SqliteDatabase --> Country
 SqliteDatabase --> State
 SqliteDatabase --> City
 LlamaGenerator --> BreweryResult
 LlamaGenerator --> UserResult
 MockGenerator --> BreweryResult
 MockGenerator --> UserResult
 WikipediaService --> IWebClient
@enduml
 ```
 ## Configuration and Extensibility
 ### Command-Line Arguments
 Boost.Program_options provides named CLI arguments. Running without arguments displays usage instructions.
 ```bash
-./biergarten-pipeline [options]
+mkdir build && cd build
 ```
 **Requirement**: Exactly one of `--mocked` or `--model` must be specified.
 | Argument        | Short | Type   | Purpose                                                         |
 | --------------- | ----- | ------ | --------------------------------------------------------------- |
 | `--mocked`      | -     | flag   | Use mocked generator for brewery/user data                      |
 | `--model`       | `-m`  | string | Path to LLM model file (gguf); mutually exclusive with --mocked |
 | `--cache-dir`   | `-c`  | path   | Directory for cached JSON (default: `/tmp`)                     |
 | `--temperature` | -     | float  | LLM sampling temperature 0.0-1.0 (default: `0.8`)               |
 | `--top-p`       | -     | float  | Nucleus sampling parameter 0.0-1.0 (default: `0.92`)            |
 | `--seed`        | -     | int    | Random seed: -1 for random (default: `-1`)                      |
 | `--help`        | `-h`  | flag   | Show help message                                               |
 **Note**: The data source is always pinned to commit `c5eb7772` (stable 2026-03-28) and cannot be changed.
 **Note**: When `--mocked` is used, any sampling parameters (`--temperature`, `--top-p`, `--seed`) are ignored with a warning.
 ### Usage Examples
 ```bash
 # Mocked generator (deterministic, no LLM required)
 ./biergarten-pipeline --mocked
 # With LLM model
 ./biergarten-pipeline --model ./models/llama.gguf --cache-dir /var/cache
 # Mocked with extra parameters provided (will be ignored with warning)
 ./biergarten-pipeline --mocked --temperature 0.5 --top-p 0.8 --seed 42
 # Show help
 ./biergarten-pipeline --help
 ```
 ## Building and Running
 ### Prerequisites
 - **C++23 compiler** (g++, clang, MSVC)
 - **CMake** 3.20+
 - **curl** (for HTTP downloads)
 - **sqlite3** (database backend)
 - **Boost** 1.75+ (requires Boost.JSON and Boost.Program_options)
 - **spdlog** v1.11.0 (fetched via CMake FetchContent)
 - **llama.cpp** (fetched via CMake FetchContent for LLM inference)
 ### Build
 ```bash
 mkdir -p build
 cd build
 cmake ..
-cmake --build . --target biergarten-pipeline -- -j
+cmake --build . -j$(nproc)
 ```
 ### Run
 ```bash
 ./build/biergarten-pipeline
 ```
 **Output**:
 - Console logs with structured spdlog output
 - Cached JSON file: `/tmp/countries+states+cities.json`
 - SQLite database: `biergarten-pipeline.db` (in output directory)
 ## Code Quality and Static Analysis
 ### Formatting
 This project uses **clang-format** with the **Google C++ style guide**:
 ```bash
 # Apply formatting to all source files
 cmake --build build --target format
 # Check formatting without modifications
 cmake --build build --target format-check
 ```
 ### Static Analysis
 This project uses **clang-tidy** with configurations for Google, modernize, performance, and bug-prone rules (`.clang-tidy`):
 Static analysis runs automatically during compilation if `clang-tidy` is available.
 ## Code Implementation Summary
 ### Key Achievements
 ✅ **Full pipeline implementation** - Download → Parse → Store → Generate
 ✅ **Streaming JSON parser** - Memory-efficient processing via Boost.JSON callbacks
 ✅ **Thread-safe SQLite wrapper** - Mutex-protected database for future parallelization
 ✅ **Flexible data generation** - Abstract IDataGenerator interface supporting both mock and LLM modes
 ✅ **Comprehensive CLI** - Boost.Program_options with sensible defaults
 ✅ **Production-grade logging** - spdlog integration for structured output
 ✅ **Build quality** - CMake with clang-format/clang-tidy integration
 ### Architecture Patterns
 - **Interface-based design**: `IWebClient`, `IDataGenerator` abstract base classes enable substitution and testing
 - **Dependency injection**: Components receive dependencies via constructors (BiergartenDataGenerator)
 - **RAII principle**: SQLite connections and resources managed via destructors
 - **Callback-driven parsing**: Boost.JSON parser emits events to processing callbacks
 - **Transaction-scoped inserts**: BeginTransaction/CommitTransaction for batch performance
 ### External Dependencies
 | Dependency | Version | Purpose                            | Type    |
 | ---------- | ------- | ---------------------------------- | ------- |
 | Boost      | 1.75+   | JSON parsing, CLI argument parsing | Library |
 | SQLite3    | -       | Persistent data storage            | System  |
 | libcurl    | -       | HTTP downloads                     | System  |
 | spdlog     | v1.11.0 | Structured logging                 | Fetched |
 | llama.cpp  | b8611   | LLM inference engine               | Fetched |
 to validate formatting without modifying files.
 clang-tidy runs automatically on the biergarten-pipeline target when available. You can disable it at configure time:
 cmake -DENABLE_CLANG_TIDY=OFF ..
 You can also disable format helper targets:
 cmake -DENABLE_CLANG_FORMAT_TARGETS=OFF ..