# Biergarten Pipeline

A high-performance C++23 data pipeline for fetching, parsing, and storing geographic data (countries, states, cities) with brewery metadata generation capabilities. The system supports both mock and LLM-based (llama.cpp) generation modes.

## Overview

The pipeline orchestrates **four key stages**:

1. **Download** - Fetches `countries+states+cities.json` from a pinned GitHub commit with optional local filesystem caching
2. **Parse** - Streams JSON using Boost.JSON's `basic_parser` to extract country/state/city records without loading the entire file into memory
3. **Store** - Inserts records into a file-based SQLite database with all operations performed sequentially in a single thread
4. **Generate** - Produces brewery metadata or user profiles using either the mock generator or an LLM via llama.cpp

## System Architecture

### Data Sources and Formats

- **Hierarchical Structure**: Countries array → states per country → cities per state
- **Data Fields**:
  - `id` (integer)
  - `name` (string)
  - `iso2` / `iso3` (ISO country/state codes)
  - `latitude` / `longitude` (geographic coordinates)
- **Source**: [dr5hn/countries-states-cities-database](https://github.com/dr5hn/countries-states-cities-database) on GitHub
- **Output**: Structured SQLite file-based database (`biergarten-pipeline.db`) + structured logging via spdlog

### Concurrency Model

The pipeline currently operates **single-threaded** with sequential stage execution:

1. **Download Phase**: Main thread blocks while downloading the source JSON file (if not in cache)
2. **Parse & Store Phase**: Main thread performs streaming JSON parse with immediate SQLite inserts

**Thread Safety**: Although the pipeline runs on a single thread, the `SqliteDatabase` component guards every database operation with a `std::mutex` (`dbMutex`). This design is intended to allow future parallelization without changes to the database layer.

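
The mutex-guarded access described under Thread Safety can be sketched roughly as follows. This is a minimal illustration and not the project's `SqliteDatabase`: the names `GuardedCountryStore` and `CountryRecord` are hypothetical, and an in-memory vector stands in for the SQLite handle so the example compiles without linking sqlite3.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type; the fields mirror the data fields listed above.
struct CountryRecord {
  int id;
  std::string name;
  std::string iso2;
  std::string iso3;
};

// Stand-in for a mutex-protected database wrapper (assumption: the real
// class wraps a sqlite3* handle instead of a vector).
class GuardedCountryStore {
 public:
  void InsertCountry(CountryRecord record) {
    // The lock is held for the full operation and released on scope exit.
    std::lock_guard<std::mutex> lock(dbMutex_);
    rows_.push_back(std::move(record));
  }

  std::size_t Count() const {
    std::lock_guard<std::mutex> lock(dbMutex_);
    return rows_.size();
  }

 private:
  mutable std::mutex dbMutex_;  // mutable so const readers can lock it
  std::vector<CountryRecord> rows_;
};
```

Because every public method takes the lock itself, callers need no locking of their own. The trade-off is one lock acquisition per insert, which is negligible while the pipeline stays single-threaded.
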
## Core Components

| Component | Purpose | Thread Safety | Dependencies |
| ----------------------------- | ----------------------------------------------------------------------------------------------- | -------------------------------------------- | --------------------------------------------- |
| **BiergartenDataGenerator** | Orchestrates pipeline execution; manages lifecycle of downloader, parser, and generator | Single-threaded coordinator | ApplicationOptions, WebClient, SqliteDatabase |
| **DataDownloader** | HTTP fetch with curl; optional filesystem cache; ETag support and retries | Blocking I/O; safe for startup | IWebClient, filesystem |
| **StreamingJsonParser** | Extends `boost::json::basic_parser`; emits country/state/city via callbacks; tracks parse depth | Single-threaded parse; callbacks thread-safe | Boost.JSON |
| **JsonLoader** | Wraps parser; dispatches callbacks for country/state/city; manages WorkQueue lifecycle | Produces to WorkQueue; safe callbacks | StreamingJsonParser, SqliteDatabase |
| **SqliteDatabase** | Manages schema initialization; insert/query methods for geographic data | Mutex-guarded all operations | SQLite3 |
| **IDataGenerator** (Abstract) | Interface for brewery/user metadata generation | Stateless virtual methods | N/A |
| **LlamaGenerator** | LLM-based generation via llama.cpp; configurable sampling (temperature, top-p, seed) | Manages `llama_model*` and `llama_context*` | llama.cpp, BreweryResult, UserResult |
| **MockGenerator** | Deterministic mock generation using seeded randomization | Stateless; thread-safe | N/A |
| **CURLWebClient** | HTTP client adapter; URL encoding; file downloads | cURL library bindings | libcurl |
| **WikipediaService** | (Planned) Wikipedia data lookups for enrichment | N/A | IWebClient |

## Database Schema

SQLite file-based database with **three core tables** and **indexes for fast lookups**:

### Countries

```sql
CREATE TABLE countries (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    iso2 TEXT,
    iso3 TEXT
);
CREATE INDEX idx_countries_iso2 ON countries(iso2);
```

### States

```sql
CREATE TABLE states (
    id INTEGER PRIMARY KEY,
    country_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    iso2 TEXT,
    FOREIGN KEY (country_id) REFERENCES countries(id)
);
CREATE INDEX idx_states_country ON states(country_id);
```

### Cities

```sql
CREATE TABLE cities (
    id INTEGER PRIMARY KEY,
    state_id INTEGER NOT NULL,
    country_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    latitude REAL,
    longitude REAL,
    FOREIGN KEY (state_id) REFERENCES states(id),
    FOREIGN KEY (country_id) REFERENCES countries(id)
);
CREATE INDEX idx_cities_state ON cities(state_id);
CREATE INDEX idx_cities_country ON cities(country_id);
```

## Architecture Diagram

```plantuml
@startuml biergarten-pipeline
!theme plain
skinparam monochrome true
skinparam classBackgroundColor #FFFFFF
skinparam classBorderColor #000000

package "Application Layer" {
  class BiergartenDataGenerator {
    - options: ApplicationOptions
    - webClient: IWebClient
    - database: SqliteDatabase
    - generator: IDataGenerator
    --
    + Run() : int
  }
}

package "Data Acquisition" {
  class DataDownloader {
    - webClient: IWebClient
    --
    + Download(url: string, filePath: string)
    + DownloadWithCache(url: string, cachePath: string)
  }

  interface IWebClient {
    + DownloadToFile(url: string, filePath: string)
    + Get(url: string) : string
    + UrlEncode(value: string) : string
  }

  class CURLWebClient {
    - globalState: CurlGlobalState
    --
    + DownloadToFile(url: string, filePath: string)
    + Get(url: string) : string
    + UrlEncode(value: string) : string
  }
}

package "JSON Processing" {
  class StreamingJsonParser {
    - depth: int
    --
    + on_object_begin()
    + on_object_end()
    + on_array_begin()
    + on_array_end()
    + on_key(str: string)
    + on_string(str: string)
    + on_number(value: int)
  }

  class JsonLoader {
    --
    + LoadWorldCities(jsonPath: string, db: SqliteDatabase)
  }
}

package "Data Storage" {
  class SqliteDatabase {
    - db: sqlite3*
    - dbMutex: std::mutex
    --
    + Initialize(dbPath: string)
    + InsertCountry(id: int, name: string, iso2: string, iso3: string)
    + InsertState(id: int, countryId: int, name: string, iso2: string)
    + InsertCity(id: int, stateId: int, countryId: int, name: string, lat: double, lon: double)
    + QueryCountries(limit: int) : vector
    + QueryStates(limit: int) : vector
    + QueryCities() : vector
    + BeginTransaction()
    + CommitTransaction()
    # InitializeSchema()
  }

  struct Country {
    id: int
    name: string
    iso2: string
    iso3: string
  }

  struct State {
    id: int
    name: string
    iso2: string
    countryId: int
  }

  struct City {
    id: int
    name: string
    countryId: int
  }
}

package "Data Generation" {
  interface IDataGenerator {
    + load(modelPath: string)
    + generateBrewery(cityName: string, countryName: string, regionContext: string) : BreweryResult
    + generateUser(locale: string) : UserResult
  }

  class LlamaGenerator {
    - model: llama_model*
    - context: llama_context*
    - sampling_temperature: float
    - sampling_top_p: float
    - sampling_seed: uint32_t
    --
    + load(modelPath: string)
    + generateBrewery(...) : BreweryResult
    + generateUser(locale: string) : UserResult
    + setSamplingOptions(temperature: float, topP: float, seed: int)
    # infer(prompt: string) : string
  }

  class MockGenerator {
    --
    + load(modelPath: string)
    + generateBrewery(...) : BreweryResult
    + generateUser(locale: string) : UserResult
  }

  struct BreweryResult {
    name: string
    description: string
  }

  struct UserResult {
    username: string
    bio: string
  }
}

package "Enrichment (Planned)" {
  class WikipediaService {
    - webClient: IWebClient
    --
    + SearchCity(cityName: string, countryName: string) : string
  }
}

' Relationships
BiergartenDataGenerator --> DataDownloader
BiergartenDataGenerator --> JsonLoader
BiergartenDataGenerator --> SqliteDatabase
BiergartenDataGenerator --> IDataGenerator

DataDownloader --> IWebClient
CURLWebClient ..|> IWebClient

JsonLoader --> StreamingJsonParser
JsonLoader --> SqliteDatabase

LlamaGenerator ..|> IDataGenerator
MockGenerator ..|> IDataGenerator

SqliteDatabase --> Country
SqliteDatabase --> State
SqliteDatabase --> City

LlamaGenerator --> BreweryResult
LlamaGenerator --> UserResult
MockGenerator --> BreweryResult
MockGenerator --> UserResult

WikipediaService --> IWebClient
@enduml
```

## Configuration and Extensibility

### Command-Line Arguments

Boost.Program_options provides named CLI arguments. Running without arguments displays usage instructions.

```bash
./biergarten-pipeline [options]
```

**Requirement**: Exactly one of `--mocked` or `--model` must be specified.

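
The "exactly one of" rule could be enforced with a small check after option parsing. The sketch below is an assumption about how such a check might look, not the project's code: `GeneratorChoice`, `GeneratorMode`, and `ResolveMode` are hypothetical names, and the example deliberately avoids Boost.Program_options so that it compiles on its own.

```cpp
#include <optional>
#include <string>

// Hypothetical option bag; the project's ApplicationOptions may differ.
struct GeneratorChoice {
  bool mocked = false;                   // set when --mocked is passed
  std::optional<std::string> modelPath;  // set when --model <path> is passed
};

enum class GeneratorMode { kMock, kLlama };

// Returns the selected mode, or std::nullopt when neither or both of
// --mocked and --model were supplied.
std::optional<GeneratorMode> ResolveMode(const GeneratorChoice& choice) {
  const bool hasModel =
      choice.modelPath.has_value() && !choice.modelPath->empty();
  if (choice.mocked == hasModel) {
    return std::nullopt;  // both false (neither given) or both true (conflict)
  }
  return choice.mocked ? GeneratorMode::kMock : GeneratorMode::kLlama;
}
```

Returning `std::nullopt` rather than throwing leaves the caller free to print the usage text and exit with a non-zero status.
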
| Argument | Short | Type | Purpose |
| --------------- | ----- | ------ | ----------------------------------------------------------------- |
| `--mocked` | - | flag | Use mocked generator for brewery/user data |
| `--model` | `-m` | string | Path to LLM model file (gguf); mutually exclusive with `--mocked` |
| `--cache-dir` | `-c` | path | Directory for cached JSON (default: `/tmp`) |
| `--temperature` | - | float | LLM sampling temperature 0.0-1.0 (default: `0.8`) |
| `--top-p` | - | float | Nucleus sampling parameter 0.0-1.0 (default: `0.92`) |
| `--seed` | - | int | Random seed: -1 for random (default: `-1`) |
| `--help` | `-h` | flag | Show help message |

**Note**: The data source is always pinned to commit `c5eb7772` (stable 2026-03-28) and cannot be changed.

**Note**: When `--mocked` is used, any sampling parameters (`--temperature`, `--top-p`, `--seed`) are ignored with a warning.

### Usage Examples

```bash
# Mocked generator (deterministic, no LLM required)
./biergarten-pipeline --mocked

# With LLM model
./biergarten-pipeline --model ./models/llama.gguf --cache-dir /var/cache

# Mocked with extra parameters provided (will be ignored with warning)
./biergarten-pipeline --mocked --temperature 0.5 --top-p 0.8 --seed 42

# Show help
./biergarten-pipeline --help
```

## Building and Running

### Prerequisites

- **C++23 compiler** (g++, clang, MSVC)
- **CMake** 3.20+
- **curl** (for HTTP downloads)
- **sqlite3** (database backend)
- **Boost** 1.75+ (requires Boost.JSON and Boost.Program_options)
- **spdlog** v1.11.0 (fetched via CMake FetchContent)
- **llama.cpp** (fetched via CMake FetchContent for LLM inference)

### Build

```bash
mkdir -p build
cd build
cmake ..
cmake --build . --target biergarten-pipeline -- -j
```

### Run

```bash
# One of --mocked or --model is required; run from the project root
./build/biergarten-pipeline --mocked
```

**Output**:

- Console logs with structured spdlog output
- Cached JSON file: `/tmp/countries+states+cities.json`
- SQLite database: `biergarten-pipeline.db` (in output directory)

## Code Quality and Static Analysis

### Formatting

This project uses **clang-format** with the **Google C++ style guide**:

```bash
# Apply formatting to all source files
cmake --build build --target format

# Check formatting without modifications
cmake --build build --target format-check
```

You can disable the format helper targets at configure time with `cmake -DENABLE_CLANG_FORMAT_TARGETS=OFF ..`.

### Static Analysis

This project uses **clang-tidy** with configurations for Google, modernize, performance, and bug-prone rules (`.clang-tidy`). Static analysis runs automatically on the `biergarten-pipeline` target during compilation if `clang-tidy` is available. You can disable it at configure time with `cmake -DENABLE_CLANG_TIDY=OFF ..`.

## Code Implementation Summary

### Key Achievements

- ✅ **Full pipeline implementation** - Download → Parse → Store → Generate
- ✅ **Streaming JSON parser** - Memory-efficient processing via Boost.JSON callbacks
- ✅ **Thread-safe SQLite wrapper** - Mutex-protected database for future parallelization
- ✅ **Flexible data generation** - Abstract IDataGenerator interface supporting both mock and LLM modes
- ✅ **Comprehensive CLI** - Boost.Program_options with sensible defaults
- ✅ **Production-grade logging** - spdlog integration for structured output
- ✅ **Build quality** - CMake with clang-format/clang-tidy integration

### Architecture Patterns

- **Interface-based design**: `IWebClient`, `IDataGenerator` abstract base classes enable substitution and testing
- **Dependency injection**: Components receive dependencies via constructors (BiergartenDataGenerator)
- **RAII principle**: SQLite connections and resources managed via destructors
- **Callback-driven parsing**: Boost.JSON parser emits events to processing callbacks
- **Transaction-scoped inserts**: BeginTransaction/CommitTransaction for batch performance

### External Dependencies

| Dependency | Version | Purpose | Type |
| ---------- | ------- | ---------------------------------- | ------- |
| Boost | 1.75+ | JSON parsing, CLI argument parsing | Library |
| SQLite3 | - | Persistent data storage | System |
| libcurl | - | HTTP downloads | System |
| spdlog | v1.11.0 | Structured logging | Fetched |
| llama.cpp | b8611 | LLM inference engine | Fetched |
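
As a rough illustration of the interface-based design and constructor injection listed under Architecture Patterns, the sketch below pairs a trimmed-down generator interface with a deterministic mock. It rests on several assumptions: the interface keeps only `generateBrewery` (the real `IDataGenerator` also declares `load` and `generateUser`), and `SeededMockGenerator`, `Pipeline`, and the suffix list are hypothetical names invented for this example.

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <memory>
#include <random>
#include <string>
#include <utility>

struct BreweryResult {
  std::string name;
  std::string description;
};

// Reduced interface for illustration only.
class IDataGenerator {
 public:
  virtual ~IDataGenerator() = default;
  virtual BreweryResult generateBrewery(const std::string& cityName,
                                        const std::string& countryName,
                                        const std::string& regionContext) = 0;
};

// Hypothetical mock: seeds the RNG from the inputs, so identical inputs
// give identical results within a run.
class SeededMockGenerator : public IDataGenerator {
 public:
  BreweryResult generateBrewery(const std::string& cityName,
                                const std::string& countryName,
                                const std::string& regionContext) override {
    const auto seed = static_cast<std::uint32_t>(
        std::hash<std::string>{}(cityName + "|" + countryName));
    std::mt19937 rng(seed);
    static const std::array<const char*, 3> kSuffixes = {
        "Brewing Co.", "Brauhaus", "Ale Works"};
    const char* suffix = kSuffixes[rng() % kSuffixes.size()];
    return {cityName + " " + suffix,
            "Mock brewery in " + cityName + ", " + countryName + " (" +
                regionContext + ")"};
  }
};

// The coordinator receives its generator through the constructor, so a mock
// or an LLM-backed implementation can be swapped in without changing it.
class Pipeline {
 public:
  explicit Pipeline(std::unique_ptr<IDataGenerator> generator)
      : generator_(std::move(generator)) {}

  BreweryResult Describe(const std::string& city, const std::string& country,
                         const std::string& region) {
    return generator_->generateBrewery(city, country, region);
  }

 private:
  std::unique_ptr<IDataGenerator> generator_;
};
```

Constructing `Pipeline` with `std::make_unique<SeededMockGenerator>()` exercises the whole path without loading a model, which is the main testing benefit of this pattern.
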