mirror of
https://github.com/aaronpo97/the-biergarten-app.git
synced 2026-06-01 01:54:00 +00:00
update readme
@@ -1,406 +1,84 @@
|
|||||||
# Biergarten Pipeline
|
# Biergarten Pipeline
|
||||||
|
|
||||||
A high-performance C++23 data pipeline for fetching, parsing, and storing geographic data (countries, states, cities) with brewery metadata generation capabilities. The system supports both mock and LLM-based (llama.cpp) generation modes.
|
A C++23 tool for processing geographic data and generating brewery metadata. It uses a local city manifest, parallel Wikipedia enrichment via `std::async`, and local LLM inference via llama.cpp.
|
||||||
|
|
||||||
## Overview
|
## Overview
|
||||||
|
|
||||||
The pipeline orchestrates **four key stages**:
|
The pipeline runs in four stages:
|
||||||
|
|
||||||
1. **Download** - Fetches `countries+states+cities.json` from a pinned GitHub commit with optional local filesystem caching
|
- **Query**: Loads and samples from a local `locations.json` manifest.
|
||||||
2. **Parse** - Streams JSON using Boost.JSON's `basic_parser` to extract country/state/city records without loading the entire file into memory
|
- **Enrich**: Fetches regional and cultural context from Wikipedia in parallel using `std::async`.
|
||||||
3. **Store** - Inserts records into a file-based SQLite database with all operations performed sequentially in a single thread
|
- **Generate**: Creates authentic brewery names and descriptions using a local GGUF model or a deterministic mock.
|
||||||
4. **Generate** - Produces brewery metadata or user profiles (mock implementation; supports future LLM integration via llama.cpp)
|
- **Log**: Outputs results and metadata summaries via spdlog.
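
As an illustration of the Query stage, sampling from an already loaded manifest could look roughly like the sketch below. `SampleCities` and its parameters are hypothetical names chosen for this example, not taken from the project; the sketch only assumes the manifest has been parsed into a list of city names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper: picks up to `count` cities from the loaded manifest.
// A fixed seed makes the selection reproducible between runs.
std::vector<std::string> SampleCities(const std::vector<std::string>& manifest,
                                      std::size_t count, std::uint32_t seed) {
  std::mt19937 rng(seed);
  std::vector<std::string> picked;
  picked.reserve(std::min(count, manifest.size()));
  // std::sample copies min(count, manifest.size()) elements without repeats.
  std::sample(manifest.begin(), manifest.end(), std::back_inserter(picked),
              count, rng);
  return picked;
}
```

Seeding the generator explicitly keeps a run repeatable, which fits the deterministic mock mode described above.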
|
||||||
|
|
||||||
## System Architecture
|
## Implementation Details
|
||||||
|
|
||||||
### Data Sources and Formats
|
### Concurrency
|
||||||
|
|
||||||
- **Hierarchical Structure**: Countries array → states per country → cities per state
|
- **Async Enrichment**: Wikipedia API lookups are parallelized using `std::async`. Each city is processed in its own thread to hide network latency.
|
||||||
- **Data Fields**:
|
- **RAII**: Resource management for libcurl handles and llama.cpp weights is handled via constructors/destructors to ensure clean teardown.
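
A minimal sketch of the fan-out pattern described under Async Enrichment, assuming one task per city. `FetchSummary` is a hypothetical stand-in for the real Wikipedia lookup (it performs no network I/O here), and `EnrichAll` is an illustrative name, not the project's API.

```cpp
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for a Wikipedia lookup. The real service would
// issue an HTTP request; this version only builds a placeholder string.
std::string FetchSummary(const std::string& city) {
  return "Summary for " + city;
}

// Launches one asynchronous task per city, then collects the results
// in the same order as the input.
std::vector<std::string> EnrichAll(const std::vector<std::string>& cities) {
  std::vector<std::future<std::string>> pending;
  pending.reserve(cities.size());
  for (const auto& city : cities) {
    // std::launch::async requests a separate thread for each lookup.
    pending.push_back(std::async(std::launch::async, FetchSummary, city));
  }

  std::vector<std::string> summaries;
  summaries.reserve(pending.size());
  for (auto& task : pending) {
    summaries.push_back(task.get());  // Blocks until that lookup finishes.
  }
  return summaries;
}
```

Passing `std::launch::async` explicitly matters in this sketch: with the default launch policy the implementation may defer the call until `get()`, which would serialize the lookups.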
|
||||||
- `id` (integer)
|
|
||||||
- `name` (string)
|
|
||||||
- `iso2` / `iso3` (ISO country/state codes)
|
|
||||||
- `latitude` / `longitude` (geographic coordinates)
|
|
||||||
- **Source**: [dr5hn/countries-states-cities-database](https://github.com/dr5hn/countries-states-cities-database) on GitHub
|
|
||||||
- **Output**: Structured SQLite file-based database (`biergarten-pipeline.db`) + structured logging via spdlog
|
|
||||||
|
|
||||||
### Concurrency Model
|
### LLM Logic
|
||||||
|
|
||||||
The pipeline currently operates **single-threaded** with sequential stage execution:
|
- **Retries**: Includes a 3-attempt loop with automated error correction. If the model returns invalid JSON, the specific error is fed back into the next prompt.
|
||||||
|
- **Context Injection**: Wikipedia summaries are injected into the LLM system prompt to ensure descriptions are grounded in actual regional beer culture.
|
||||||
|
- **Sampling**: Temperature, top-p, and seeds are configurable via the CLI.
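
The retry behaviour described above can be sketched as a loop that appends the validation error to the next prompt. All names here (`InferFn`, `ValidateFn`, `GenerateWithRetries`) and the wording of the feedback message are assumptions made for illustration; the project's actual prompt format is not shown in this README.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical callable types: InferFn runs the model on a prompt;
// ValidateFn returns an error message for bad output, or std::nullopt
// when the output is acceptable.
using InferFn = std::function<std::string(const std::string&)>;
using ValidateFn =
    std::function<std::optional<std::string>(const std::string&)>;

// Tries up to `max_attempts` times. After each failure, the validation
// error is fed back to the model in the next prompt.
std::optional<std::string> GenerateWithRetries(const std::string& base_prompt,
                                               const InferFn& infer,
                                               const ValidateFn& validate,
                                               int max_attempts = 3) {
  std::string prompt = base_prompt;
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    std::string output = infer(prompt);
    std::optional<std::string> error = validate(output);
    if (!error) {
      return output;  // Valid output: stop retrying.
    }
    prompt = base_prompt + "\n\nYour previous reply was rejected: " + *error +
             "\nReply with valid JSON only.";
  }
  return std::nullopt;  // Every attempt failed validation.
}
```

Keeping inference and validation behind callables lets the same loop be exercised with a fake model in tests.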
|
||||||
|
|
||||||
1. **Download Phase**: Main thread blocks while downloading the source JSON file (if not in cache)
|
## Hardware & GPU Config
|
||||||
2. **Parse & Store Phase**: Main thread performs streaming JSON parse with immediate SQLite inserts
|
|
||||||
|
|
||||||
**Thread Safety**: While single-threaded, the `SqliteDatabase` component is **mutex-protected** using `std::mutex` (`dbMutex`) for all database operations. This design enables safe future parallelization without code modifications.
|
### Test Machine
|
||||||
|
|
||||||
|
- **Host**: ThinkPad P1 Gen 7 (Fedora 43)
|
||||||
|
- **CPU**: Intel Core Ultra 7 155H
|
||||||
|
- **GPU**: NVIDIA RTX 2000 Ada Generation
|
||||||
|
- **Memory**: 32GB
|
||||||
|
- **Model**: Qwen3-8B-Q6-K
|
||||||
|
- **Inference**: llama.cpp with CUDA 12.x support
|
||||||
|
|
||||||
|
### GPU Build Flags
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89 ..
|
||||||
|
cmake --build . --config Release
|
||||||
|
```
|
||||||
|
|
||||||
## Core Components
|
## Core Components
|
||||||
|
|
||||||
| Component | Purpose | Thread Safety | Dependencies |
|
| Component | Function |
|
||||||
| ----------------------------- | ----------------------------------------------------------------------------------------------- | -------------------------------------------- | --------------------------------------------- |
|
| ----------------------- | ----------------------------------------------------------------- |
|
||||||
| **BiergartenDataGenerator** | Orchestrates pipeline execution; manages lifecycle of downloader, parser, and generator | Single-threaded coordinator | ApplicationOptions, WebClient, SqliteDatabase |
|
| BiergartenDataGenerator | Orchestrates the sampling, enrichment, and generation stages. |
|
||||||
| **DataDownloader** | HTTP fetch with curl; optional filesystem cache; ETag support and retries | Blocking I/O; safe for startup | IWebClient, filesystem |
|
| WikipediaService | Fetches and caches summaries for cities and regional beer styles. |
|
||||||
| **StreamingJsonParser** | Extends `boost::json::basic_parser`; emits country/state/city via callbacks; tracks parse depth | Single-threaded parse; callbacks thread-safe | Boost.JSON |
|
| LlamaGenerator | Handles local GGUF inference and output validation. |
|
||||||
| **JsonLoader** | Wraps parser; dispatches callbacks for country/state/city; manages WorkQueue lifecycle | Produces to WorkQueue; safe callbacks | StreamingJsonParser, SqliteDatabase |
|
| JsonLoader | Parses the local `locations.json` file into internal structures. |
|
||||||
| **SqliteDatabase** | Manages schema initialization; insert/query methods for geographic data | Mutex-guarded all operations | SQLite3 |
|
| CURLWebClient | libcurl wrapper for parallel Wikipedia API requests. |
|
||||||
| **IDataGenerator** (Abstract) | Interface for brewery/user metadata generation | Stateless virtual methods | N/A |
|
|
||||||
| **LlamaGenerator** | LLM-based generation via llama.cpp; configurable sampling (temperature, top-p, seed) | Manages llama_model* and llama_context* | llama.cpp, BreweryResult, UserResult |
|
|
||||||
| **MockGenerator** | Deterministic mock generation using seeded randomization | Stateless; thread-safe | N/A |
|
|
||||||
| **CURLWebClient** | HTTP client adapter; URL encoding; file downloads | cURL library bindings | libcurl |
|
|
||||||
| **WikipediaService** | (Planned) Wikipedia data lookups for enrichment | N/A | IWebClient |
|
|
||||||
|
|
||||||
## Database Schema
|
## CLI Options
|
||||||
|
|
||||||
SQLite file-based database with **three core tables** and **indexes for fast lookups**:
|
```
|
||||||
|
./biergarten-pipeline --model ./path/to/model.gguf [options]
|
||||||
### Countries
|
|
||||||
|
|
||||||
```sql
|
|
||||||
CREATE TABLE countries (
|
|
||||||
id INTEGER PRIMARY KEY,
|
|
||||||
name TEXT NOT NULL,
|
|
||||||
iso2 TEXT,
|
|
||||||
iso3 TEXT
|
|
||||||
);
|
|
||||||
CREATE INDEX idx_countries_iso2 ON countries(iso2);
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### States
|
| Flag | Description |
|
||||||
|
| --------------- | ----------------------------------------------- |
|
||||||
|
| `--mocked` | Use deterministic mock data instead of an LLM. |
|
||||||
|
| `--model`, `-m` | Path to the GGUF file. |
|
||||||
|
| `--temperature` | Model temperature (0.0 - 1.0). |
|
||||||
|
| `--n-ctx` | Context window size (default: 8192). |
|
||||||
|
| `--cache-dir` | Directory containing the `locations.json` file. |
|
||||||
|
|
||||||
```sql
|
## Building
|
||||||
CREATE TABLE states (
|
|
||||||
id INTEGER PRIMARY KEY,
|
|
||||||
country_id INTEGER NOT NULL,
|
|
||||||
name TEXT NOT NULL,
|
|
||||||
iso2 TEXT,
|
|
||||||
FOREIGN KEY (country_id) REFERENCES countries(id)
|
|
||||||
);
|
|
||||||
CREATE INDEX idx_states_country ON states(country_id);
|
|
||||||
```
|
|
||||||
|
|
||||||
### Cities
|
### Requirements
|
||||||
|
|
||||||
```sql
|
- C++23 compiler (GCC 13+ / Clang 16+)
|
||||||
CREATE TABLE cities (
|
- CMake 3.20+
|
||||||
id INTEGER PRIMARY KEY,
|
- Boost (JSON, Program_options), libcurl
|
||||||
state_id INTEGER NOT NULL,
|
- CUDA Toolkit 12.x (optional for GPU)
|
||||||
country_id INTEGER NOT NULL,
|
|
||||||
name TEXT NOT NULL,
|
|
||||||
latitude REAL,
|
|
||||||
longitude REAL,
|
|
||||||
FOREIGN KEY (state_id) REFERENCES states(id),
|
|
||||||
FOREIGN KEY (country_id) REFERENCES countries(id)
|
|
||||||
);
|
|
||||||
CREATE INDEX idx_cities_state ON cities(state_id);
|
|
||||||
CREATE INDEX idx_cities_country ON cities(country_id);
|
|
||||||
```
|
|
||||||
|
|
||||||
## Architecture Diagram
|
### Steps
|
||||||
|
|
||||||
```plantuml
|
|
||||||
@startuml biergarten-pipeline
|
|
||||||
!theme plain
|
|
||||||
skinparam monochrome true
|
|
||||||
skinparam classBackgroundColor #FFFFFF
|
|
||||||
skinparam classBorderColor #000000
|
|
||||||
|
|
||||||
package "Application Layer" {
|
|
||||||
class BiergartenDataGenerator {
|
|
||||||
- options: ApplicationOptions
|
|
||||||
- webClient: IWebClient
|
|
||||||
- database: SqliteDatabase
|
|
||||||
- generator: IDataGenerator
|
|
||||||
--
|
|
||||||
+ Run() : int
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
package "Data Acquisition" {
|
|
||||||
class DataDownloader {
|
|
||||||
- webClient: IWebClient
|
|
||||||
--
|
|
||||||
+ Download(url: string, filePath: string)
|
|
||||||
+ DownloadWithCache(url: string, cachePath: string)
|
|
||||||
}
|
|
||||||
|
|
||||||
interface IWebClient {
|
|
||||||
+ DownloadToFile(url: string, filePath: string)
|
|
||||||
+ Get(url: string) : string
|
|
||||||
+ UrlEncode(value: string) : string
|
|
||||||
}
|
|
||||||
|
|
||||||
class CURLWebClient {
|
|
||||||
- globalState: CurlGlobalState
|
|
||||||
--
|
|
||||||
+ DownloadToFile(url: string, filePath: string)
|
|
||||||
+ Get(url: string) : string
|
|
||||||
+ UrlEncode(value: string) : string
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
package "JSON Processing" {
|
|
||||||
class StreamingJsonParser {
|
|
||||||
- depth: int
|
|
||||||
--
|
|
||||||
+ on_object_begin()
|
|
||||||
+ on_object_end()
|
|
||||||
+ on_array_begin()
|
|
||||||
+ on_array_end()
|
|
||||||
+ on_key(str: string)
|
|
||||||
+ on_string(str: string)
|
|
||||||
+ on_number(value: int)
|
|
||||||
}
|
|
||||||
|
|
||||||
class JsonLoader {
|
|
||||||
--
|
|
||||||
+ LoadWorldCities(jsonPath: string, db: SqliteDatabase)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
package "Data Storage" {
|
|
||||||
class SqliteDatabase {
|
|
||||||
- db: sqlite3*
|
|
||||||
- dbMutex: std::mutex
|
|
||||||
--
|
|
||||||
+ Initialize(dbPath: string)
|
|
||||||
+ InsertCountry(id: int, name: string, iso2: string, iso3: string)
|
|
||||||
+ InsertState(id: int, countryId: int, name: string, iso2: string)
|
|
||||||
+ InsertCity(id: int, stateId: int, countryId: int, name: string, lat: double, lon: double)
|
|
||||||
+ QueryCountries(limit: int) : vector<Country>
|
|
||||||
+ QueryStates(limit: int) : vector<State>
|
|
||||||
+ QueryCities() : vector<City>
|
|
||||||
+ BeginTransaction()
|
|
||||||
+ CommitTransaction()
|
|
||||||
# InitializeSchema()
|
|
||||||
}
|
|
||||||
|
|
||||||
struct Country {
|
|
||||||
id: int
|
|
||||||
name: string
|
|
||||||
iso2: string
|
|
||||||
iso3: string
|
|
||||||
}
|
|
||||||
|
|
||||||
struct State {
|
|
||||||
id: int
|
|
||||||
name: string
|
|
||||||
iso2: string
|
|
||||||
countryId: int
|
|
||||||
}
|
|
||||||
|
|
||||||
struct City {
|
|
||||||
id: int
|
|
||||||
name: string
|
|
||||||
countryId: int
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
package "Data Generation" {
|
|
||||||
interface IDataGenerator {
|
|
||||||
+ load(modelPath: string)
|
|
||||||
+ generateBrewery(cityName: string, countryName: string, regionContext: string) : BreweryResult
|
|
||||||
+ generateUser(locale: string) : UserResult
|
|
||||||
}
|
|
||||||
|
|
||||||
class LlamaGenerator {
|
|
||||||
- model: llama_model*
|
|
||||||
- context: llama_context*
|
|
||||||
- sampling_temperature: float
|
|
||||||
- sampling_top_p: float
|
|
||||||
- sampling_seed: uint32_t
|
|
||||||
--
|
|
||||||
+ load(modelPath: string)
|
|
||||||
+ generateBrewery(...) : BreweryResult
|
|
||||||
+ generateUser(locale: string) : UserResult
|
|
||||||
+ setSamplingOptions(temperature: float, topP: float, seed: int)
|
|
||||||
# infer(prompt: string) : string
|
|
||||||
}
|
|
||||||
|
|
||||||
class MockGenerator {
|
|
||||||
--
|
|
||||||
+ load(modelPath: string)
|
|
||||||
+ generateBrewery(...) : BreweryResult
|
|
||||||
+ generateUser(locale: string) : UserResult
|
|
||||||
}
|
|
||||||
|
|
||||||
struct BreweryResult {
|
|
||||||
name: string
|
|
||||||
description: string
|
|
||||||
}
|
|
||||||
|
|
||||||
struct UserResult {
|
|
||||||
username: string
|
|
||||||
bio: string
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
package "Enrichment (Planned)" {
|
|
||||||
class WikipediaService {
|
|
||||||
- webClient: IWebClient
|
|
||||||
--
|
|
||||||
+ SearchCity(cityName: string, countryName: string) : string
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
' Relationships
|
|
||||||
BiergartenDataGenerator --> DataDownloader
|
|
||||||
BiergartenDataGenerator --> JsonLoader
|
|
||||||
BiergartenDataGenerator --> SqliteDatabase
|
|
||||||
BiergartenDataGenerator --> IDataGenerator
|
|
||||||
|
|
||||||
DataDownloader --> IWebClient
|
|
||||||
CURLWebClient ..|> IWebClient
|
|
||||||
|
|
||||||
JsonLoader --> StreamingJsonParser
|
|
||||||
JsonLoader --> SqliteDatabase
|
|
||||||
|
|
||||||
LlamaGenerator ..|> IDataGenerator
|
|
||||||
MockGenerator ..|> IDataGenerator
|
|
||||||
|
|
||||||
SqliteDatabase --> Country
|
|
||||||
SqliteDatabase --> State
|
|
||||||
SqliteDatabase --> City
|
|
||||||
|
|
||||||
LlamaGenerator --> BreweryResult
|
|
||||||
LlamaGenerator --> UserResult
|
|
||||||
MockGenerator --> BreweryResult
|
|
||||||
MockGenerator --> UserResult
|
|
||||||
|
|
||||||
WikipediaService --> IWebClient
|
|
||||||
|
|
||||||
@enduml
|
|
||||||
```
|
|
||||||
|
|
||||||
## Configuration and Extensibility
|
|
||||||
|
|
||||||
### Command-Line Arguments
|
|
||||||
|
|
||||||
Boost.Program_options provides named CLI arguments. Running without arguments displays usage instructions.
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
./biergarten-pipeline [options]
|
mkdir build && cd build
|
||||||
```
|
|
||||||
|
|
||||||
**Requirement**: Exactly one of `--mocked` or `--model` must be specified.
|
|
||||||
|
|
||||||
| Argument | Short | Type | Purpose |
|
|
||||||
| --------------- | ----- | ------ | --------------------------------------------------------------- |
|
|
||||||
| `--mocked` | - | flag | Use mocked generator for brewery/user data |
|
|
||||||
| `--model` | `-m` | string | Path to LLM model file (gguf); mutually exclusive with --mocked |
|
|
||||||
| `--cache-dir` | `-c` | path | Directory for cached JSON (default: `/tmp`) |
|
|
||||||
| `--temperature` | - | float | LLM sampling temperature 0.0-1.0 (default: `0.8`) |
|
|
||||||
| `--top-p` | - | float | Nucleus sampling parameter 0.0-1.0 (default: `0.92`) |
|
|
||||||
| `--seed` | - | int | Random seed: -1 for random (default: `-1`) |
|
|
||||||
| `--help` | `-h` | flag | Show help message |
|
|
||||||
|
|
||||||
**Note**: The data source is always pinned to commit `c5eb7772` (stable 2026-03-28) and cannot be changed.
|
|
||||||
|
|
||||||
**Note**: When `--mocked` is used, any sampling parameters (`--temperature`, `--top-p`, `--seed`) are ignored with a warning.
|
|
||||||
|
|
||||||
### Usage Examples
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Mocked generator (deterministic, no LLM required)
|
|
||||||
./biergarten-pipeline --mocked
|
|
||||||
|
|
||||||
# With LLM model
|
|
||||||
./biergarten-pipeline --model ./models/llama.gguf --cache-dir /var/cache
|
|
||||||
|
|
||||||
# Mocked with extra parameters provided (will be ignored with warning)
|
|
||||||
./biergarten-pipeline --mocked --temperature 0.5 --top-p 0.8 --seed 42
|
|
||||||
|
|
||||||
# Show help
|
|
||||||
./biergarten-pipeline --help
|
|
||||||
```
|
|
||||||
|
|
||||||
## Building and Running
|
|
||||||
|
|
||||||
### Prerequisites
|
|
||||||
|
|
||||||
- **C++23 compiler** (g++, clang, MSVC)
|
|
||||||
- **CMake** 3.20+
|
|
||||||
- **curl** (for HTTP downloads)
|
|
||||||
- **sqlite3** (database backend)
|
|
||||||
- **Boost** 1.75+ (requires Boost.JSON and Boost.Program_options)
|
|
||||||
- **spdlog** v1.11.0 (fetched via CMake FetchContent)
|
|
||||||
- **llama.cpp** (fetched via CMake FetchContent for LLM inference)
|
|
||||||
|
|
||||||
### Build
|
|
||||||
|
|
||||||
```bash
|
|
||||||
mkdir -p build
|
|
||||||
cd build
|
|
||||||
cmake ..
|
cmake ..
|
||||||
cmake --build . --target biergarten-pipeline -- -j
|
cmake --build . -j$(nproc)
|
||||||
```
|
```
|
||||||
|
|
||||||
### Run
|
|
||||||
|
|
||||||
```bash
|
|
||||||
./build/biergarten-pipeline
|
|
||||||
```
|
|
||||||
|
|
||||||
**Output**:
|
|
||||||
|
|
||||||
- Console logs with structured spdlog output
|
|
||||||
- Cached JSON file: `/tmp/countries+states+cities.json`
|
|
||||||
- SQLite database: `biergarten-pipeline.db` (in output directory)
|
|
||||||
|
|
||||||
## Code Quality and Static Analysis
|
|
||||||
|
|
||||||
### Formatting
|
|
||||||
|
|
||||||
This project uses **clang-format** with the **Google C++ style guide**:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Apply formatting to all source files
|
|
||||||
cmake --build build --target format
|
|
||||||
|
|
||||||
# Check formatting without modifications
|
|
||||||
cmake --build build --target format-check
|
|
||||||
```
|
|
||||||
|
|
||||||
### Static Analysis
|
|
||||||
|
|
||||||
This project uses **clang-tidy** with configurations for Google, modernize, performance, and bug-prone rules (see `.clang-tidy`).
|
|
||||||
|
|
||||||
Static analysis runs automatically during compilation if `clang-tidy` is available.
|
|
||||||
|
|
||||||
## Code Implementation Summary
|
|
||||||
|
|
||||||
### Key Achievements
|
|
||||||
|
|
||||||
✅ **Full pipeline implementation** - Download → Parse → Store → Generate
|
|
||||||
✅ **Streaming JSON parser** - Memory-efficient processing via Boost.JSON callbacks
|
|
||||||
✅ **Thread-safe SQLite wrapper** - Mutex-protected database for future parallelization
|
|
||||||
✅ **Flexible data generation** - Abstract IDataGenerator interface supporting both mock and LLM modes
|
|
||||||
✅ **Comprehensive CLI** - Boost.Program_options with sensible defaults
|
|
||||||
✅ **Production-grade logging** - spdlog integration for structured output
|
|
||||||
✅ **Build quality** - CMake with clang-format/clang-tidy integration
|
|
||||||
|
|
||||||
### Architecture Patterns
|
|
||||||
|
|
||||||
- **Interface-based design**: `IWebClient`, `IDataGenerator` abstract base classes enable substitution and testing
|
|
||||||
- **Dependency injection**: Components receive dependencies via constructors (BiergartenDataGenerator)
|
|
||||||
- **RAII principle**: SQLite connections and resources managed via destructors
|
|
||||||
- **Callback-driven parsing**: Boost.JSON parser emits events to processing callbacks
|
|
||||||
- **Transaction-scoped inserts**: BeginTransaction/CommitTransaction for batch performance
|
|
||||||
|
|
||||||
### External Dependencies
|
|
||||||
|
|
||||||
| Dependency | Version | Purpose | Type |
|
|
||||||
| ---------- | ------- | ---------------------------------- | ------- |
|
|
||||||
| Boost | 1.75+ | JSON parsing, CLI argument parsing | Library |
|
|
||||||
| SQLite3 | - | Persistent data storage | System |
|
|
||||||
| libcurl | - | HTTP downloads | System |
|
|
||||||
| spdlog | v1.11.0 | Structured logging | Fetched |
|
|
||||||
| llama.cpp | b8611 | LLM inference engine | Fetched |
|
|
||||||
|
|
||||||
Use `cmake --build build --target format-check` to validate formatting without modifying files.
|
|
||||||
|
|
||||||
clang-tidy runs automatically on the biergarten-pipeline target when available. You can disable it at configure time:
|
|
||||||
|
|
||||||
cmake -DENABLE_CLANG_TIDY=OFF ..
|
|
||||||
|
|
||||||
You can also disable format helper targets:
|
|
||||||
|
|
||||||
cmake -DENABLE_CLANG_FORMAT_TARGETS=OFF ..
|
|
||||||
|
|||||||