Aaron Po 248a51b35f cleanup
2026-04-01 21:35:02 -04:00

Biergarten Pipeline

Overview

The pipeline orchestrates five key stages:

  1. Download: Fetches countries+states+cities.json from a pinned GitHub commit with optional local caching.
  2. Parse: Streams JSON using RapidJSON SAX parser, extracting country/state/city records without loading the entire file into memory.
  3. Buffer: Routes city records through a bounded concurrent queue to decouple parsing from writes.
  4. Store: Inserts records into an in-memory SQLite database, with writes serialized by a mutex for thread safety.
  5. Generate: Produces mock brewery metadata for a sample of cities (mockup for future LLM integration).

Architecture

Data Sources and Formats

  • Hierarchical structure: countries array → states per country → cities per state.
  • Fields: id (integer), name (string), iso2 / iso3 (codes), latitude / longitude.
  • Sourced from: dr5hn/countries-states-cities-database on GitHub.

Output: Structured SQLite in-memory database + console logs via spdlog.

Concurrency Architecture

The pipeline splits work across parsing and writing phases:

Main Thread:
  parse_sax() -> Insert countries (direct)
              -> Insert states (direct)
              -> Push CityRecord to WorkQueue

Worker Threads (the WorkQueue API supports a consumer pool; the current build drains the queue on the main thread):
  Pop CityRecord from WorkQueue
  -> InsertCity(db) with mutex protection

Key synchronization primitives:

  • WorkQueue: Bounded (default 1024 items) concurrent queue with blocking push/pop, guarded by mutex + condition variables.
  • SqliteDatabase::dbMutex: Serializes all SQLite operations to avoid SQLITE_BUSY and ensure write safety.

Backpressure: When the WorkQueue fills (≥1024 city records pending), the parser thread blocks until workers drain items.

Component Responsibilities

Component             | Purpose                                                                                                              | Thread Safety
DataDownloader        | GitHub fetch with curl; optional filesystem cache; handles retries and ETags.                                        | Blocking I/O; safe for single-threaded startup.
StreamingJsonParser   | SAX-style RapidJSON handler; emits country/state/city via callbacks; tracks parse state (array depth, key context).  | Single-threaded parse phase; thread-safe callbacks.
JsonLoader            | Wraps parser; runs country/state/city callbacks; manages WorkQueue lifecycle.                                        | Produces to WorkQueue; consumes from callbacks.
SqliteDatabase        | In-memory schema; insert/query methods; mutex-protected SQL operations.                                              | Mutex-guarded; thread-safe concurrent inserts.
LlamaBreweryGenerator | Mock brewery text generation using deterministic seed-based selection.                                               | Stateless; thread-safe method calls.

Database Schema

SQLite in-memory database with three core tables:

Countries

CREATE TABLE countries (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  iso2 TEXT,
  iso3 TEXT
);
CREATE INDEX idx_countries_iso2 ON countries(iso2);

States

CREATE TABLE states (
  id INTEGER PRIMARY KEY,
  country_id INTEGER NOT NULL,
  name TEXT NOT NULL,
  iso2 TEXT,
  FOREIGN KEY (country_id) REFERENCES countries(id)
);
CREATE INDEX idx_states_country ON states(country_id);

Cities

CREATE TABLE cities (
  id INTEGER PRIMARY KEY,
  state_id INTEGER NOT NULL,
  country_id INTEGER NOT NULL,
  name TEXT NOT NULL,
  latitude REAL,
  longitude REAL,
  FOREIGN KEY (state_id) REFERENCES states(id),
  FOREIGN KEY (country_id) REFERENCES countries(id)
);
CREATE INDEX idx_cities_state ON cities(state_id);
CREATE INDEX idx_cities_country ON cities(country_id);

Design rationale:

  • In-memory for performance (no persistent storage; data is regenerated on each run).
  • Foreign keys for referential integrity (declared in the schema; note that SQLite only enforces them when PRAGMA foreign_keys = ON is set on the connection).
  • Indexes on foreign keys for fast lookups during brewery generation.
  • Dual country_id in cities table for direct queries without state joins.

Data Flow

Parse Phase (Main Thread)

  1. DataDownloader::DownloadCountriesDatabase()

    • Constructs GitHub raw-content URL: https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/{commit}/countries+states+cities.json
    • Uses curl with FOLLOWLOCATION and timeout.
    • Caches locally; checks ETag for freshness.
  2. StreamingJsonParser::Parse()

    • Opens file stream; initializes RapidJSON SAX parser with custom handler.
    • Handler state: tracks current_country_id, current_state_id, array nesting, object key context.
    • Country processing (inline): When country object completes, calls db.InsertCountry() directly on main thread.
    • State processing (inline): When state object completes, calls db.InsertState() directly.
    • City processing (buffered): When a city object completes, pushes a CityRecord to JsonLoader's WorkQueue and invokes the onProgress callback if one is registered.
  3. JsonLoader::LoadWorldCities()

    • Registers callbacks with parser.
    • Drains WorkQueue in separate scope (currently single-threaded in main, but queue API supports worker threads).
    • Each city is inserted via db.InsertCity().
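
The handler's state tracking can be illustrated with a simplified, RapidJSON-free sketch. Class, field, and method names here are illustrative (they mimic RapidJSON's SAX event names but are not the actual pipeline API):

```cpp
#include <string>
#include <vector>

struct CityRecord { int id = 0; std::string name; int countryId = 0; int stateId = 0; };

// Tracks which array level the parser is inside (1 = countries, 2 = states,
// 3 = cities) plus the current object key, and emits a CityRecord whenever
// an object at the city level completes.
class GeoHandler {
 public:
  std::vector<CityRecord> cities;  // the real pipeline pushes these to the WorkQueue

  bool Key(const std::string &k) { key_ = k; return true; }
  bool StartArray() { ++depth_; return true; }
  bool EndArray() { --depth_; return true; }
  bool StartObject() { if (depth_ == 3) cur_ = CityRecord{}; return true; }
  bool Uint(unsigned v) {
    if (key_ == "id") {
      if (depth_ == 1) countryId_ = static_cast<int>(v);       // country id
      else if (depth_ == 2) stateId_ = static_cast<int>(v);    // state id
      else if (depth_ == 3) cur_.id = static_cast<int>(v);     // city id
    }
    return true;
  }
  bool String(const std::string &v) {
    if (depth_ == 3 && key_ == "name") cur_.name = v;
    return true;
  }
  bool EndObject() {
    if (depth_ == 3) {  // a city object just completed: emit a record
      cur_.countryId = countryId_;
      cur_.stateId = stateId_;
      cities.push_back(cur_);
    }
    return true;
  }

 private:
  std::string key_;
  int depth_ = 0, countryId_ = 0, stateId_ = 0;
  CityRecord cur_;
};
```

Driving this handler with the event sequence for one country containing one state and one city yields a single CityRecord carrying the enclosing country and state ids.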

Query and Generation Phase (Main Thread)

  1. Database Queries

    • QueryCountries(limit): Retrieve countries; used for progress display.
    • QueryStates(limit): Retrieve states; used for progress display.
    • QueryCities(): Retrieve all city ids + names for brewery generation.
  2. Brewery Generation

    • For each city sample, call LlamaBreweryGenerator::GenerateBrewery(cityName, seed).
    • Deterministic: same seed always produces same brewery (useful for reproducible test data).
    • Returns { name, description } struct.

Concurrency Deep Dive

WorkQueue

A bounded thread-safe queue enabling producer-consumer patterns:

template <typename T> class WorkQueue {
  std::queue<T> queue;                                // FIFO buffer
  std::mutex mutex;                                   // guards all members
  std::condition_variable cv_not_empty, cv_not_full;  // pop/push wait channels
  size_t max_size;                                    // capacity bound (default 1024)
  bool shutdown;                                      // set once by shutdown_queue()
};

push(item):

  • Locks mutex.
  • Waits on cv_not_full until queue is below max_size OR shutdown signaled.
  • Pushes item; notifies one waiter on cv_not_empty.
  • Returns false if shutdown, else true.

pop():

  • Locks mutex.
  • Waits on cv_not_empty until queue has items OR shutdown signaled.
  • Pops and returns std::optional<T>; notifies one waiter on cv_not_full.
  • Returns std::nullopt if shutdown and queue is empty.

shutdown_queue():

  • Sets shutdown = true; notifies all waiters on both condition variables.
  • Causes all waiting pop() calls to return std::nullopt.
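
Under those semantics, a minimal sketch of the queue looks as follows (member and method names mirror the description above, but details may differ from the real implementation):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>

template <typename T>
class WorkQueue {
 public:
  explicit WorkQueue(std::size_t max_size = 1024) : max_size_(max_size) {}

  // Blocks while full; returns false once shutdown has been signaled.
  bool push(T item) {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_not_full_.wait(lock, [&] { return queue_.size() < max_size_ || shutdown_; });
    if (shutdown_) return false;
    queue_.push(std::move(item));
    cv_not_empty_.notify_one();
    return true;
  }

  // Blocks while empty; returns std::nullopt once shut down and drained.
  std::optional<T> pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_not_empty_.wait(lock, [&] { return !queue_.empty() || shutdown_; });
    if (queue_.empty()) return std::nullopt;  // shutdown with nothing left
    T item = std::move(queue_.front());
    queue_.pop();
    cv_not_full_.notify_one();
    return item;
  }

  void shutdown_queue() {
    std::lock_guard<std::mutex> lock(mutex_);
    shutdown_ = true;
    cv_not_empty_.notify_all();  // wake all blocked pop() calls
    cv_not_full_.notify_all();   // wake all blocked push() calls
  }

 private:
  std::queue<T> queue_;
  std::mutex mutex_;
  std::condition_variable cv_not_empty_, cv_not_full_;
  std::size_t max_size_;
  bool shutdown_ = false;
};
```

Note that pop() still drains remaining items after shutdown and only returns std::nullopt once the queue is empty, so no buffered records are lost.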

Why this design:

  • Bounded capacity: Prevents unbounded memory growth when parser outpaces inserts.
  • Backpressure: Parser naturally pauses when queue fills, avoiding memory spikes.
  • Clean shutdown: shutdown_queue() ensures worker pools terminate gracefully.

SqliteDatabase Mutex

All SQLite operations (INSERT, SELECT) are guarded by dbMutex:

std::unique_lock<std::mutex> lock(dbMutex);
int rc = sqlite3_step(stmt);

Why: all threads share a single SQLite connection. Even with SQLite's default serialized threading mode, multi-step statement use (prepare/step/finalize) on a shared connection must not interleave, so a single mutex serializes all queries and avoids SQLITE_BUSY errors.

Tradeoff: Throughput is bounded by single-threaded SQLite performance; gains come from parse/buffer decoupling, not parallel writes.


Error Handling

DataDownloader

  • Network failures: Retries up to 3 times with exponential backoff; throws std::runtime_error on final failure.
  • Caching: Falls back to cached file if download fails and cache exists.
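
The retry policy described above can be sketched as a generic wrapper (the helper name, base delay, and doubling schedule are assumptions for illustration, not the actual DataDownloader internals):

```cpp
#include <chrono>
#include <stdexcept>
#include <string>
#include <thread>

// Calls attempt() up to max_attempts times, sleeping with exponential
// backoff between failures; the final failure's exception propagates.
template <typename F>
auto retry_with_backoff(F &&attempt, int max_attempts = 3,
                        std::chrono::milliseconds base_delay = std::chrono::milliseconds(100)) {
  for (int i = 1;; ++i) {
    try {
      return attempt();
    } catch (const std::exception &) {
      if (i >= max_attempts) throw;  // final failure: rethrow to the caller
      // Backoff doubles per attempt: base, 2*base, 4*base, ...
      std::this_thread::sleep_for(base_delay * (1 << (i - 1)));
    }
  }
}
```

A caller would wrap the curl download in the lambda and catch std::runtime_error at the call site to fall back to the cache.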

Streaming Parser

  • Malformed JSON: RapidJSON SAX handler reports parse errors; caught as exceptions in main.
  • Missing fields: Silently skips incomplete records (e.g., city without latitude).

Database Operations

  • Mutex contention: No explicit backoff; relies on condition variables.
  • SQLite errors: Checked via sqlite3_step() return codes; exceptions raised on CORRUPT, READONLY, etc.

Resilience Design

  • No checkpointing: In-memory database is ephemeral; restart from scratch on failure.
  • Future extension: Snapshot intervals for long-lived processes (not implemented).

Performance Characteristics

Benchmarks (Example: 2M cities on 2024 MacBook Pro)

Stage                         | Time  | Throughput
Download + Cache              | 1s    | ~100 MB/s (network dependent)
Parse (SAX)                   | 2s    | ~50 MB/s
Insert (countries/states)     | <0.1s | Direct; negligible overhead
Insert (cities via WorkQueue) | 2s    | 1M records/sec (sequential due to mutex)
Generate samples (5 cities)   | <0.1s | Mock generation; negligible
Total                         | ~5s   |

Bottlenecks

  • SQLite insertion: Single-threaded mutex lock serializes writes. Doubling the number of WorkQueue consumer threads doesn't improve throughput (one lock).
  • Parse speed: RapidJSON SAX is fast (2s for 100 MB); not the bottleneck.
  • Memory: ~100 MB for in-memory database; suitable for most deployments.

Optimization Opportunities

  • WAL mode: SQLite write-ahead logging reduces reader/writer contention for on-disk databases, but it does not apply to an in-memory DB.
  • Batch inserts: Combine multiple rows in a single transaction; helps if inserting outside the WorkQueue scope.
  • Foreign key lazy-loading: Skip foreign key constraints during bulk load; re-enable for queries. (Not implemented.)

Configuration and Extensibility

Command-Line Arguments

./biergarten-pipeline [modelPath] [cacheDir] [commit]
Arg       | Default      | Purpose
modelPath | ./model.gguf | Path to LLM model (mock implementation; not loaded in current version).
cacheDir  | /tmp         | Directory for cached JSON (e.g., /tmp/countries+states+cities.json).
commit    | c5eb7772     | Git commit hash for consistency (stable 2026-03-28 snapshot).

Examples:

./biergarten-pipeline
./biergarten-pipeline ./models/llama.gguf /var/cache main
./biergarten-pipeline "" /tmp v1.2.3
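
A minimal sketch of the positional-argument handling implied by these examples (the struct and function names are hypothetical; defaults mirror the table above, and an empty string keeps the default, matching the "" usage example):

```cpp
#include <string>

// Hypothetical config struct; defaults match the argument table.
struct PipelineConfig {
  std::string modelPath = "./model.gguf";
  std::string cacheDir = "/tmp";
  std::string commit = "c5eb7772";
};

PipelineConfig ParseArgs(int argc, const char *argv[]) {
  PipelineConfig cfg;
  // Positional overrides; an empty string leaves the default in place.
  if (argc > 1 && argv[1][0] != '\0') cfg.modelPath = argv[1];
  if (argc > 2 && argv[2][0] != '\0') cfg.cacheDir = argv[2];
  if (argc > 3 && argv[3][0] != '\0') cfg.commit = argv[3];
  return cfg;
}
```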

Extending the Generator

Current: LlamaBreweryGenerator::GenerateBrewery() uses deterministic seed-based selection from hardcoded lists.
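
A plausible sketch of such seed-based selection follows; the word lists, hash mixing, and free-function form are invented for illustration and differ from the real hardcoded lists:

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct Brewery { std::string name; std::string description; };

// Deterministic: the same (cityName, seed) pair always yields the same
// brewery, which keeps test data reproducible across runs.
Brewery GenerateBrewery(const std::string &cityName, int seed) {
  static const std::vector<std::string> suffixes = {"Brauhaus", "Brewing Co.", "Biergarten"};
  static const std::vector<std::string> styles = {"Lager", "Pils", "Weizen", "Dunkel"};
  // Mix the seed with the city name so distinct cities diverge deterministically.
  std::uint32_t h = static_cast<std::uint32_t>(seed);
  for (char c : cityName) h = h * 31u + static_cast<unsigned char>(c);
  Brewery b;
  b.name = cityName + " " + suffixes[h % suffixes.size()];
  b.description = "A " + styles[h % styles.size()] + " specialist in " + cityName + ".";
  return b;
}
```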

Future swap points:

  1. Load an actual LLM model in LoadModel(modelPath).
  2. Tokenize city name and context; call model inference.
  3. Validate output (length, format) and rank if multiple candidates.
  4. Cache results to avoid re-inference for repeated cities.

Example stub for future integration:

Brewery LlamaBreweryGenerator::GenerateBrewery(const std::string &cityName, int seed) {
  // TODO: Replace with actual llama.cpp inference, e.g.:
  //   llama_context *ctx = llama_new_context_with_model(model, params);
  //   std::string prompt = "Generate a brewery for " + cityName;
  //   std::string result = llama_inference(ctx, prompt, seed);
  //   return parse_brewery(result);
  return Brewery{};  // placeholder so the stub compiles; the current build
                     // returns the deterministic mock selection instead
}

Logging Configuration

Logging uses spdlog with:

  • Level: Info (can change via spdlog::set_level(spdlog::level::debug) at startup).
  • Format: Plain ASCII; no Unicode box art.
  • Sink: Console (stdout/stderr); can redirect to file.

Current output sample:

[Pipeline] Downloading geographic data from GitHub...
Initializing in-memory SQLite database...
Initializing brewery generator...

=== GEOGRAPHIC DATA OVERVIEW ===
Total records loaded:
  Countries: 195
  States: 5000
  Cities: 150000

Building and Running

Prerequisites

  • C++17 compiler (g++, clang, MSVC).
  • CMake 3.20+.
  • curl (for HTTP downloads).
  • sqlite3 (usually system-provided).
  • RapidJSON (fetched via CMake FetchContent).
  • spdlog (fetched via CMake FetchContent).

Build

mkdir -p build
cd build
cmake ..
cmake --build . --target biergarten-pipeline -- -j

Build artifacts:

  • Executable: build/biergarten-pipeline
  • Intermediate: build/CMakeFiles/, build/_deps/ (RapidJSON, spdlog)

Run

./biergarten-pipeline

Output: Logs to console; caches JSON in /tmp/countries+states+cities.json.

Cleaning

rm -rf build

Development Notes

Code Organization

  • includes/: Public headers (data structures, class APIs).
  • src/: Implementations with inline comments for non-obvious logic.
  • CMakeLists.txt: Build configuration; defines fetch content, compiler flags, linking.

Testing

Currently no automated tests. To add:

  1. Create tests/ folder.
  2. Use CMake to add a test executable.
  3. Test the parser with small JSON fixtures.
  4. Mock the database for isolation.

Debugging

Enable verbose logging:

spdlog::set_level(spdlog::level::debug);

GDB workflow:

gdb ./biergarten-pipeline
(gdb) break src/stream_parser.cpp:50
(gdb) run

Future Enhancements

  1. Real LLM integration: Load and run llama.cpp models.
  2. Persistence: Write brewery data to a database or file.
  3. Distributed parsing: Shard JSON file across multiple parse streams.
  4. Incremental updates: Only insert new records if source updated.
  5. Web API: Expose database via HTTP (brewery lookup, city search).
