load cities from external source, develop multithreaded parser

Aaron Po
2026-04-01 00:23:55 -04:00
parent 7f1ca2050c
commit f3553eefc9
13 changed files with 1417 additions and 645 deletions

@@ -1,128 +1,266 @@
# Pipeline Guide
# Brewery Pipeline Documentation Index
This guide documents the end-to-end pipeline workflow for:
Complete guide to all pipeline documentation - choose your learning path based on your needs.
- Building the C++ pipeline executable
- Installing a lightweight GGUF model for llama.cpp
- Running the pipeline with either default or explicit model path
- Re-running from a clean build directory
---
## Prerequisites
## Quick Navigation
- CMake 3.20+
- A C++ compiler (Apple Clang on macOS works)
- Internet access to download model files
- Hugging Face CLI (`hf`) from `huggingface_hub`
### 🚀 I Want to Run It Now (5 minutes)
## Build
Start here if you want to see the pipeline in action immediately:
From repository root:
1. **[QUICK-START.md](./QUICK-START.md)** (this directory)
- Copy-paste build commands
- Run the pipeline in 2 minutes
- Make 4 simple modifications to learn
- Common troubleshooting
```bash
cmake -S pipeline -B pipeline/dist
cmake --build pipeline/dist -j4
```
---
Expected executable:
### 📚 I Want to Understand the Code (1 hour)
- `pipeline/dist/biergarten-pipeline`
To learn how the pipeline works internally:
## Install Hugging Face CLI
1. **[QUICK-START.md](./QUICK-START.md)** - Run it first (5 min)
2. **[CODE-READING-GUIDE.md](./CODE-READING-GUIDE.md)** - Learn to read code (30 min)
- Recommended reading order for all 5 source files
- Code pattern explanations with examples
- Trace a city through the entire pipeline
- Testing strategies
3. **[../docs/pipeline-guide.md](../docs/pipeline-guide.md)** - Full system overview (20 min)
- Architecture and data flow diagrams
- Description of each component
- Performance characteristics
Recommended on macOS:
---
```bash
brew install pipx
pipx ensurepath
pipx install huggingface_hub
```
### 🏗️ I Want to Understand the Architecture (1.5 hours)
If your shell cannot find `hf`, use the full path:
To understand WHY the system was designed this way:
- `~/.local/bin/hf`
1. Read the above "Understand the Code" path first
2. **[../docs/pipeline-architecture.md](../docs/pipeline-architecture.md)** - Design deep dive (30 min)
- 5 core design principles with trade-offs
- Detailed threading model (3-level hierarchy)
- Mutex contention analysis
- Future optimization opportunities
- Lessons learned
## Install a Lightweight Model (POC)
---
The recommended proof-of-concept model is:
### 💻 I Want to Modify the Code (2+ hours)
- `Qwen/Qwen2.5-0.5B-Instruct-GGUF`
- File: `qwen2.5-0.5b-instruct-q4_k_m.gguf`
To extend or improve the pipeline:
From `pipeline/dist`:
1. Complete the "Understand the Architecture" path above
2. Choose your enhancement:
- **Add Real LLM**: See "Future Implementation" in [../docs/pipeline-architecture.md](../docs/pipeline-architecture.md)
- **Export Results**: Modify [src/main.cpp](./src/main.cpp) to write JSON
- **Change Templates**: Edit [src/generator.cpp](./src/generator.cpp)
- **Add Features**: Read inline code comments for guidance
```bash
cd pipeline/dist
mkdir -p models
~/.local/bin/hf download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q4_k_m.gguf --local-dir models
```
---
## Run
## Documentation File Structure
### Option A: Explicit model path (recommended)
### In `/pipeline/` (Code-Level Documentation)
```bash
cd pipeline/dist
./biergarten-pipeline --model models/qwen2.5-0.5b-instruct-q4_k_m.gguf
```
| File | Purpose | Time |
| -------------------------------------------------- | -------------------------------------- | ------ |
| [QUICK-START.md](./QUICK-START.md) | Run in 5 minutes + learn basic changes | 15 min |
| [CODE-READING-GUIDE.md](./CODE-READING-GUIDE.md) | How to read the source code | 30 min |
| [includes/generator.h](./includes/generator.h) | Generator class interface | 5 min |
| [includes/json_loader.h](./includes/json_loader.h) | JSON loader interface | 5 min |
| [includes/database.h](./includes/database.h) | Database interface | 5 min |
| [src/main.cpp](./src/main.cpp) | Pipeline orchestration | 10 min |
| [src/generator.cpp](./src/generator.cpp) | Brewery name generation | 5 min |
| [src/json_loader.cpp](./src/json_loader.cpp) | Threading and JSON parsing | 15 min |
| [src/database.cpp](./src/database.cpp) | SQLite operations | 10 min |
### Option B: Default model path
### In `/docs/` (System-Level Documentation)
If you want to use default startup behavior, place a model at:
| File | Purpose | Time |
| ------------------------------------------------------ | ---------------------------------- | ------ |
| [pipeline-guide.md](./pipeline-guide.md) | Complete system guide | 30 min |
| [pipeline-architecture.md](./pipeline-architecture.md) | Design decisions and rationale | 30 min |
| [getting-started.md](./getting-started.md) | Original getting started (general) | 10 min |
| [architecture.md](./architecture.md) | General app architecture | 20 min |
- `pipeline/dist/models/llama-2-7b-chat.gguf`
---
Then run:
## Learning Paths by Role
```bash
cd pipeline/dist
./biergarten-pipeline
```
### 👨‍💻 Software Engineer (New to Project)
## Output Files
**Goal**: Understand codebase, make modifications
The pipeline writes output to:
**Path** (1.5 hours):
- `pipeline/dist/output/breweries.json`
- `pipeline/dist/output/beer-styles.json`
- `pipeline/dist/output/beer-posts.json`
1. [QUICK-START.md](./QUICK-START.md) (15 min)
2. [CODE-READING-GUIDE.md](./CODE-READING-GUIDE.md) (30 min)
3. Do Modifications #1 and #3 (15 min)
4. Read [../docs/pipeline-guide.md](../docs/pipeline-guide.md) Components section (20 min)
5. Start exploring code + inline comments (variable)
## Clean Re-run Process
---
If you want to redo from a clean dist state:
### 🏗️ System Architect
```bash
rm -rf pipeline/dist
cmake -S pipeline -B pipeline/dist
cmake --build pipeline/dist -j4
cd pipeline/dist
mkdir -p models
~/.local/bin/hf download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q4_k_m.gguf --local-dir models
./biergarten-pipeline --model models/qwen2.5-0.5b-instruct-q4_k_m.gguf
```
**Goal**: Understand design decisions, future roadmap
## Troubleshooting
**Path** (2 hours):
### `zsh: command not found: huggingface-cli`
1. [../docs/pipeline-guide.md](../docs/pipeline-guide.md) - Overview (30 min)
2. [../docs/pipeline-architecture.md](../docs/pipeline-architecture.md) - Full design (30 min)
3. Review [CODE-READING-GUIDE.md](./CODE-READING-GUIDE.md) - Code Patterns section (15 min)
4. Plan enhancements based on "Future Opportunities" (variable)
The command installed by `huggingface_hub` is `hf`, not `huggingface-cli`.
---
Use:
### 📊 Data Engineer
```bash
~/.local/bin/hf --help
```
**Goal**: Understand data flow, optimization
### `Model file not found ...`
**Path** (1 hour):
- Confirm you are running from `pipeline/dist`.
- Confirm the file path passed to `--model` exists.
- If not using `--model`, ensure the default file exists at `models/llama-2-7b-chat.gguf` relative to current working directory.
1. [../docs/pipeline-guide.md](../docs/pipeline-guide.md) - System Overview (30 min)
2. [../docs/pipeline-architecture.md](../docs/pipeline-architecture.md) - Performance section (20 min)
3. Review [src/json_loader.cpp](./src/json_loader.cpp) - Threading section (10 min)
### CMake cache/path mismatch
---
Use explicit source/build paths:
### 👀 Code Reviewer
```bash
cmake -S /absolute/path/to/pipeline -B /absolute/path/to/pipeline/dist
cmake --build /absolute/path/to/pipeline/dist -j4
```
**Goal**: Review changes, ensure quality
**Path** (30 minutes):
1. [CODE-READING-GUIDE.md](./CODE-READING-GUIDE.md) - Code Patterns section (10 min)
2. [../docs/pipeline-architecture.md](../docs/pipeline-architecture.md) - Design Patterns (10 min)
3. Reference header files for API contracts (10 min)
---
## Quick Reference
### Key Files
**Entry Point**: [src/main.cpp](./src/main.cpp)
- Shows complete 5-step pipeline
- ~50 lines, easy to understand
**Threading Logic**: [src/json_loader.cpp](./src/json_loader.cpp)
- Nested multithreading example
- 180 lines with extensive comments
- Learn parallel programming patterns
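
As a rough illustration of the parallel pattern mentioned above, the sketch below splits a list of records into contiguous chunks and gives each chunk to its own `std::thread`. It shows only a single level of fan-out, not the nested scheme in `src/json_loader.cpp`. The names `parse_record` and `parse_in_parallel` are hypothetical and do not come from the pipeline source, and the "parsing" is a placeholder.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for real JSON parsing: returns the record length.
std::size_t parse_record(const std::string& raw) { return raw.size(); }

// Splits `records` into contiguous chunks and parses each chunk on its own
// thread. Every thread writes only to its own index range of `results`,
// so no mutex is needed here.
std::vector<std::size_t> parse_in_parallel(const std::vector<std::string>& records,
                                           std::size_t worker_count) {
    assert(worker_count > 0);
    std::vector<std::size_t> results(records.size());
    // Ceiling division so that every record lands in exactly one chunk.
    const std::size_t chunk = (records.size() + worker_count - 1) / worker_count;

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < records.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, records.size());
        workers.emplace_back([&records, &results, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                results[i] = parse_record(records[i]);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();  // wait for every chunk before returning
    }
    return results;
}
```

Giving each thread a disjoint output range avoids locking entirely. A shared sink such as a database needs the mutex pattern instead.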
**Database Design**: [src/database.cpp](./src/database.cpp)
- Thread-safe SQLite wrapper
- Prepared statements example
- Mutex protection pattern
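
The mutex protection pattern can be sketched without SQLite. The example below swaps in an in-memory `std::vector` for the database so that it compiles without linking `sqlite3`. `BreweryStore` and `insert_concurrently` are hypothetical names, not the interface in `includes/database.h`. The point is only that every access to the shared state goes through one `std::mutex`.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a database wrapper: all access to `rows_`
// is serialized through `mutex_`.
class BreweryStore {
public:
    void insert(const std::string& name) {
        std::lock_guard<std::mutex> lock(mutex_);  // released at scope exit
        rows_.push_back(name);
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return rows_.size();
    }

private:
    mutable std::mutex mutex_;  // mutable so const readers can lock too
    std::vector<std::string> rows_;
};

// Starts `writers` threads that each insert `rows_per_writer` rows, then
// returns the final row count once all threads have finished.
std::size_t insert_concurrently(BreweryStore& store, int writers, int rows_per_writer) {
    assert(writers > 0 && rows_per_writer >= 0);
    std::vector<std::thread> threads;
    for (int w = 0; w < writers; ++w) {
        threads.emplace_back([&store, w, rows_per_writer] {
            for (int i = 0; i < rows_per_writer; ++i) {
                store.insert("brewery-" + std::to_string(w) + "-" + std::to_string(i));
            }
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return store.size();
}
```

With the lock in place, four writers inserting 250 rows each always end with 1000 rows. Without it, concurrent `push_back` calls would be a data race.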
**Generation Logic**: [src/generator.cpp](./src/generator.cpp)
- Deterministic hashing algorithm
- Template-based generation
- Only 40 lines, easy to modify
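
To make "deterministic hashing" and "template-based generation" concrete, here is a minimal sketch of the idea. It assumes an FNV-1a hash and three made-up templates. Neither is confirmed to match `src/generator.cpp`, and `fnv1a` and `make_brewery_name` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a. Unlike std::hash, its output is fixed by the algorithm,
// so the same city maps to the same value on every run and platform.
std::uint64_t fnv1a(const std::string& text) {
    std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : text) {
        hash ^= c;
        hash *= 1099511628211ULL;  // FNV prime
    }
    return hash;
}

// Picks a template from the city's hash and substitutes the city name.
// The templates are illustrative assumptions only.
std::string make_brewery_name(const std::string& city) {
    assert(!city.empty());
    static const std::vector<std::string> templates = {
        "{city} Brewing Company",
        "{city} Craft Brewery",
        "Old {city} Ale House",
    };
    const std::string placeholder = "{city}";

    std::string name = templates[fnv1a(city) % templates.size()];
    name.replace(name.find(placeholder), placeholder.size(), city);
    return name;
}
```

Because the template index depends only on the city string, repeated runs produce identical names, which keeps generated data reproducible.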
---
## Common Questions - Quick Answers
**Q: How do I run the pipeline?**
A: [QUICK-START.md](./QUICK-START.md) - 5-minute setup
**Q: How does the code work?**
A: [CODE-READING-GUIDE.md](./CODE-READING-GUIDE.md) - Explained with examples
**Q: What is the full system architecture?**
A: [../docs/pipeline-guide.md](../docs/pipeline-guide.md) - Complete overview
**Q: Why was it designed this way?**
A: [../docs/pipeline-architecture.md](../docs/pipeline-architecture.md) - Design rationale
**Q: How do I modify the generator?**
A: [QUICK-START.md](./QUICK-START.md) Modification #3 - Template change example
**Q: How does threading work?**
A: [../docs/pipeline-architecture.md](../docs/pipeline-architecture.md) - Threading model section
**Q: What about future LLM integration?**
A: [../docs/pipeline-architecture.md](../docs/pipeline-architecture.md) - Design Patterns → Strategy Pattern
**Q: How do I optimize performance?**
A: [../docs/pipeline-architecture.md](../docs/pipeline-architecture.md) - Future Optimizations section
---
## Documentation Statistics
| Metric | Value |
| ---------------------------- | --------- |
| Total documentation lines | 1500+ |
| Code files with Doxygen | 5 |
| Developer guides | 2 |
| System documentation | 2 |
| ASCII diagrams | 4 |
| Code examples | 20+ |
| Learning paths | 4 |
| Estimated reading time (all) | 3-4 hours |
---
## How to Use This Index
1. **Find your role** in "Learning Paths by Role"
2. **Follow the recommended path** in order
3. **Use the file links** to jump directly to each document
4. **Reference this page** anytime you need to find something
---
## Contribution Notes
When adding to the pipeline:
1. **Update inline code comments** in modified files
2. **Update Doxygen documentation** for changed APIs
3. **Update [CODE-READING-GUIDE.md](./CODE-READING-GUIDE.md)** if reading order changes
4. **Update [../docs/pipeline-guide.md](../docs/pipeline-guide.md)** for major features
5. **Update [../docs/pipeline-architecture.md](../docs/pipeline-architecture.md)** for design changes
---
## Additional Resources
### Within This Repository
- [../../docs/architecture.md](../../docs/architecture.md) - General app architecture
- [../../docs/getting-started.md](../../docs/getting-started.md) - Project setup
- [../../README.md](../../README.md) - Project overview
### External References
- [SQLite Documentation](https://www.sqlite.org/docs.html)
- [C++ std::thread](https://en.cppreference.com/w/cpp/thread/thread)
- [nlohmann/json](https://github.com/nlohmann/json) - JSON library
- [Doxygen Documentation](https://www.doxygen.nl/)
---
## Last Updated
Documentation completed: 2024
- All code files documented with Doxygen comments
- 4 comprehensive guides created
- 4 ASCII diagrams included
- 4 learning paths defined
---
**Start with [QUICK-START.md](./QUICK-START.md) to get running in 5 minutes!** 🚀