Commit Graph

36 Commits

Author SHA1 Message Date
Aaron Po
8c572a2d07 fix: stabilize Gemma 4 brewery generation
remove misleading turn-token output guidance from the brewery prompt
extract the last balanced JSON object before validation
keep README model setup and run instructions aligned
preserve Gemma 4 sampling defaults and local model usage
2026-04-10 22:25:26 -04:00
Aaron Po
902bda6eb9 eat: make Gemma 4 the default model, enable thinking mode 2026-04-10 21:43:18 -04:00
Aaron Po
61d5077a95 update readme 2026-04-10 00:03:45 -04:00
Aaron Po
7ca651a886 updates for gemma-4-E4B-it-Q6_K.gguf 2026-04-09 23:59:38 -04:00
Aaron Po
b53f9e5582 fix: llama backend lifetime, Wikipedia enrichment depth, and misc cleanup 2026-04-09 21:59:46 -04:00
Aaron Po
824f5b2b4f Refactor BiergartenDataGenerator to use dependency injection container 2026-04-09 20:46:20 -04:00
Aaron Po
5d93d76e99 Refactor data generator constructor and update web client handling; enhance README with detailed pipeline overview and class diagram 2026-04-09 18:19:12 -04:00
Aaron Po
028786b8b5 updates 2026-04-09 17:26:49 -04:00
Aaron Po
d7a31b5264 Create one method per file 2026-04-09 17:19:04 -04:00
Aaron Po
b31be494d7 Update documentation 2026-04-08 22:24:23 -04:00
Aaron Po
7807f0bc2a Add beer styles json 2026-04-08 21:26:35 -04:00
Aaron Po
772ef0cdfb Update CMakeLists.txt 2026-04-08 21:25:11 -04:00
Aaron Po
a6e2ea21d0 fix include 2026-04-08 21:24:59 -04:00
Aaron Po
a7cbf7507f fix location.h 2026-04-08 21:07:28 -04:00
Aaron Po
3c7e74e3c1 update readme 2026-04-08 11:27:37 -04:00
Aaron Po
b1ac3a6068 fix: remove outdated data source information from help message 2026-04-07 18:02:21 -04:00
Aaron Po
06d329cac5 refactor 2026-04-07 17:55:15 -04:00
Aaron Po
54c403526b fix: improve error handling and logging in data generation pipeline 2026-04-07 13:36:59 -04:00
Aaron Po
b8e96a6d45 replace SQLite geo pipeline with curated in-memory locations 2026-04-07 02:28:15 -04:00
Aaron Po
60ee2ecf74 add prompts 2026-04-03 15:53:04 -04:00
Aaron Po
e4e16a5084 fix: address critical correctness, reliability, and design issues in pipeline
CORRECTNESS FIXES:
- json_loader: Add RollbackTransaction() and call it on exception instead of
  CommitTransaction(). Prevents partial data corruption on parse/disk errors.
- wikipedia_service: Fix invalid MediaWiki API parameter explaintext=true ->
  explaintext=1. Now returns plain text instead of HTML markup in contexts.
- helpers: Fix ParseTwoLineResponse filter to only remove known thinking tags
  (<think>, <reasoning>, <reflect>) instead of any <...> pattern. Prevents
  silently removing legitimate output like <username>content</username>.

RELIABILITY & DESIGN IMPROVEMENTS:
- load/main: Make n_ctx (context window size) configurable via --n-ctx flag
  (default 2048, range 1-32768) to support larger models like Qwen3-14B.
- generate_brewery: Prevent retry prompt growth by extracting location context
  into constant and using compact retry format (error + schema + location only).
  Avoids token truncation on final retry attempts.
- database: Fix data representativeness by changing QueryCities from
  ORDER BY name (alphabetic bias) to ORDER BY RANDOM() for unbiased sampling.
  Convert all SQLITE_STATIC to SQLITE_TRANSIENT to prevent use-after-free risks.

POLISH:
- infer: Advance sampling seed between generation calls to improve diversity
  across brewery and user generation.
- data_downloader: Remove unnecessary commit hash truncation; use full hash.
- json_loader: Fix misleading log message from "RapidJSON" to "Boost.JSON".
2026-04-03 11:58:00 -04:00
Aaron Po
8d306bf691 Update documentation for llama 2026-04-02 23:24:06 -04:00
Aaron Po
077f6ab4ae edit prompt 2026-04-02 22:56:18 -04:00
Aaron Po
534403734a Refactor BiergartenDataGenerator and LlamaGenerator 2026-04-02 22:46:00 -04:00
Aaron Po
3af053f0eb format codebase 2026-04-02 21:46:46 -04:00
Aaron Po
ba165d8aa7 Separate llama generator class src file into method files 2026-04-02 21:37:46 -04:00
Aaron Po
eb9a2767b4 Refactor web client interface and related components 2026-04-02 18:55:58 -04:00
Aaron Po
29ea47fdb6 update cli arg handling 2026-04-02 18:41:25 -04:00
Aaron Po
52e2333304 Reorganize directory structure 2026-04-02 18:27:01 -04:00
Aaron Po
a1f0ca5b20 Refactor DataDownloader and CURLWebClient: update constructor and modify FileExists method signature 2026-04-02 18:06:40 -04:00
Aaron Po
2ea8aa52b4 update readme and add clangformat and clang tidy 2026-04-02 17:12:22 -04:00
Aaron Po
98083ab40c Pipeline: add CURL/WebClient & Wikipedia service
Introduce a pluggable web client interface and concrete CURL implementation: adds IWebClient, CURLWebClient, and CurlGlobalState (headers + curl_web_client.cpp). DataDownloader now accepts an IWebClient and delegates downloads. Add WikipediaService for cached Wikipedia summary lookups. Refactor SqliteDatabase to return full City records and update consumers accordingly. Improve JsonLoader to use batched transactions during streaming parses. Enhance LlamaGenerator with sampling options, increased token limits, JSON extraction/validation, and other parsing helpers. Modernize CMake: set policy/version, add project_options, simplify FetchContent usage (spdlog), require Boost components (program_options/json), list pipeline sources explicitly, and tweak post-build/memcheck targets. Update README to match implementation changes and new CLI/config conventions.
2026-04-02 16:29:16 -04:00
Aaron Po
ac136f7179 Enhance brewery generation: add country name parameter and improve prompt handling 2026-04-02 01:04:41 -04:00
Aaron Po
280c9c61bd Implement Llama-based brewery and user data generation; remove mock generator and related files 2026-04-01 23:29:16 -04:00
Aaron Po
248a51b35f cleanup 2026-04-01 21:35:02 -04:00
Aaron Po
35aa7bc0df Begin work on biergarten data generator pipeline 2026-04-01 21:18:45 -04:00