fix: address critical correctness, reliability, and design issues in pipeline

CORRECTNESS FIXES:
- json_loader: Add RollbackTransaction() and call it on exception instead of
  CommitTransaction(). Prevents partial data corruption on parse/disk errors.
- wikipedia_service: Fix invalid MediaWiki API parameter explaintext=true ->
  explaintext=1. Now returns plain text instead of HTML markup in contexts.
- helpers: Fix ParseTwoLineResponse filter to only remove known thinking tags
  (<think>, <reasoning>, <reflect>) instead of any <...> pattern. Prevents
  silently removing legitimate output like <username>content</username>.

RELIABILITY & DESIGN IMPROVEMENTS:
- load/main: Make n_ctx (context window size) configurable via --n-ctx flag
  (default 2048, range 1-32768) to support larger models like Qwen3-14B.
- generate_brewery: Prevent retry prompt growth by extracting location context
  into constant and using compact retry format (error + schema + location only).
  Avoids token truncation on final retry attempts.
- database: Fix data representativeness by changing QueryCities from
  ORDER BY name (alphabetic bias) to ORDER BY RANDOM() for unbiased sampling.
  Convert all SQLITE_STATIC to SQLITE_TRANSIENT to prevent use-after-free risks.

POLISH:
- infer: Advance sampling seed between generation calls to improve diversity
  across brewery and user generation.
- data_downloader: Remove unnecessary commit hash truncation; use full hash.
- json_loader: Fix misleading log message from "RapidJSON" to "Boost.JSON".
This commit is contained in:
Aaron Po
2026-04-03 11:58:00 -04:00
parent 8d306bf691
commit e4e16a5084
14 changed files with 202 additions and 121 deletions

View File

@@ -11,7 +11,7 @@ std::string WikipediaService::FetchExtract(std::string_view query) {
const std::string encoded = client_->UrlEncode(std::string(query));
const std::string url =
"https://en.wikipedia.org/w/api.php?action=query&titles=" + encoded +
"&prop=extracts&explaintext=true&format=json";
"&prop=extracts&explaintext=1&format=json";
const std::string body = client_->Get(url);
@@ -19,16 +19,27 @@ std::string WikipediaService::FetchExtract(std::string_view query) {
boost::json::value doc = boost::json::parse(body, ec);
if (!ec && doc.is_object()) {
auto& pages = doc.at("query").at("pages").get_object();
if (!pages.empty()) {
auto& page = pages.begin()->value().get_object();
if (page.contains("extract") && page.at("extract").is_string()) {
std::string extract(page.at("extract").as_string().c_str());
spdlog::debug("WikipediaService fetched {} chars for '{}'",
extract.size(), query);
return extract;
try {
auto& pages = doc.at("query").at("pages").get_object();
if (!pages.empty()) {
auto& page = pages.begin()->value().get_object();
if (page.contains("extract") && page.at("extract").is_string()) {
std::string extract(page.at("extract").as_string().c_str());
spdlog::debug("WikipediaService fetched {} chars for '{}'",
extract.size(), query);
return extract;
}
}
} catch (const std::exception& e) {
spdlog::warn(
"WikipediaService: failed to parse response structure for '{}': "
"{}",
query, e.what());
return {};
}
} else if (ec) {
spdlog::warn("WikipediaService: JSON parse error for '{}': {}", query,
ec.message());
}
return {};