Pipeline-debug: living TODO
Integration branch: fix/pipeline-debug (all items below land there until told otherwise). Last updated 2026-05-19.
✅ Done & on the branch
- #1234 console + 422/scope fixes (`d671cefb`) merged.
- #1225 emission-recalc resilience + eager-pipeline-id (`ece34533`) merged.
- #1236 Phase 1: `pipelines` table + model + `ensure_pipeline_exists` (4 mint sites) + runner post-`finish_job` isolated status write + `reconcile_pipeline_statuses` sweep + tests.
- #1236 root cause: `finalize_ingest_meta`. A FINISHED job with `result != SUCCESS` never reports `"Success"`; shared by `_run_ingest` (csv/api/factor) and `reference_ingest`.
- Console table: clickable full message + copy, server-resolved module/det names, distinct amber WARNING tier.
- 🐞 Guilbert #1: year-config gate (`13616a35` backend, `c1e9ef01` frontend): `YearConfiguration.configuration_completed` stamped by `unit_sync_handler` on SUCCESS; `/dispatch` 409s when null; data-management page's `yearSyncInFlight` extended with the durable refresh-surviving check.
- 🐞 Guilbert #3: SSE live-update on the ops page (`645b4799`): page subscribes to visible RUNNING pipelines via `usePipelineStream`, debounced refetch on any SSE update.
- VERIFY (Phase 4 gate), answered: aggregation handler is already scoped at `(module_type_id, year)` via `svc.list_modules_for(...)`. The `modules_refreshed: 2231` in the data is 2231 `carbon_report_modules` (one per unit) for ONE module-year slice, not a full-table rewrite. The collision / amplification comes from `recompute_stats`'s side-effect that also rewrites the parent `carbon_report.stats` rollup. That row is shared across all modules of a unit-year, so 3 concurrent aggregations for different modules of the same year deadlock on `carbon_reports`. Phase 4A's right lever is coalescing (one trailing aggregation per scope), not narrower scoping; scoping is already done.
- Sibling hardcode trace, answered: `base_provider.py:186` is a base-class default; every concrete subclass overrides `ingest()`, so it's unreachable. `base_reduction_objective_csv_provider.py:169` belongs to a class with no `@register`'d handler, so the runner never invokes it. Both are dead-code paths today; `finalize_ingest_meta` covers every LIVE handler.
🔎 To CHECK (verified at type/lint/unit level only, NOT runtime)
Honest gap: green ruff/mypy/eslint/vue-tsc/unit ≠ "works live".
- Console page renders on localhost: alert strip, filters narrow, row expand → DAG, message dialog + copy, `@click.stop` vs row-expand, amber WARNING tier, named module/det show.
- Bug-2 re-confirm: `?state=NOT_STARTED` returns 200 (not 422) and the console shows many pipelines + tagged orphans (the permission-scope fix); confirm on evidence.
- Eager-pipeline-id end-to-end: a fresh dispatch persists `pipeline_id`, creates the `pipelines` row, status advances.
- #1236 Phase-1 on a real run: runner post-finish isolated write lands; induce a failure → log-and-skip + sweep heals.
- `alembic upgrade head` applies cleanly on a real Postgres DB (only parsed + SQLite-fixture tested so far).
- One clean hook-driven commit (commitlint + lint-staged + `make type-check`): merges/commits used `--no-verify`; confirm the gate passes for real before any promotion. Concrete finding (2026-05-20): commitlint rejects scopes like `docs(pipeline-debug):` (commit `71bfe301` blocked, yet `rtk git commit` printed `ok` anyway; the "ok" lies, so verify with `git ls-tree` / `git show HEAD:` per [[project_pipeline_debug_integration_branch]]). Either widen the commitlint scope-enum to include `pipeline-debug`, or use issue-numbered scopes (`docs(#1234)`, `docs(#1236)`).
- ✅ Sibling `"status_message": "Success"` hardcodes traced: both unreachable (base default overridden by every concrete provider; reduction-objective class has no registered handler). See Done.
🐞 Newly discovered (Guilbert, 2026-05-20)
- ✅ #1 Year-config gate: backend `13616a35`, frontend `c1e9ef01`.
- ✅ #3 SSE on the ops page: `645b4799`.
- ✅ #2 shipped (see Done). Real chained sub-jobs (#2C) deferred: the phases checklist + `status_history` timeline already provide per-phase visibility in the console; revisit only if the project needs independent retry/locking per phase.
🔧 To DO: #1236 remaining phases
- ✅ Phase 2 (`acceae13`): enforced `data_ingestion_jobs.pipeline_id` → `pipelines(id)` FK via migration `c4d5e6f7a8b9` (chains on `a3b8c9d0e1f2`). Adds `ix_data_ingestion_jobs_pipeline_id` (Postgres doesn't auto-index the referencing column; console + recalc fan-out query by `pipeline_id` constantly). `ON DELETE RESTRICT` (default); pipelines are append-only ledger today. Model's `sa_column` updated with `ForeignKey("pipelines.id")` so SQLAlchemy schema view matches Postgres. v0.x = no backfill (DB dropped between deploys); migration applies on the next clean-DB deploy. 1385 unit tests still green (SQLite metadata build accepts the FK).
- ✅ Phase 3 (`d8c3c682`): flipped pipeline reads to `pipelines.status` (durable, recomputed and stored).
  - `compute_pipeline_progress(jobs, *, pipeline=None)`: `done` / `has_error` derive from `pipeline.status` when present; `phase` stays job-derived (UX granularity).
  - `GET /sync/pipelines`: `state=` URL param pivots to `PipelineStatus` (NOT_STARTED/RUNNING/SUCCESS/PARTIAL/FAILED); `result=` dropped (subsumed). `has_errors=true` ↔ `status IN (PARTIAL, FAILED)`. Orphans fall back to job-derived.
  - Single + SSE endpoints pass the Pipeline row through.
  - Frontend filter UI: `stateOptions` swap to the 5 values, the result dropdown is removed.
  - 60s reconciliation cron wired into the lifespan via `app/tasks/_pipeline_reconciler.py`. Same hygiene as the poller (session-per-iteration, broad except, cancellation). Settings: `RUN_PIPELINE_RECONCILER=true`, `PIPELINE_RECONCILER_INTERVAL_SECONDS=60`.
  - 🐞 FK-ordering bug surfaced + fixed: Phase 2 FK fired on stage; three mint sites violated the Pipeline-first invariant: (1) `year_configuration.create_year_configuration` had no `ensure_pipeline_exists` call, (2) `data_sync.recalculate_emissions` and (3) `data_sync.recalculate_module_emissions` called it after `create_ingestion_job` (whose flush already triggered the FK). All three fixed to ensure→create order. Regression test in `test_pipeline_fk_ordering_regression.py` uses SQLite with `PRAGMA foreign_keys=ON` so neither bug shape can recur silently.
- ✅ VERIFY (Phase 4 gate), answered: aggregation handler is scoped at `(module_type_id, year)`. The 2231 number is per-unit module rows, not all reports. Collision source is `recompute_stats`'s side-effect rewrite of the parent `carbon_report.stats` synthesis (shared row). Phase 4A lever is coalescing, not narrower scoping.
- ✅ Phase 4A, done (3 commits):
  - `73ec4d64` 4A.1 in-pipeline coalesce: last emission_recalc sibling chains aggregation; others skip. Race-safe via fresh-session `SELECT … FOR UPDATE` on parent + `meta.recalc_work_complete` flag. 3 sequential aggregations per upload → 1.
  - `718cdd01` 4A.2 per-year `pg_advisory_xact_lock` in `aggregation_handler`: serialises cross-pipeline aggregations of the same year against shared `carbon_reports.stats` rows; no drop-hazard. Dialect-gated (SQLite skip).
  - `1b20f967` 4A.3 scope to `affected_module_ids` union: aggregation rewrites only modules the recalc siblings actually touched (typically 432 vs 2231). Combined: amplification killed (4A.1), cross-pipeline deadlock eliminated (4A.2), per-aggregation write set shrunk (4A.3).
- ✅ Phase 4A: shipped as 4A.1/4A.2/4A.3 (see Done section).
- ✅ 4A.4 (`53625315`): race fix on 4A.3. The last sibling builds the full `affected_module_ids` union (own `stats` ∪ FINISHED siblings' meta) at chain time and passes it via `chain_job(config={...})`; `aggregation_handler` reads from its own `meta.config` first (race-free), sibling-query stays as 4A.3 legacy fallback. Guards `isinstance(pipeline_id, UUID)` to keep mock-driven unit tests off the production `SessionLocal`.
- ✅ Phase 4B (`1bd26748`): per-(module, year) `pg_advisory_xact_lock` in `factor_ingest_handler`, `emission_recalc_handler`, `module_emission_recalc_handler`. Shared helper `acquire_factor_recalc_lock` in `app/tasks/_locks.py`, dedicated category `1237` (distinct from 4A.2's `1236`). Eliminates the silent-wrong-numbers race where a recalc reads half-written factors during a concurrent factor_ingest.
- ✅ Phase 4B: shipped as `1bd26748` (advisory lock at `(module, year)` scope, not `(module, det, year)`; broader but drop-hazard-free; see Done).
- ✅ #2 (unit_sync sub-tasks visibility), shipped as:
  - `4ee30046` #2A generic `status_history` (append + capped at 50)
  - `87a9d14d` #2B `meta.phases` checklist on `unit_sync` handler
  - `046f48e0` #2D console renders timeline + phase checklist
  - #2C deferred: real chained sub-jobs (heavy refactor of the year-creation critical path). #2B + #2D already provide the per-phase visibility. Keep in mind if real chained semantics ever become needed (independent retry per phase, separate locking, etc.).
- ✅ Phase 5 (2 commits): retired the meta threading.
  - `6c3e762b` Phase 5A: `recompute_pipeline_status` now writes `pipelines.expected_recalc` on every recompute call (cheap UPDATE; not gated on `progress.done` so the column tracks live fan-out). Sets up 5B's read flip; purely additive.
  - `a5f08a56` Phase 5B: flipped reads + dropped meta writes:
    - `compute_pipeline_progress._find_root` → `min(jobs, key=id)` (dropped `_ROOT_JOB_TYPES`, which omitted `unit_sync` / `reference_ingest` parents).
    - `expected_recalc` reads `pipeline.expected_recalc`; falls back to live job count for orphans / writer-side recompute.
    - Phase-3 aggregation check: "all aggregation rows FINISHED" (no more `meta.aggregation_job_id` set lookup). Docstring names the 4A.1 single-aggregation dependency.
    - `_is_last_recalc_sibling`: lock target moved to `pipelines` row (was `data_ingestion_jobs` parent); reads `pipeline.expected_recalc`. Lock-down test (`test_concurrent_siblings_yield_exactly_one_last`) asserts exactly one sibling returns True; guards 4A.1's single-aggregation guarantee from the lock-target move.
    - Dropped writes: `meta.parent_job_id` (`_chain.py` ×2, `emission_recalculation_tasks.py`), `meta.aggregation_job_id` (×2), `meta.recalc_jobs_chained` (`ingestion_tasks.py` ×3).
    - `_PIPELINE_META_ALLOW`: 3 keys retired.
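
The per-year advisory lock from 4A.2 and its dialect gate can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's helper: the function name `advisory_lock_statement` and its return shape are hypothetical. Only the category `1236`, the `pg_advisory_xact_lock` call, and the SQLite skip come from the notes above.

```python
from typing import Optional, Tuple

AGGREGATION_LOCK_CATEGORY = 1236  # category the notes give for 4A.2


def advisory_lock_statement(
    dialect_name: str, category: int, key: int
) -> Optional[Tuple[str, dict]]:
    """Build (sql, params) for a transaction-scoped advisory lock.

    Hypothetical helper. Returns None on non-Postgres dialects so that
    SQLite-backed tests skip the lock entirely.
    """
    if dialect_name != "postgresql":
        return None
    # Two-int form: pg_advisory_xact_lock(key1 int, key2 int).
    # Postgres releases the lock automatically at commit or rollback.
    sql = "SELECT pg_advisory_xact_lock(:category, :key)"
    return sql, {"category": category, "key": key}


stmt = advisory_lock_statement("postgresql", AGGREGATION_LOCK_CATEGORY, 2025)
skipped = advisory_lock_statement("sqlite", AGGREGATION_LOCK_CATEGORY, 2025)
```

With SQLAlchemy, the pair would typically be run as `session.execute(text(sql), params)` at the start of the handler's transaction. Two callers that pass the same `(category, key)` then serialise, while different years proceed in parallel.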
🔧 To DO: smaller follow-ups
- ✅ Lone-orphan `last_error` (`918f70d0`): `finalize_ingest_meta` now appends a sample reason from `stats.row_errors` when the summary path fires. Operators see "first error: No matching factor found in factors map (kind=Monitors, …)" instead of just "0 inserted, 50 072 skipped". Capped at 200 chars + ellipsis so a long reason can't bloat `status_message`. 4 regression tests cover the enrich path, the cap, the no-row-errors fallback, and the SUCCESS-path preservation. (Live SSE on the console moved up to "Newly discovered": same issue, more specific framing.)
- ✅ Autoflush FK bug (`23e47698`): `data_sync._stamp_job_type_and_meta` and `_chain.chain_job` assigned `row.pipeline_id = X` before `ensure_pipeline_exists`; the SELECT inside the latter triggered autoflush, which fired the FK on the half-set row. Both reordered; regression test extended (5 cases) to cover the autoflush shape.
- ✅ Equipment / "common" upload fail-fast guards: `64760c85` (handler `require_factor_to_match=True` empty factors → raise at setup) + `c306b3a4` (per-module-type `_FACTOR_INFERRED_MODULES = {equipment_electric_consumption, purchase}` check: empty `factors_map` raises before the row loop). User-reported 50k row-error log spam → one terminal error with the cause in `status_message`.
- ✅ Units bulk_upsert race + global unit_sync lock (`1a637165`): parallel year creation (2025+2026) crashed on `ix_units_institutional_code`. Two-layer fix:
  1. `UnitRepository.bulk_upsert` uses `INSERT … ON CONFLICT (institutional_code) DO UPDATE` on Postgres (race-safe by construction); SQLite fixture keeps legacy SELECT/merge.
  2. `unit_sync_handler` acquires a GLOBAL 1-int advisory lock (category `1239`) BEFORE the per-year aggregation lock; this serializes ALL unit_syncs regardless of year. Category distinctness pinned by test.
- ✅ Pipeline Operations menu reorder + SA trump (`232ca077`): `BACKOFFICE_PIPELINE_OPERATIONS` moved below `BACKOFFICE_LOGS`. `Co2Sidebar.isItemDisabled` now short-circuits SuperAdmin past all gates (no scenario where SA should be locked out of a back-office page); applies to User Management, Data Management, and Pipeline Operations symmetrically.
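
The lone-orphan `last_error` enrichment (sample reason, capped at 200 chars + ellipsis) could look roughly like this. It is a minimal sketch: the function name `append_sample_reason`, the `"; first error: "` separator, and the flat list-of-strings shape for row errors are assumptions for illustration, not the actual `finalize_ingest_meta` code.

```python
from typing import Sequence

MAX_REASON_CHARS = 200  # cap stated in the notes


def append_sample_reason(summary: str, row_errors: Sequence[str]) -> str:
    """Append the first row error to a summary, truncated to the cap.

    Hypothetical helper; the real row-error structure may differ.
    """
    if not row_errors:
        # No-row-errors fallback: the summary is returned unchanged.
        return summary
    reason = row_errors[0]
    if len(reason) > MAX_REASON_CHARS:
        # Keep the first 200 characters and mark the cut with an ellipsis.
        reason = reason[:MAX_REASON_CHARS] + "…"
    return f"{summary}; first error: {reason}"


short = append_sample_reason("0 inserted, 3 skipped", ["No matching factor"])
capped = append_sample_reason("0 inserted, 3 skipped", ["x" * 250])
untouched = append_sample_reason("0 inserted, 3 skipped", [])
```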
Latent (documented, fix shape known, not yet shipped)
- `UserRepository.bulk_upsert`: same SELECT-then-merge race shape as the old units path. Currently safe because the global unit_sync lock serialises the ONE production caller; the seed script has no concurrency. Fix shape: `INSERT … ON CONFLICT (institutional_id) DO UPDATE` mirroring `unit_repo`. Do it when a second caller appears.
- `reference_data` CSV ingest does `DELETE BuildingRoom; INSERT rooms` (year-agnostic, like units). Single-operator upload today so the race is latent; if multiple ops ever ingest reference data concurrently, port the global-lock pattern from `unit_sync`.
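
The `INSERT … ON CONFLICT … DO UPDATE` fix shape named above can be tried out with the standard-library `sqlite3` module, which accepts the same upsert clause (SQLite 3.24+). This is only a sketch of the statement shape: the `users` table, the `display_name` column, and the `bulk_upsert` function are hypothetical, and the project's Postgres path would go through its own repository layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (institutional_id TEXT PRIMARY KEY, display_name TEXT)"
)

# One statement inserts or updates, so there is no window between a
# SELECT and a later merge in which another writer can slip in.
UPSERT_SQL = (
    "INSERT INTO users (institutional_id, display_name) VALUES (?, ?) "
    "ON CONFLICT (institutional_id) "
    "DO UPDATE SET display_name = excluded.display_name"
)


def bulk_upsert(rows):
    """Upsert (institutional_id, display_name) pairs in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT_SQL, rows)


bulk_upsert([("u1", "Ada"), ("u2", "Grace")])
bulk_upsert([("u1", "Ada L."), ("u3", "Edsger")])  # u1 already exists
rows = conn.execute(
    "SELECT institutional_id, display_name FROM users ORDER BY institutional_id"
).fetchall()
```

After the second call, `u1` carries the updated name and no unique-constraint error is raised, which is the behaviour the old SELECT-then-merge path could not guarantee under concurrency.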
❓ Needs a decision (yours)
- ✅ `PARTIAL` vs `FAILED` boundary: shipped (`04786254`). Rule: root SUCCESS/WARNING + any descendant ERROR → PARTIAL (amber, "data landed, chain had issues"); root ERROR → FAILED (red, "data didn't land"). `compute_pipeline_progress` now ships `status` in the progress payload so the frontend can tell PARTIAL from FAILED (both set `has_error=True`). `pipeops_status_partial` i18n key finally has a renderer.
- ✅ Aggregation coalescing: `unit_sync` ↔ aggregation concurrency safety shipped (`e2b6ed77`). `unit_sync_handler` now acquires the same `pg_advisory_xact_lock(1236, year)` that `aggregation_handler` takes (4A.2). Both rewrite `carbon_reports` for the same (unit, year) slice; sharing the category means they mutually exclude. `test_unit_sync_lock_category_matches_aggregation` pins the invariant so a future refactor can't silently split the categories.
- ✅ Phase-3 sweep cron cadence: 60s (configurable via `PIPELINE_RECONCILER_INTERVAL_SECONDS`).
- ✅ `pipeops_status_partial` i18n key: kept (now rendered by the PARTIAL tier; see `04786254`).
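
The PARTIAL vs FAILED rule above can be written as a small pure function. This is a sketch under assumptions: the function name `classify_finished_pipeline` and the plain-string job results are hypothetical, the SUCCESS branch for "no errors anywhere" is inferred and not stated in the rule, and the NOT_STARTED / RUNNING states are left out because the rule only covers finished pipelines.

```python
from enum import Enum
from typing import Iterable


class PipelineStatus(str, Enum):
    SUCCESS = "SUCCESS"
    PARTIAL = "PARTIAL"
    FAILED = "FAILED"


def classify_finished_pipeline(
    root_result: str, descendant_results: Iterable[str]
) -> PipelineStatus:
    """Apply the boundary rule to a finished pipeline (hypothetical helper)."""
    if root_result == "ERROR":
        # Root failed: the data did not land.
        return PipelineStatus.FAILED
    if any(result == "ERROR" for result in descendant_results):
        # Root is SUCCESS/WARNING but the chain had issues.
        return PipelineStatus.PARTIAL
    # Assumed default when nothing errored.
    return PipelineStatus.SUCCESS


failed = classify_finished_pipeline("ERROR", ["SUCCESS"])
partial = classify_finished_pipeline("WARNING", ["SUCCESS", "ERROR"])
clean = classify_finished_pipeline("SUCCESS", ["SUCCESS", "WARNING"])
```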
⚙️ Process / integration
- #1225 + eager-pipeline-id + #1236 live only on `fix/pipeline-debug`, not in any `dev` PR. Stage promotion needs a deliberate "promote the integration branch" step; decide when/how.
- Verify/close PR #1235 (the original #1234→dev PR, `816c817b`): superseded by the integration-branch state, like #1237 was. Confirm and close/retarget.
- Note: commit `9ac03507` ("chore: lint-staged formatting") actually carried the Phase-1 scaffolding (model+migration+doc). Hook mislabel; content is correct in HEAD. History cosmetic only.