Plan 310-D Follow-up — Strategy B Rematch + Per-Module ITs¶
Context¶
Plan 310-D's EmissionRecalculationWorkflow.recalculate_for_data_entry_type (landed in PR #1027) rematches only the JSON-link modules — modules whose primary_factor_id lives on data_entries.data->primary_factor_id. The FK-link modules, whose factor link lives on data_entry_emissions.primary_factor_id, are skipped entirely by the existing gate handler.kind_field in entry.data.
Modules affected (see the rematch table in emission-pipeline-flow.md):
- Travel (plane / train) — factor resolved via
FactorQueryfrompre_compute-derivedhaul_category/country_code. One emission per entry. - Headcount (member / student) —
FactorQueryreturns N sub-factors per data_entry; one emission row per sub-factor. - Building Embodied Energy — confirmed FK-link via PR #1042's per-module integration test; the
building_name × categoryfactor key drives a 1:N emission shape.
building_room was originally listed here as FK-link but PR #1042's audit confirmed it carries primary_factor_id on entry.data with kind_field='building_name' / subkind_field='room_type' — the existing JSON-link bulk-prefetch path covers it. The 1:N emission fan-out (one row per room_type) was the novel coverage that justified its IT living next to the FK-link tests in test_strategy_b_rematch_pg.py, but the link itself is JSON.
Goal¶
Add a Strategy B rematch path that:
- Walks
data_entry_emissionsfor the slice(data_entry_type_id, year). - For each emission row, looks up the new factor by the OLD factor's classification dict.
- Updates the row's
primary_factor_idand recomputeskg_co2eq. - On dict miss, applies the same strict-drop rule from PR #1027: clear
primary_factor_id, setkg_co2eq = None.
Same year-strict and kind→subkind→kind-only fallback rules as the JSON-link path.
Approach¶
# Inside EmissionRecalculationWorkflow.recalculate_for_data_entry_type,
# after the existing JSON-link branch handles its own entries:
is_strategy_b = (
handler.kind_field is not None
and not any(handler.kind_field in e.data for e in entries)
)
if is_strategy_b:
# 1. Bulk-fetch factors (same query as Strategy A).
factors = await factor_repo.list_by_data_entry_type(det, year)
factor_lookup = build_factor_lookup(factors, handler)
# 2. Walk emission rows for the slice.
emission_rows = await emission_repo.list_by_data_entry_type_and_year(
data_entry_type_id, year
)
for row in emission_rows:
old_factor = await factor_repo.get(row.primary_factor_id)
new_id = lookup_via_classification(
old_factor.classification, handler, factor_lookup
)
# Strict-drop: miss → clear FK + recompute None.
await emission_repo.update_factor_and_recompute(row, new_id)
The update_factor_and_recompute step is an open question — see "Open questions" below.
Open questions¶
- Old-classification lookup vs. re-derive via
pre_compute. - Old-classification: simpler, single bulk fetch, no
Locationqueries. Correct as long as the entry's underlying data hasn't changed since the last compute. - Re-derive: re-runs
pre_computeper entry (DB-heavy for travel — fetches origin/destinationLocationagain). More correct if entries can mutate after creation, but pricier.
Recommend: old-classification for Plan 310-D's scope. Re-derive moves to a future "rematch on entry edit" workflow.
-
Should we touch
entry.dataat all for Strategy B? Today the Strategy A path storesprimary_factor_idonentry.datafor forward-compat. Travel today does NOT. Keep the asymmetry — don't start storing on travel/headcount/rooms entries. -
Building Embodied Energy verification. Inspect
backend/app/workflows/embodied_energy.pyand confirm 1:N emission shape before assuming it lands in this PR. If it's actually 1:1 with the FK ondata_entry_emissionsonly (no JSON link), it still belongs to this Strategy B path.
Per-module integration tests¶
One PG-backed IT per (module, data_entry_type). Each test:
- Seed:
Unit→CarbonReport(year=…)→CarbonReportModule→Factor(...)→DataEntry(...)→ trigger initialDataEntryEmissioncompute (viaDataEntryEmissionService.upsert_by_data_entryor the POST-create handler workflow). - Trigger a "factor changed" event by either:
- Calling
factor_repo.upsert_factors([new_factor])directly with a differentvaluespayload. - OR posting to
/v1/sync/dispatchwith a v2 factor CSV (mirrorstest_plan_310b_factor_reupload_endpoint_pg.py). - Run
EmissionRecalculationWorkflow.recalculate_for_data_entry_type. - Assert
data_entry_emissions.kg_co2eqreflects the new factor (orNoneif the factor was removed in step 2).
Coverage matrix (one IT per row):
| Module | data_entry_type | Link kind | Notes |
|---|---|---|---|
| equipment_electric_consumption | it | JSON | Already covered by Plan 310-B PG tests |
| equipment_electric_consumption | scientific | JSON | new IT |
| equipment_electric_consumption | other | JSON | new IT |
| purchase | purchase_common | JSON | new IT |
| purchase | purchase_additional | JSON | new IT |
| external_cloud_and_ai | external_cloud | JSON | new IT |
| external_cloud_and_ai | external_ai | JSON | new IT |
| process_emissions | process_emission | JSON | new IT |
| buildings (energy_combustion) | building_energy_combustion | JSON | new IT |
| buildings (rooms) | building_room | JSON | 1:N emissions (1 per room_type); link is JSON, fan-out shape was the novel bit |
| buildings (embodied) | building_embodied_energy | FK | new IT, strategy-B path |
| professional_travel | plane | FK | new IT, strategy-B path |
| professional_travel | train | FK | new IT, strategy-B path |
| headcount | member | FK | new IT, strategy-B path |
| headcount | student | FK | new IT, strategy-B path |
PR #1042 shipped 16 ITs total (split into two files for module-family locality): 9 in test_strategy_a_rematch_pg.py (the JSON-link rows above plus a strict-drop variant that lives next to its sibling tests in the same file) and 7 in test_strategy_b_rematch_pg.py (the FK-link rows plus a strict-drop variant for travel). The building_room row moved into the Strategy B file even though the link is JSON — the 1:N fan-out shape was the novel coverage and lives next to embodied energy for that reason. Inline _seed_unit_and_module / _initial_compute helpers were left duplicated rather than promoted to a conftest.py shared fixture; ~30 lines of duplication across the two files was judged not worth a dedicated commit cycle.
Out of scope¶
- Changes to
ModuleHandlerService.resolve_primary_factor_idor individual module handlers (rematch reuses existing factor-lookup semantics — it does not redefine them). - Concurrency / locking improvements during rematch (the existing
claim_job/is_currentmechanism from Plan 310-A is reused). - The
aggregationjob (Plan 310-D's bulk-path-pure-async work) — that's a separate Plan-D-Tier-2 PR.
Critical files¶
| File | Change |
|---|---|
backend/app/workflows/emission_recalculation.py | Add Strategy B branch in recalculate_for_data_entry_type |
backend/app/repositories/data_entry_emission_repo.py | Add list_by_data_entry_type_and_year(det, year) if missing |
backend/tests/integration/services/data_ingestion/test_strategy_b_rematch_pg.py | New file — 4 strategy-B per-module ITs |
backend/tests/integration/services/data_ingestion/test_strategy_a_rematch_pg.py | New file — 9 JSON-link per-module ITs (regression net for #1027) |
backend/tests/integration/services/data_ingestion/conftest.py | Add seed_module_with_emission helper fixture |
Verification¶
After implementation:
rtk uv run pytest tests/integration/services/data_ingestion/test_strategy_b_rematch_pg.py -v— all 4 Strategy B modules pass.rtk uv run pytest tests/integration/services/data_ingestion/test_strategy_a_rematch_pg.py -v— regression net catches no breakage from the new branch.rtk uv run pytest tests/unit/workflows/test_emission_recalculation.py -v— existing 10 unit tests still pass (no churn in the JSON path).- End-to-end smoke: upload a v2 travel factor CSV via
/v1/sync/dispatch, confirm a previously-computed travel emission row'skg_co2equpdates without the operator needing to delete/recreate the entry.
Related PRs
-
1027 — Plan 310-D batch rematch (JSON-link modules), strict-drop semantics. This follow-up rides on top.¶
-
1042 — Delivered: per-module IT matrix + plan-doc realignment (see below).¶
- (future) —
aggregationhandler + dedup index migration (separate Plan-D Tier-2 PR).
Delivered: PR #1042¶
The investigation phase of this PR turned up a key empirical finding that simplified the deliverable substantially. Documented here so the next reader can build on the right architecture.
What actually shipped¶
- No workflow code change.
EmissionRecalculationWorkflowis unchanged from PR #1027. - 16 PG-backed integration tests covering Plan 310-D's 14-row matrix:
backend/tests/integration/services/data_ingestion/test_strategy_b_rematch_pg.py— 7 tests for the FK-link path (plane values, plane drop, train, member, student, embodied energy, building rooms 1:N).backend/tests/integration/services/data_ingestion/test_strategy_a_rematch_pg.py— 9 tests for the JSON-link path (equipment × 3, purchase × 2, external cloud, external AI, process emissions, energy combustion).- Plan-doc realignment (this section).
Why the plan's "walk data_entry_emissions" branch was not needed¶
The plan assumed FK-link entries (travel, headcount, embodied energy) were skipped entirely by the rematch path. Empirically:
EmissionRecalculationWorkflowwalks everyDataEntryfor(det, year)and callsupsert_by_data_entryon each. The gate athandler.kind_field in entry.dataonly short-circuits theentry.data['primary_factor_id']rewrite — a JSON-link concern. The per-entryupsert_by_data_entryruns unconditionally.upsert_by_data_entry→prepare_createruns the handler'spre_compute(re-resolvinghaul_categoryfrom Locations,country_codefrom train stations, etc.) and then_fetch_factorson eachEmissionComputation._fetch_factorsindata_entry_emission_service.py:343-415already runs the live Strategy B classification query — same factor lookup the initial compute used. A factor whosevalueschanged is picked up automatically; a factor whose row was deleted produces an empty list and the per-entry strict-drop kicks in.
The only thing missing was test coverage to make this contract durable.
Strict-drop contract clarification¶
The plan's "preserve row with primary_factor_id=None, kg_co2eq=None" language is imprecise. The actual behaviour shared by both Strategy A and Strategy B is delete the entry's emission rows:
- Strategy A: bulk lookup miss →
entry.data['primary_factor_id']set toNone→resolve_computationsreturns[]→upsert_by_data_entryhits theif not prepared_emissionsbranch →delete_by_data_entry_id. - Strategy B:
_fetch_factorsreturns[](classification miss or factor row deleted) → noDataEntryEmissionrows produced → samedelete_by_data_entry_idbranch.
The dashboards surface this as "no emission for this entry", which matches the plan's intended operator-visible signal.
Module reclassification¶
The plan listed buildings__rooms (DataEntryTypeEnum.building) under FK-link. In fact BuildingRoomModuleHandler declares kind_field='building_name' and subkind_field='room_type', both on entry.data — so the existing Strategy A bulk-prefetch covers it. The test_building_room_factor_values_change_propagates_all_5_emissions test in the Strategy B file lives there because rooms emits 1:N (5 emission rows per entry, one per energy type), which is the only genuinely new shape the FK-link conversation introduced. Verifying all 5 rows scale linearly with the factor's ef_kg_co2eq_per_kwh belongs to this PR even though the lookup path itself is JSON-link.
Files touched¶
| File | Change |
|---|---|
backend/app/workflows/emission_recalculation.py | No change. Plan's Strategy B branch turned out unnecessary. |
backend/tests/integration/services/data_ingestion/test_strategy_b_rematch_pg.py | New file — 7 FK-link ITs incl. 1:N rooms regression. |
backend/tests/integration/services/data_ingestion/test_strategy_a_rematch_pg.py | New file — 9 JSON-link ITs (regression net for #1027). |
docs/src/implementation-plans/310-d-strategy-b-rematch.md | This "Delivered: PR #1042" section. |
What's still on the follow-up list¶
The plan's "Open question 1" (old-classification lookup vs. re-derive via pre_compute) collapses to "re-derive" because that's what the existing per-entry path already does. If a future "rematch on entry edit" workflow wants to skip the LocationService roundtrip on travel, it could read meta['distance_km'] from the existing DataEntryEmission row — but that's a perf optimisation, not a correctness gap.