Implementation Plan: Stop persisting computed fields to DataEntry.data
Scope decision: single PR combining all changes below. DB cleanup: not in scope — DB will be dropped.
1. Diagnosis
DataEntryRepository.get_submodule_data (backend/app/repositories/data_entry_repo.py:341) runs a select(...) returning DataEntry ORM rows tied to the active AsyncSession. After unpacking each row at line 561, the loop reassigns data_entry.data = {**data_entry.data, ...} three times (lines 587, 602, 613). SQLAlchemy's attribute instrumentation marks the row dirty.
The mutation is persisted by:
- Autoflush: the Factor lookup at line 580 (executed inside the same loop) triggers a flush of the dirty DataEntry.
- Explicit commit: write workflows (carbon_report_module.py:111) reuse the same session and flush dirty rows.
A second source of leakage: CSV/API providers stuff a kg_co2eq override directly into DataEntry.data at ingest time, also persisting it.
Repo-wide grep (data_entry\.data\s*=) found:
- repositories/data_entry_repo.py:120: legitimate update().
- repositories/data_entry_repo.py:587, 602, 613: the read-path bug.
No other read method mutates .data.
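The grep pattern can be reproduced in a few lines of Python to see what it does and does not catch. This is a minimal sketch on made-up sample lines (none are taken from the repository). It shows that the pattern also matches `==` comparisons, so hits need a manual read, and that it misses assignments made through a differently named variable.

```python
import re

# Same pattern as the repo-wide grep in this plan.
PATTERN = re.compile(r"data_entry\.data\s*=")
# Stricter variant (an assumption of this sketch): skip "==" comparisons.
STRICT = re.compile(r"data_entry\.data\s*=(?!=)")

# Hypothetical sample lines, for illustration only.
sample_lines = [
    "data_entry.data = {**data_entry.data, 'x': 1}",  # assignment
    "if data_entry.data == {}:",                      # comparison
    "payload = data_entry.data",                      # plain read
    "entry.data = {}",                                # other variable name
]

hits = [line for line in sample_lines if PATTERN.search(line)]
strict_hits = [line for line in sample_lines if STRICT.search(line)]
```

Here `hits` contains the assignment and the comparison, while `strict_hits` contains only the assignment. Neither pattern sees `entry.data = {}`, so a rename of the loop variable would hide a mutation from this search.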
2. Option 1: Don't touch the ORM in to_response
Signature change
Protocol (app/schemas/data_entry.py:131):
def to_response(
    self,
    data_entry: T,
    enriched_data: dict | None = None,
) -> DataEntryResponseGen: ...
Each implementation does:
data = enriched_data if enriched_data is not None else data_entry.data
…and uses data in place of data_entry.data in the body.
In get_submodule_data, replace the three mutation blocks with a local enriched_data dict and pass it via handler.to_response(data_entry, enriched_data). The ORM row is never touched.
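The intended flow can be sketched with plain dataclasses. `FakeDataEntry`, `PlaneHandler`, `build_response`, and the field names are hypothetical stand-ins for the ORM row, a real handler, and the repository loop. The real handlers return a DataEntryResponseGen rather than a dict.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeDataEntry:
    """Stand-in for the DataEntry ORM row (hypothetical, illustration only)."""

    id: int
    data: dict[str, Any] = field(default_factory=dict)


class PlaneHandler:
    """Hypothetical handler that returns a plain dict as its response."""

    def to_response(
        self,
        data_entry: FakeDataEntry,
        enriched_data: dict[str, Any] | None = None,
    ) -> dict[str, Any]:
        # Prefer the caller-supplied enriched copy, else the stored data.
        data = enriched_data if enriched_data is not None else data_entry.data
        return {"id": data_entry.id, **data}


def build_response(handler: PlaneHandler, row: FakeDataEntry) -> dict[str, Any]:
    # Enrichment lives in a local dict; row.data is never reassigned.
    enriched_data = {**row.data, "primary_factor": {"kind": "plane"}}
    return handler.to_response(row, enriched_data)


row = FakeDataEntry(id=1, data={"distance_km": 1200})
response = build_response(PlaneHandler(), row)
```

After the call, `response` carries the computed `primary_factor` while `row.data` still holds only the original input, which is the property the repo regression test checks.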
Files to edit
Repo:
- backend/app/repositories/data_entry_repo.py (mutations at 587–620 → local dict)
Protocol:
- backend/app/schemas/data_entry.py:131
Handler implementations (14 sites, 9 files):
- backend/app/modules/buildings/schemas.py:249, 483, 599
- backend/app/modules/process_emissions/schemas.py:97
- backend/app/modules/external_cloud_and_ai/schemas.py:203, 385
- backend/app/modules/equipment_electric_consumption/schemas.py:227
- backend/app/modules/purchase/schemas.py:195, 296
- backend/app/modules/professional_travel/schemas.py:207
- backend/app/modules/research_facilities/animals_schemas.py:85
- backend/app/modules/research_facilities/common_schemas.py:101
- backend/app/modules/headcount/schemas.py:196, 251
3. Option 2: expunge defense
After unpacking each row, call self.session.expunge(...) on every ORM instance returned. This detaches the instances so any later mutation cannot trigger a flush.
Apply in data_entry_repo.py to:
- get_submodule_data (line 341)
- get_list (line 173)
- list_by_data_entry_type_and_year (line 195)
- get_headcount_members (line 778)
- get_member_by_institutional_id (line 808)
Add a private helper:
def _detach(self, *objs: Any) -> None:
    for obj in objs:
        if obj is not None:
            self.session.expunge(obj)
Note: get_submodule_data only accesses scalar columns and the JSON data after unpack (no lazy relationships), so expunging is safe. Add a comment to that effect.
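One detail to check when applying the helper per row: SQLAlchemy's Session.expunge raises InvalidRequestError for an instance that is not present in the session, which could happen if the same joined instance shows up in more than one row. The sketch below uses a hypothetical `StubSession` that mimics only that behavior (it raises ValueError instead). The membership guard is an addition of this sketch, not part of the plan, and it assumes the real session supports `obj in session`.

```python
from __future__ import annotations

from typing import Any


class StubSession:
    """Hypothetical stand-in that mimics only add/expunge/membership."""

    def __init__(self) -> None:
        self._attached: list[Any] = []

    def add(self, obj: Any) -> None:
        self._attached.append(obj)

    def __contains__(self, obj: Any) -> bool:
        return any(item is obj for item in self._attached)

    def expunge(self, obj: Any) -> None:
        if obj not in self:
            raise ValueError("instance is not present in this session")
        self._attached = [item for item in self._attached if item is not obj]

    @property
    def attached_count(self) -> int:
        return len(self._attached)


class ReadRepo:
    """Hypothetical repository holding only the detach helper."""

    def __init__(self, session: StubSession) -> None:
        self.session = session

    def _detach(self, *objs: Any) -> None:
        for obj in objs:
            # Skip None (outer joins) and instances already detached.
            if obj is not None and obj in self.session:
                self.session.expunge(obj)


session = StubSession()
entry_a, entry_b, shared_factor = object(), object(), object()
for instance in (entry_a, entry_b, shared_factor):
    session.add(instance)

repo = ReadRepo(session)
rows = [(entry_a, shared_factor), (entry_b, shared_factor), (None, None)]
for row in rows:
    repo._detach(*row)  # second row repeats shared_factor without raising
```

Whether rows in get_submodule_data can actually repeat an instance depends on the query shape, so treat the guard as something to verify rather than a required change.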
4. kg_co2eq must never be persisted in DataEntry.data
Where it leaks in today
- backend/app/services/data_ingestion/base_csv_provider.py:802-809 (special case adds kg_co2eq into filtered_row); line 952 builds DataEntry(data=...) with that key still inside.
- backend/app/services/data_ingestion/api_providers/professional_travel_api_provider.py:416-422 (writes kg_co2eq into data_payload).
- backend/app/services/data_entry_emission_service.py:182 (prepare_create reads data_entry.data.get("kg_co2eq") as the override).
- backend/app/services/data_entry_emission_service.py:486-492 (upsert_by_data_entry workaround that strips kg_co2eq from the in-memory Pydantic copy before recompute; defensive, doesn't affect DB).
New transient channel
Extend prepare_create:
async def prepare_create(
    self,
    data_entry: DataEntry | DataEntryResponse,
    kg_co2eq_override: float | None = None,
) -> list[DataEntryEmission]:
    ...
    if kg_co2eq_override is not None:
        # CSV/API ingestion path: build override emission record
Drop the data_entry.data.get("kg_co2eq") fallback at line 182 — the override is now exclusively passed in.
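The two paths can be sketched as a standalone coroutine. `Emission`, the `distance_km` key, the 0.25 coefficient, and factor id 7 are placeholders invented for this example. The `primary_factor_id=None` on the override path follows the semantics described in the Risks section.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class Emission:
    """Hypothetical stand-in for DataEntryEmission."""

    kg_co2eq: float
    primary_factor_id: int | None


async def prepare_create(
    data: dict[str, Any],
    kg_co2eq_override: float | None = None,
) -> list[Emission]:
    if kg_co2eq_override is not None:
        # Override path: one emission row, no factor, no formula.
        return [Emission(kg_co2eq=kg_co2eq_override, primary_factor_id=None)]
    # Formula path. Placeholder formula and factor id (assumptions).
    # Note: a stray "kg_co2eq" key in data is deliberately ignored here.
    return [Emission(kg_co2eq=data["distance_km"] * 0.25, primary_factor_id=7)]


overridden = asyncio.run(prepare_create({"distance_km": 1200}, kg_co2eq_override=42.0))
computed = asyncio.run(prepare_create({"distance_km": 1200, "kg_co2eq": 999.0}))
```

With the fallback removed, the second call runs the formula even though the payload still contains a `kg_co2eq` key, so a stale value left in stored data can no longer act as an override.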
Provider changes
- CSV provider (base_csv_provider.py): in _process_row, extract kg_co2eq from filtered_row before the DataEntry(data=...) build (line 952), keep it on a parallel structure (e.g. data_entry._kg_co2eq_override transient attribute, or zip alongside batch). Strip the key from data so the persisted row is clean. In _persist_batch (line 1126 region), pass kg_co2eq_override=... to prepare_create.
  Easiest carrier: add a private _csv_overrides_by_idx: dict[int, float] on the provider instance keyed by the index in the batch list, and look it up after bulk_create returns the response objects in batch order.
- API provider (professional_travel_api_provider.py): mirror the CSV provider: don't write kg_co2eq into data_payload; track it on a parallel mapping; pass to prepare_create.
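The index-keyed carrier could look roughly like this. `split_overrides` is a hypothetical free function written for illustration. In the plan the mapping lives on the provider instance as `_csv_overrides_by_idx`, and the lookup relies on bulk_create returning responses in batch order.

```python
from __future__ import annotations

from typing import Any


def split_overrides(
    rows: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], dict[int, float]]:
    """Strip kg_co2eq from each row and remember it by batch index."""
    clean_rows: list[dict[str, Any]] = []
    overrides_by_idx: dict[int, float] = {}
    for idx, row in enumerate(rows):
        payload = dict(row)  # copy so the caller's row is left untouched
        override = payload.pop("kg_co2eq", None)
        if override is not None:
            overrides_by_idx[idx] = float(override)
        clean_rows.append(payload)
    return clean_rows, overrides_by_idx


batch = [
    {"distance_km": 1200, "kg_co2eq": "42.5"},  # CSV values arrive as text
    {"distance_km": 300},
]
clean_rows, overrides_by_idx = split_overrides(batch)

# After persisting, pair each created entry with its override (or None).
pairs = [(row, overrides_by_idx.get(idx)) for idx, row in enumerate(clean_rows)]
```

Each pair then supplies the `kg_co2eq_override` argument for the matching prepare_create call, and none of the `clean_rows` payloads contains the key.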
Remove the workaround
upsert_by_data_entry (data_entry_emission_service.py:486-492): delete the strip block. It exists only to defend against the read-path mutation; once Option 1 lands, it's dead code.
5. EnergyCombustion latent bug fix
buildings/schemas.py:484-485:
primary_factor = data_entry.data.get("primary_factor", {})
factor_values = primary_factor.get("values", {}) # always {} — there is no nested "values"
The repo builds primary_factor as a flat {**values, **classification} dict. Fix the read site to use the flat shape:
data = enriched_data if enriched_data is not None else data_entry.data
primary_factor = data.get("primary_factor", {})
# primary_factor is already flat (factor.values merged with factor.classification)
factor_values = primary_factor
(Or pick whichever shape makes the response_dto field mapping work — verify against EnergyCombustionHandlerResponse field names.)
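A tiny reproduction of the shape mismatch, using made-up factor contents (only the kind and unit key names come from this plan):

```python
# Hypothetical factor contents; the repo merges them into one flat dict.
values = {"kg_co2eq_per_kwh": 0.2}
classification = {"kind": "natural_gas", "unit": "kWh"}
primary_factor = {**values, **classification}

# Current read site: looks for a nested "values" key that never exists.
nested_lookup = primary_factor.get("values", {})

# Fixed read site: treat primary_factor itself as the flat factor dict.
factor_values = primary_factor
name_from_factor = factor_values.get("kind")
unit_from_factor = factor_values.get("unit")
```

`nested_lookup` is always the empty dict, while the flat read exposes `kind` and `unit`.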
⚠️ PR-description note (frontend-visible behavior change): Today, factor_values = primary_factor.get("values", {}) always returns {}, so the response field name (and unit, via the same path) is derived only from data_entry.data.get("name") / get("unit"), which is None on most rows. After the fix, name/unit will start populating from the factor's kind/unit classification when those exist on the matched factor. The frontend's energy combustion submodule listing will start showing those values. Manual UI verification recommended (the user has agreed to check this manually).
6. Tests
Repo regression: read path doesn't pollute data
backend/tests/unit/repositories/test_data_entry_repo.py:
@pytest.mark.asyncio
async def test_get_submodule_data_does_not_persist_computed_fields(
    db_session: AsyncSession,
):
    """Listing must not write computed fields back to DataEntry.data."""
    # arrange: build a plane DataEntry with input-only data
    # act: call get_submodule_data, then commit
    # assert: refreshed row's `data` matches original input exactly
The test should fail today and pass after Options 1 & 2.
Add an analogous test for is_buildings_entry (asserts room_surface_square_meter is absent unless it was in the original input).
CSV regression: kg_co2eq only in emission, not in data
Per-year and per-module fixtures. Module coverage: every CSV-importable module that may carry a kg_co2eq column. Concretely, professional_travel (plane/train), buildings (energy_combustion, building_room), purchase, equipment, external_cloud_and_ai, headcount.
For each fixture:
- Run the CSV importer.
- Query data_entries and assert no row has a kg_co2eq key in data.
- Query data_entry_emissions and assert at least one row has the expected kg_co2eq value.
- Run a recompute (e.g. update the parent module) and assert the emission's kg_co2eq is recomputed via formula (not via the stripped CSV value) when there's no override on file.
Fixture location: backend/tests/fixtures/csv/regression_kg_co2eq/<year>/<module>.csv (dumb 2–3 row files).
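The two query assertions could share a small helper across fixtures. This is a sketch that takes already-fetched values. `check_kg_co2eq_placement` and its parameters are hypothetical names, and the actual queries against data_entries and data_entry_emissions are left to the test.

```python
from __future__ import annotations

import math
from typing import Any


def check_kg_co2eq_placement(
    data_payloads: list[dict[str, Any]],
    emission_values: list[float],
    expected_kg_co2eq: float,
) -> None:
    """Fail if kg_co2eq leaked into data or is missing from emissions."""
    leaked = [idx for idx, payload in enumerate(data_payloads) if "kg_co2eq" in payload]
    assert not leaked, f"kg_co2eq persisted in data for rows {leaked}"
    assert any(
        math.isclose(value, expected_kg_co2eq) for value in emission_values
    ), f"no emission with kg_co2eq={expected_kg_co2eq}"


# Passing case: clean data payloads, override value present on an emission.
check_kg_co2eq_placement([{"distance_km": 1200}], [42.5], 42.5)
```

Comparing with math.isclose avoids spurious failures if the stored value went through a float round trip.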
API-provider regression
Mirror the CSV test for the professional-travel API provider — feed it a synthetic API response with OUT_CO2_CORRECTED, assert the persisted DataEntry.data has no kg_co2eq key and DataEntryEmission.kg_co2eq matches the API value.
Unit test: prepare_create(kg_co2eq_override=...)
Direct unit test on DataEntryEmissionService.prepare_create:
- Given a DataEntry whose data does NOT contain kg_co2eq, calling with kg_co2eq_override=42.0 returns an emission with kg_co2eq=42.0.
- Without the override, the formula path runs.
7. Risks
- kg_co2eq_override semantics: the existing override path skipped factor-based formula computation entirely. After the refactor, that semantic must hold: the new kg_co2eq_override param must produce the same single-emission row with primary_factor_id=None.
- EnergyCombustion shape fix: ensure response_dto fields still resolve. The latent bug means consumers were getting {} for factor_values; verify the response_dto doesn't depend on a values-keyed sub-dict.
- API provider parity: ensure the same override path works in the API provider's call to bulk_create + prepare_create (verify it uses the same emission_service surface).
8. Implementation order (single PR)
- Add kg_co2eq_override param to prepare_create; remove data.get("kg_co2eq") fallback. Update unit tests.
- Update CSV provider to strip kg_co2eq from data and pass override transiently.
- Update API provider similarly.
- Remove workaround strip in upsert_by_data_entry.
- Change to_response protocol + 14 implementations (Option 1).
- Build enriched_data in get_submodule_data; remove ORM mutations.
- Add _detach / session.expunge to the 5 read methods (Option 2).
- Fix EnergyCombustion latent bug.
- Add regression tests (repo + CSV fixtures + API + prepare_create unit).
- Run full test suite.
Critical files
- backend/app/repositories/data_entry_repo.py
- backend/app/schemas/data_entry.py
- backend/app/services/data_entry_emission_service.py
- backend/app/services/data_ingestion/base_csv_provider.py
- backend/app/services/data_ingestion/api_providers/professional_travel_api_provider.py
- backend/app/modules/buildings/schemas.py
- backend/app/modules/professional_travel/schemas.py
- backend/app/modules/process_emissions/schemas.py
- backend/app/modules/external_cloud_and_ai/schemas.py
- backend/app/modules/equipment_electric_consumption/schemas.py
- backend/app/modules/purchase/schemas.py
- backend/app/modules/research_facilities/animals_schemas.py
- backend/app/modules/research_facilities/common_schemas.py
- backend/app/modules/headcount/schemas.py
- backend/tests/unit/repositories/test_data_entry_repo.py
- backend/tests/fixtures/csv/regression_kg_co2eq/...