CSV seed formats¶
Why this exists¶
The backend ingests CSV files from operator uploads, integration-test fixtures, and (historically) seed scripts. Each upload routes through a CSV provider in backend/app/services/data_ingestion/csv_providers/. This page lists the supported formats, their column schemas, and idempotency rules — so a developer or LLM agent extending ingestion knows what to produce, parse, or assert against.
For the database entities these CSVs feed, see Database ERD.
Where to get the seed CSVs¶
The backend/seed_data/ directory is not tracked in git. It is a local mirror of a private SharePoint folder maintained by the team. A fresh clone will not have it. To populate it, ask a team member for access to the SharePoint source.
For a self-contained example of the CSV shapes documented below, see backend/tests/integration/data_ingestion/fixtures/ — the canonical in-repo reference for these formats.
Three families: factors, data, test¶
The seed-data folder follows a strict naming convention. Every CSV belongs to one of three families, distinguished by suffix:
- *_factors.csv: emission factors (ef_*) and unit definitions. One row per (classification, year) tuple. Loaded into the factors table.
- *_data.csv: per-unit observations, made of unit_institutional_id, the entity fields, an optional note, and an optional kg_co2eq override. One row per data entry. Loaded into data_entries.
- *_test.csv: a subset of the *_data.csv columns, with unit_institutional_id and kg_co2eq removed. Used as upload-pipeline fixtures, not seed data.
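The suffix convention above can be checked mechanically. The following is a minimal sketch, not code from the repository: the helper name classify_seed_csv and the file names used with it are hypothetical.

```python
from pathlib import PurePosixPath

# Suffix -> family, following the naming convention described above.
_FAMILY_SUFFIXES = {
    "_factors.csv": "factors",
    "_data.csv": "data",
    "_test.csv": "test",
}


def classify_seed_csv(filename: str) -> str:
    """Return the family of a seed CSV, or raise ValueError if none matches."""
    name = PurePosixPath(filename).name
    for suffix, family in _FAMILY_SUFFIXES.items():
        if name.endswith(suffix):
            return family
    raise ValueError(f"{name!r} does not match any seed-CSV family suffix")
```

A script that walks a local seed folder could call this helper first and skip or report any file that does not belong to one of the three families.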
flowchart LR
F["*_factors.csv row<br/>(ef_*, unit, classification)"] --> J{kg_co2eq<br/>in row?}
D["*_data.csv row<br/>(unit_id + entity fields)"] --> J
J -->|yes| OV["Use CSV value<br/>as-is"]
J -->|no| C["Compute<br/>quantity × ef × adj"]
OV --> E[(data_entry_emissions<br/>kg_co2eq)]
C --> E
For the runtime computation path, see data_entry_emission_service.py; for the override pass-through, see base_csv_provider.py.
kg_co2eq override semantics¶
Override behavior
The kg_co2eq column on *_data.csv is an override, not an input. When the column is present and non-empty, the backend skips the factor-based computation and uses the CSV value verbatim.
The two relevant code points:
- base_csv_provider.py:794-799: even though kg_co2eq is not in any handler's create_dto, the provider explicitly preserves it from the source row when present and non-empty.
- data_entry_emission_service.py:143-151: the emission computation logs "Using CSV-provided kg_co2eq=… override" and returns the float as-is, bypassing factor lookup.
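The override rule can be summarised as a small decision function. This is an illustrative sketch only: resolve_kg_co2eq and its parameters are hypothetical names, and the computed branch simply restates the quantity × ef × adj formula from the flowchart rather than the actual service code.

```python
from typing import Mapping, Optional


def resolve_kg_co2eq(
    row: Mapping[str, str],
    quantity: float,
    ef: float,
    adj: float = 1.0,
) -> float:
    """Return the CSV override when present and non-empty, else compute."""
    raw: Optional[str] = row.get("kg_co2eq")
    if raw is not None and raw.strip() != "":
        # Override path: the CSV value is used verbatim, no factor lookup.
        return float(raw)
    # Computed path, as in the flowchart: quantity x ef x adj.
    return quantity * ef * adj
```

Under this reading, a cell containing 0 is a non-empty value and therefore counts as an override of zero, whereas an empty cell falls back to the computation.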
For the complete per-file column tables, see CSV column inventory.
Provider overview¶
flowchart LR
CSV[CSV file] --> PF[ProviderFactory]
PF -->|MODULE_PER_YEAR + DATA_ENTRIES| MPY[ModulePerYearCSVProvider]
PF -->|MODULE_UNIT_SPECIFIC + DATA_ENTRIES| MUS[ModuleUnitSpecificCSVProvider]
PF -->|FACTORS| FAC[ModulePerYearFactorCSVProvider]
PF -->|REDUCTION_OBJECTIVES| RED[ModulePerYearReductionObjectivesApiProvider]
PF -->|REFERENCE_DATA| REF[ModulePerYearReferenceDataApiProvider]
MPY --> DE[(data_entries)]
MUS --> DE
FAC --> FT[(factors)]
RED --> YC[(year_configuration.config)]
Routing keys live in backend/app/services/data_ingestion/provider_factory.py.
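As a reading aid, the edges of the flowchart can be written as a lookup. This sketch is a simplification under stated assumptions: the real routing keys in provider_factory.py are longer tuples, the function pick_provider is hypothetical, and provider classes are represented here by their names only.

```python
from typing import Dict, Optional

# Data-entry uploads are routed by entity type (assumed two-level lookup).
_DATA_ENTRY_PROVIDERS: Dict[str, str] = {
    "MODULE_PER_YEAR": "ModulePerYearCSVProvider",
    "MODULE_UNIT_SPECIFIC": "ModuleUnitSpecificCSVProvider",
}

# Every other target type maps to a single provider in the flowchart.
_OTHER_PROVIDERS: Dict[str, str] = {
    "FACTORS": "ModulePerYearFactorCSVProvider",
    "REDUCTION_OBJECTIVES": "ModulePerYearReductionObjectivesApiProvider",
    "REFERENCE_DATA": "ModulePerYearReferenceDataApiProvider",
}


def pick_provider(target_type: str, entity_type: Optional[str] = None) -> str:
    """Return the provider name for a flowchart edge, or raise ValueError."""
    if target_type == "DATA_ENTRIES":
        if entity_type not in _DATA_ENTRY_PROVIDERS:
            raise ValueError(f"No data-entry provider for {entity_type!r}")
        return _DATA_ENTRY_PROVIDERS[entity_type]
    if target_type not in _OTHER_PROVIDERS:
        raise ValueError(f"No provider for target type {target_type!r}")
    return _OTHER_PROVIDERS[target_type]
```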
Common conventions¶
CSV providers share many — but not all — defaults. The list below splits them into truly-shared behaviour and per-provider differences.
Shared by all providers¶
- Delimiter: comma. Quote character: double quote.
- Header row: required; column order is free.
- Extra columns: silently ignored.
- Empty file: rejected with "CSV file is empty" (base_csv_provider.py:244, base_factor_csv_provider.py:300, base_reduction_objective_csv_provider.py:320). The data-entry path wraps this message with the prefix described below; the factor and reduction-objective paths raise a bare ValueError.
- Source paths: must start with tmp/, uploads/, or temporary/ (base_csv_provider.py::_validate_file_path).
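The shared conventions can be exercised with the standard csv module. The sketch below is not the provider code: check_source_path and read_rows are hypothetical helpers, the path error message is invented, and "empty" is assumed to mean "no header row".

```python
import csv
import io
from typing import Dict, List

ALLOWED_PREFIXES = ("tmp/", "uploads/", "temporary/")


def check_source_path(path: str) -> None:
    """Reject paths outside the allowed upload prefixes."""
    if not path.startswith(ALLOWED_PREFIXES):
        raise ValueError(f"Invalid source path: {path!r}")  # message is illustrative


def read_rows(text: str, expected: List[str]) -> List[Dict[str, str]]:
    """Parse comma-delimited, double-quoted CSV text; keep only expected columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter=",", quotechar='"')
    if reader.fieldnames is None:
        # No header row at all: treated as an empty file in this sketch.
        raise ValueError("CSV file is empty")
    # Column order is free and extra columns are dropped silently.
    return [{col: row.get(col) or "" for col in expected} for row in reader]
```

Because rows are read by header name, a file whose columns are reordered or that carries additional columns produces the same result.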
Per-provider differences¶
| Aspect | BaseCSVProvider (data entries) | BaseFactorCSVProvider (factors, reference data) | BaseReductionObjectiveCSVProvider |
|---|---|---|---|
| Encoding | utf-8-sig (BOM tolerated) — base_csv_provider.py:746 | plain utf-8 (no BOM tolerance) — base_factor_csv_provider.py:272 | utf-8-sig (BOM tolerated) — base_reduction_objective_csv_provider.py:279 |
| Header-validation error wrapping | Wrong CSV format or encoding: <message> (base_csv_provider.py:768) | bare ValueError (base_factor_csv_provider.py::_validate_csv_headers) | bare ValueError (base_reduction_objective_csv_provider.py::_validate_csv_headers) |
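The encoding row matters in practice because a byte-order mark changes the first header name. The following sketch uses only the standard library and a made-up two-column file to show what each codec yields; how the factor provider reacts to such a header is not shown here.

```python
import csv
import io
from typing import List

# A CSV saved with a UTF-8 byte-order mark, as some spreadsheet tools do.
raw = "category,ef_kg_co2eq_per_km\nshort_haul,0.156\n".encode("utf-8-sig")


def header_of(data: bytes, encoding: str) -> List[str]:
    """Decode the bytes with the given codec and return the header row."""
    text = data.decode(encoding)
    return next(csv.reader(io.StringIO(text)))


with_bom_stripped = header_of(raw, "utf-8-sig")  # ['category', 'ef_kg_co2eq_per_km']
with_bom_kept = header_of(raw, "utf-8")          # ['\ufeffcategory', 'ef_kg_co2eq_per_km']
```

With plain utf-8 the first column is named "\ufeffcategory", so a header check that looks for "category" would presumably fail. Saving factor CSVs without a BOM avoids the problem.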
Module-per-year CSV (data entries)¶
- Provider: ModulePerYearCSVProvider
- Path: backend/app/services/data_ingestion/csv_providers/module_per_year.py
- Target entity: data_entries (one row per CSV row).
- Used for: headcount, professional travel, buildings, purchases, research facilities, etc.
Required and expected columns¶
The only column required by the provider for every row is unit_institutional_id (module_per_year.py:111). Remaining columns are derived from the active BaseModuleHandler subclasses' create_dto.model_fields (see _get_expected_columns_from_handlers in base_csv_provider.py:79).
| Column | Type | Required | Description |
|---|---|---|---|
| unit_institutional_id | string | yes | Unit code; resolves to units.institutional_id. |
| handler-specific fields | varies | varies | Defined by the matching BaseModuleHandler.create_dto. |
| note | string | no | Free-form annotation kept on the data entry. |
For the headcount-member handler the columns are unit_institutional_id, name, position_title, position_category, user_institutional_id, fte, note (verified against backend/tests/integration/data_ingestion/fixtures/valid_module_per_year.csv).
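A header check for this format could look like the sketch below. It is an assumption-laden illustration: the function names and the "Missing required columns" message are hypothetical, while the "Wrong CSV format or encoding:" prefix is the wrapping listed for the data-entry path in the per-provider table.

```python
from typing import Iterable

# The one column the provider requires on every row.
REQUIRED_COLUMNS = {"unit_institutional_id"}


def validate_headers(fieldnames: Iterable[str]) -> None:
    """Raise a bare ValueError when a required column is absent."""
    missing = sorted(REQUIRED_COLUMNS - set(fieldnames))
    if missing:
        raise ValueError(f"Missing required columns: {', '.join(missing)}")


def validate_data_entry_headers(fieldnames: Iterable[str]) -> None:
    """Same check, with the data-entry error prefix applied."""
    try:
        validate_headers(fieldnames)
    except ValueError as exc:
        raise ValueError(f"Wrong CSV format or encoding: {exc}") from exc
```

A test asserting on error text for data-entry uploads would therefore match on the prefixed message, not the bare one.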
Idempotency¶
Delete-then-insert per affected module: rows previously inserted with source = CSV_MODULE_PER_YEAR are removed before the new batch is loaded (base_csv_provider.py::_delete_existing_entries_for_module_per_year; see also the implementation plan at docs/src/implementation-plans/220-csv-upload-implementation-summary.md). Manual user entries (source = USER_MANUAL) are never touched.
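The delete-then-insert behaviour can be modelled on an in-memory list. This is a sketch under assumptions: the Entry dataclass and reload_module are hypothetical stand-ins for the database rows and the provider, and only the two source values named above are taken from the document.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    module_id: int
    source: str
    name: str


def reload_module(store: List[Entry], module_id: int, new_names: List[str]) -> List[Entry]:
    """Drop earlier CSV rows for the module, then append the new batch."""
    kept = [
        e for e in store
        if not (e.module_id == module_id and e.source == "CSV_MODULE_PER_YEAR")
    ]
    # Manual entries and other modules' rows survive untouched.
    return kept + [Entry(module_id, "CSV_MODULE_PER_YEAR", n) for n in new_names]
```

Uploading the same file twice leaves the store in the same state as uploading it once, which is the property an idempotency test would assert.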
Example¶
unit_institutional_id,name,position_title,position_category,user_institutional_id,fte,note
UNIT001,John Doe,Professor,professor,UID001,1.0,
UNIT002,Jane Smith,Post-doctoral researcher,postdoctoral_assistant,UID002,0.5,
Module-unit-specific CSV (data entries)¶
- Provider: ModuleUnitSpecificCSVProvider
- Path: backend/app/services/data_ingestion/csv_providers/module_unit_specific.py
- Target entity: data_entries for one specific carbon_report_module_id.
- Used for: equipment, external AI, external clouds, and other unit-scoped uploads.
Required and expected columns¶
The provider requires a single data_entry_type_id in the upload config and loads one handler. Required columns come from the handler's create_dto.model_fields filtered by is_required() (base_csv_provider.py::_get_required_columns_from_handler). The CSV does not include unit_institutional_id — the unit is fixed by carbon_report_module_id from the API request.
For the equipment handler the columns are name, equipment_class, sub_class, active_usage_hours_per_week, standby_usage_hours_per_week, note (verified against backend/app/modules/equipment/schemas.py::EquipmentHandlerCreate).
Validation rules¶
- active_usage_hours_per_week + standby_usage_hours_per_week <= 168 (EquipmentModuleHandler.pre_compute).
- Other handlers attach their own field_validators to the create DTO.
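The weekly-hours rule follows from a week having 7 × 24 = 168 hours. The sketch below restates it as a standalone check; check_usage_hours is a hypothetical helper, and whether the real pre_compute raises or reports the violation differently is not covered here.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def check_usage_hours(active: float, standby: float) -> None:
    """Raise ValueError when active + standby exceeds the hours in a week."""
    total = active + standby
    if total > HOURS_PER_WEEK:
        raise ValueError(
            f"active + standby usage is {total} h/week, above {HOURS_PER_WEEK}"
        )
```

A total of exactly 168 is accepted, since the rule is an inclusive upper bound.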
Idempotency¶
Append-only. No prior rows are deleted: the deletion call is gated by an if self.entity_type == EntityType.MODULE_PER_YEAR guard (base_csv_provider.py:568), so MODULE_UNIT_SPECIFIC never enters the _delete_existing_entries_for_module_per_year branch. Confirmed in docs/src/implementation-plans/220-csv-upload-implementation-summary.md.
Example¶
name,equipment_class,sub_class,active_usage_hours_per_week,standby_usage_hours_per_week,note
Thermostat Numerique,Agitator / Incubator,CO2 incubators,12,150,Test equipment
ECRAN EIZO FLEXSCAN,Monitors,,8,160,
Factors CSV¶
- Provider: ModulePerYearFactorCSVProvider
- Path: backend/app/services/data_ingestion/csv_providers/factors.py
- Target entity: factors.
- Config inputs: module_type_id, optional data_entry_type_id, year.
Columns¶
Columns vary per BaseFactorHandler subclass. The expected set is create_dto.model_fields - FACTOR_META_FIELDS, where the meta-fields excluded from CSVs are id, classification, values, emission_type_id, data_entry_type_id, year (backend/app/schemas/factor.py:11). To find the columns for a given factor type, read its *FactorCreate class in backend/app/modules/<module>/schemas.py.
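The column rule is a set difference, sketched below with plain Python containers. The meta-field names are the ones listed above; the model_fields list is a hypothetical example of what a *FactorCreate class might expose, not a copy of any real schema.

```python
from typing import Iterable, List

# Meta-fields excluded from factor CSVs, as listed above.
FACTOR_META_FIELDS = {
    "id", "classification", "values",
    "emission_type_id", "data_entry_type_id", "year",
}


def expected_csv_columns(model_fields: Iterable[str]) -> List[str]:
    """Return the DTO fields that should appear as CSV columns, in order."""
    return [name for name in model_fields if name not in FACTOR_META_FIELDS]


# Hypothetical field list for a factor create DTO.
example_fields = ["id", "year", "data_entry_type_id", "category", "ef_kg_co2eq_per_km"]
columns = expected_csv_columns(example_fields)  # ['category', 'ef_kg_co2eq_per_km']
```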
Example — TravelPlaneFactorHandler (verified at backend/app/modules/professional_travel/schemas.py:412):
| Column | Type | Required | Description |
|---|---|---|---|
| category | enum | yes | very_short_haul, short_haul, medium_haul, or long_haul. |
| ef_kg_co2eq_per_km | float | yes | Emission factor; must be >= 0. |
| rfi_adjustment | float | yes | Radiative-forcing index multiplier. |
| class_adjustement | float | yes | Cabin-class multiplier (note: legacy spelling). |
| min_distance | float | yes | Lower bound of the distance band (km). |
| max_distance | float | yes | Upper bound of the distance band (km). |
Idempotency¶
Delete-then-insert by (data_entry_type_id, year) (base_factor_csv_provider.py:173-191). All existing factor rows for that combination are removed via factor_service.bulk_delete_by_data_entry_type before the new batch is inserted via factor_service.bulk_create. The set of types deleted is computed by _get_types_to_delete (base_factor_csv_provider.py:572-582); subclasses (notably LocalFactorCSVProvider in csv_providers/local_seed.py) override that hook to scope deletion to specific data-entry types when a single CSV covers only part of a module's types.
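The role of the deletion hook can be illustrated with two small classes. Everything here is a hypothetical model: FactorLoader and PartialFactorLoader are not the real providers, rows are plain dicts, and the base behaviour of deleting every type of the module is an assumption inferred from the description above.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


class FactorLoader:
    """Hypothetical stand-in for a factor provider (not the real class)."""

    def __init__(self, module_type_ids: Iterable[int]) -> None:
        self.module_type_ids = list(module_type_ids)

    def types_to_delete(self) -> List[int]:
        # Assumed default: every data-entry type of the module is wiped.
        return list(self.module_type_ids)

    def load(self, store: List[Row], year: int, new_rows: List[Row]) -> List[Row]:
        doomed = set(self.types_to_delete())
        kept = [
            r for r in store
            if not (r["year"] == year and r["data_entry_type_id"] in doomed)
        ]
        return kept + new_rows


class PartialFactorLoader(FactorLoader):
    """Scopes deletion to the types the CSV actually covers."""

    def __init__(self, module_type_ids: Iterable[int], covered: Iterable[int]) -> None:
        super().__init__(module_type_ids)
        self.covered = list(covered)

    def types_to_delete(self) -> List[int]:
        return list(self.covered)
```

With the narrower hook, loading a CSV that covers only one data-entry type leaves the factors of the module's other types in place for that year.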
Example¶
category,ef_kg_co2eq_per_km,rfi_adjustment,class_adjustement,min_distance,max_distance
very_short_haul,0.258,1.0,1.0,0,500
short_haul,0.156,2.0,1.0,500,1500
medium_haul,0.131,2.0,1.2,1500,3500
long_haul,0.151,2.0,1.4,3500,20000
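A quick pre-upload check of a plane-factor file might look like this. It is a sketch, not the handler's validation: parse_plane_factors is a hypothetical helper that only enforces the two rules stated in the column table (known category, non-negative emission factor) and converts the numeric columns to floats.

```python
import csv
import io
from typing import Dict, List

PLANE_CATEGORIES = {"very_short_haul", "short_haul", "medium_haul", "long_haul"}
NUMERIC_COLUMNS = (
    "ef_kg_co2eq_per_km", "rfi_adjustment", "class_adjustement",
    "min_distance", "max_distance",
)


def parse_plane_factors(text: str) -> List[Dict[str, object]]:
    """Parse the CSV text and validate category and emission factor."""
    parsed: List[Dict[str, object]] = []
    # Data rows start on line 2, after the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        if row["category"] not in PLANE_CATEGORIES:
            raise ValueError(f"line {line_no}: unknown category {row['category']!r}")
        values = {col: float(row[col]) for col in NUMERIC_COLUMNS}
        if values["ef_kg_co2eq_per_km"] < 0:
            raise ValueError(f"line {line_no}: ef_kg_co2eq_per_km must be >= 0")
        parsed.append({"category": row["category"], **values})
    return parsed
```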
Reduction-objective CSV¶
- Provider: ModulePerYearReductionObjectivesApiProvider
- Path: backend/app/services/data_ingestion/csv_providers/reduction_objectives.py
- Target entity: year_configuration.config.reduction_objectives.<key> (single JSON blob per category, not row-per-row).
- Selector: reduction_objective_type_id in the upload config picks one of three sub-formats (backend/app/schemas/year_configuration.py:119).
Sub-formats¶
| reduction_objective_type_id | config_key | Required columns |
|---|---|---|
| 0 (FOOTPRINT) | institutional_footprint | year, category, co2 |
| 1 (POPULATION) | population_projections | year, pop |
| 2 (SCENARIOS) | unit_scenarios | scenario, year, reduction_percentage |
Validation rules (backend/app/schemas/year_configuration.py:43-94):
- co2 >= 0, pop >= 0, 0.0 <= reduction_percentage <= 1.0.
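The sub-format table and the three range rules can be combined into one row check. The sketch below is illustrative: validate_objective_row and its error messages are hypothetical, and the real schemas in year_configuration.py may validate more than is shown.

```python
from typing import Mapping

REQUIRED_COLUMNS = {
    0: ("year", "category", "co2"),                   # FOOTPRINT
    1: ("year", "pop"),                               # POPULATION
    2: ("scenario", "year", "reduction_percentage"),  # SCENARIOS
}


def validate_objective_row(type_id: int, row: Mapping[str, str]) -> None:
    """Check required columns and the numeric range for one CSV row."""
    missing = [col for col in REQUIRED_COLUMNS[type_id] if not row.get(col)]
    if missing:
        raise ValueError(f"Missing values for: {', '.join(missing)}")
    if type_id == 0 and float(row["co2"]) < 0:
        raise ValueError("co2 must be >= 0")
    if type_id == 1 and float(row["pop"]) < 0:
        raise ValueError("pop must be >= 0")
    if type_id == 2 and not 0.0 <= float(row["reduction_percentage"]) <= 1.0:
        raise ValueError("reduction_percentage must be between 0.0 and 1.0")
```

Note that reduction_percentage is a fraction, so a 40 % reduction is written 0.4, not 40.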
Idempotency¶
The whole validated row set is written as a JSON list under year_configuration.config.reduction_objectives.<config_key>, replacing the previous value for that key.
Example (FOOTPRINT)¶
year,category,co2
2024,energy,1234.5
2024,food,567.8
2024,travel,890.1
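The replace-whole-key behaviour described under Idempotency can be sketched with a plain dict standing in for year_configuration.config. The function replace_footprint and the typed row shape (integer year, float co2) are assumptions made for illustration.

```python
import csv
import io
from typing import Any, Dict


def replace_footprint(config: Dict[str, Any], csv_text: str) -> Dict[str, Any]:
    """Return a new config whose institutional_footprint is the parsed CSV."""
    rows = [
        {"year": int(r["year"]), "category": r["category"], "co2": float(r["co2"])}
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    objectives = dict(config.get("reduction_objectives", {}))
    # Overwrite the key: earlier rows are replaced, never appended to.
    objectives["institutional_footprint"] = rows
    return {**config, "reduction_objectives": objectives}
```

Other keys under reduction_objectives, such as population_projections, are carried over unchanged, and re-uploading the same file yields the same config.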
Reference-data CSV¶
- Provider: ModulePerYearReferenceDataApiProvider
- Path: backend/app/services/data_ingestion/csv_providers/reference_data.py
- Status: Setup currently returns {}; column schema and validation rules are TBD. Do not rely on this format until the provider is fleshed out. Track work against the file above.
Where fixtures live¶
Integration-test CSV fixtures (the ground truth for shape and edge cases) sit under backend/tests/integration/data_ingestion/fixtures/. The fixture README in that directory documents valid examples, missing-column failures, extra-column tolerance, and empty-file handling.
Note: an older comment in that README points to backend/seed_data/ for production CSVs. That directory is not tracked in git, so it is absent from main and from any fresh clone; treat the fixtures directory as the canonical reference.
How to add a new format¶
- Subclass BaseCSVProvider, BaseFactorCSVProvider, or BaseReductionObjectiveCSVProvider and implement the abstract hooks (_setup_handlers_and_factors / _setup_handlers_and_context / _resolve_handler).
- Register the provider in ProviderFactory.PROVIDERS (or COMPUTED_FACTOR_PROVIDERS) in backend/app/services/data_ingestion/provider_factory.py under the right (module_type, ingestion_method, target_type, entity_type) key.
- Add a fixture CSV under backend/tests/integration/data_ingestion/fixtures/ and document its shape in the fixtures README.
- Add an integration test under backend/tests/integration/data_ingestion/ that exercises happy-path, missing-column, and idempotency scenarios.