Plan: Faker seed-data, ~800 rows, aligned to current ERD

TL;DR

The random Faker seeder under backend/app/seed/random_generator/ was sized for a multi-million-row stress test and had drifted from the current Pydantic schemas (e.g. active_usage_hours vs active_usage_hours_per_week, function/sciper vs position_title/position_category/user_institutional_id). It also wasn't reachable from make seed-data (commented-out call with a typo in the module path), and seed_all.py had broken imports.

This change:

Targets ~800 data_entry rows total (was unbounded, max ≈ 8M) so local dev / smoke testing stays fast.
Realigns every builder payload to the current create-DTO field names.
Splits the plane/train builder so each writes the IATA / station-name fields its schema requires.
Makes DATA_ENTRY_TYPE_TO_DTO exhaustive over MODULE_TYPE_TO_DATA_ENTRY_TYPES, so the generator's random pick can never KeyError mid-batch.
Enables the per-row Pydantic validation (was commented out) — drift now surfaces in the seeder, not on first read by the API.
Adds make seed-data-random and a regression smoke test.
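
The plane/train split can be sketched as below. This is a minimal, dependency-free illustration, not the seeder's code: the real builders use Faker, and the airport codes, station names, and ID format here are made-up placeholders. Only the field names and the cabin-class sets come from the drift-fix table in this plan.

```python
import random

# Placeholder pools for illustration only; not the seeder's real data.
IATA_CODES = ["GVA", "ZRH", "CDG", "LHR", "JFK"]
STATION_NAMES = ["Lausanne", "Geneve", "Zurich HB", "Bern", "Basel SBB"]

PLANE_CABIN_CLASSES = ("eco", "business", "first")
TRAIN_CABIN_CLASSES = ("first", "second")


def _fake_institutional_id(rng):
    # Assumed format: the real ID shape is not specified in this plan.
    return str(rng.randint(100000, 999999))


def build_plane_travel(rng=random):
    """Payload for a plane trip: IATA codes, three cabin classes."""
    origin, destination = rng.sample(IATA_CODES, 2)  # two distinct airports
    return {
        "user_institutional_id": _fake_institutional_id(rng),
        "origin_iata": origin,
        "destination_iata": destination,
        "cabin_class": rng.choice(PLANE_CABIN_CLASSES),
    }


def build_train_travel(rng=random):
    """Payload for a train trip: station names, two cabin classes."""
    origin, destination = rng.sample(STATION_NAMES, 2)  # two distinct stations
    return {
        "user_institutional_id": _fake_institutional_id(rng),
        "origin_name": origin,
        "destination_name": destination,
        "cabin_class": rng.choice(TRAIN_CABIN_CLASSES),
    }
```

Keeping the two builders separate means neither payload can carry the other mode's location fields, which is what a single shared builder could not guarantee.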

Row-count math

seed_carbon_reports creates one module per (unit × year × module_type) combination; the data-entries seeder loops over those modules and appends randint(MIN, MAX) entries to each:

total_rows ≈ NUM_UNITS × len(YEARS) × len(ALL_MODULE_TYPE_IDS) × avg(entries/module)
           = 5         × 3          × 8                       × 7                  = 840

NUM_UNITS reduced from 300 → 5 in app/seed/random_generator/populate_units_and_users.py.
NUM_USERS reduced from 1000 → 40 (must satisfy 3 × NUM_UNITS ≤ NUM_USERS ≤ 15 × NUM_UNITS so distribute_users converges).
YEARS = [2023, 2024, 2025] (unchanged in seed_carbon_reports.py).
entries_per_module window changed from randint(10, 220) (avg 115) → randint(4, 10) (avg 7) in seed_data_entries.py. The bounds are pulled out as ENTRIES_PER_MODULE_MIN/MAX module constants and asserted by the smoke test (test_entries_per_module_window_targets_800_rows).

Target: 800 ±100. Actual expected mean: 840.
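
The arithmetic above can be checked with a few lines of Python. This is a sketch using the constants quoted in this section; `NUM_MODULE_TYPES` stands in for `len(ALL_MODULE_TYPE_IDS)`, and the helper names are hypothetical.

```python
NUM_UNITS = 5
NUM_USERS = 40
YEARS = [2023, 2024, 2025]
NUM_MODULE_TYPES = 8  # stands in for len(ALL_MODULE_TYPE_IDS)
ENTRIES_PER_MODULE_MIN = 4
ENTRIES_PER_MODULE_MAX = 10


def module_count():
    """One module per (unit, year, module_type) combination."""
    return NUM_UNITS * len(YEARS) * NUM_MODULE_TYPES


def expected_total_rows():
    """Mean row count: randint(lo, hi) is uniform, so its mean is (lo + hi) / 2."""
    avg_entries = (ENTRIES_PER_MODULE_MIN + ENTRIES_PER_MODULE_MAX) / 2
    return module_count() * avg_entries


def row_count_bounds():
    """Hard lower and upper bounds if every module draws MIN or MAX."""
    n = module_count()
    return n * ENTRIES_PER_MODULE_MIN, n * ENTRIES_PER_MODULE_MAX


# Constraint quoted above for distribute_users.
users_ok = 3 * NUM_UNITS <= NUM_USERS <= 15 * NUM_UNITS

expected_mean = expected_total_rows()  # 840.0
lower, upper = row_count_bounds()      # (480, 1200)
```

Under this formula the 800 ±100 band describes the expected mean; a single run can in principle land anywhere between 480 and 1200 rows, although with 120 independent draws the total will usually sit close to 840.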

Drift fixes (per builder)

| Builder | Before | After |
| --- | --- | --- |
| `build_professional_travel` (single) | `traveler_name`, `origin_location_id`, `destination_location_id`, `transport_mode` | Split into `build_plane_travel` (writes `user_institutional_id`, `origin_iata`, `destination_iata`, `cabin_class` ∈ {eco,business,first}) and `build_train_travel` (writes `user_institutional_id`, `origin_name`, `destination_name`, `cabin_class` ∈ {first,second}) |
| `build_equipment` | `equipment_class` Optional, `active_usage_hours`, `passive_usage_hours`, sum unbounded | `equipment_class` required; hour fields renamed to `active_usage_hours_per_week`/`standby_usage_hours_per_week`; sum capped at 168 to satisfy `_EquipmentUsageHoursValidationMixin` |
| `build_headcount` | `function`, `sciper`, missing `user_institutional_id` | `position_title`, `position_category` (from `POSITION_CATEGORY_VALUES`), required `user_institutional_id` |
| `build_external_cloud` | `provider` wrapped in `maybe()` | `provider` always present, adds `currency` ∈ |
| `build_external_ai` | `requests_per_user_per_day` was an int | Drawn from `REQUESTS_FREQUENCY_OPTIONS` string enum; `fte_count` ≥ 0.1 per validator |
| `build_purchase` | `total_spent_amount` wrapped in `maybe()` | Required; `currency` added |
| (new) `build_purchase_additional` | (none) | `name`, `unit`, `annual_consumption`, `coef_to_kg` |
| (new) `build_building_room` | (none) | Required `building_name`, `room_name`; `room_type` from `VALID_ROOM_TYPES`; ratio ∈ [0,1] |
| (new) `build_energy_combustion` | (none) | Required `name`, `quantity` ≥ 0 |
| (new) `build_building_embodied_energy` | (none) | Required `building_name` |
| (new) `build_process_emissions` | (none) | Required `category`, `quantity` ≥ 0 |
| (new) `build_research_facility_common` | (none) | All-optional payload matching `ResearchFacilitiesCommonHandlerCreate` |
| (new) `build_research_facility_animal` | (none) | Common payload + `researchfacility_type` |
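
As an illustration of the `build_equipment` row, one way to keep the weekly hours within the 168-hour cap is to draw the standby hours from whatever the active hours leave over. This is a sketch under assumptions: the real builder may use a different sampling strategy or float hours, and the class names below are placeholders.

```python
import random

HOURS_PER_WEEK = 168  # 7 days x 24 hours

# Placeholder values; the real builder draws equipment classes elsewhere.
EQUIPMENT_CLASSES = ["centrifuge", "freezer", "microscope"]


def build_equipment(rng=random):
    """Equipment payload whose active + standby hours never exceed one week."""
    active = rng.randint(0, HOURS_PER_WEEK)
    # Draw standby from the remaining budget so the sum is capped by construction.
    standby = rng.randint(0, HOURS_PER_WEEK - active)
    return {
        "equipment_class": rng.choice(EQUIPMENT_CLASSES),  # required, never None
        "active_usage_hours_per_week": active,
        "standby_usage_hours_per_week": standby,
    }
```

Capping by construction avoids a rejection loop: every generated payload satisfies the sum constraint on the first draw.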

DATA_ENTRY_TYPE_TO_DTO was also wrong on three rows: building, process_emissions, and scientific_equipment all mapped to EquipmentHandlerCreate, where each should map to its module-native DTO. The rewritten map now covers every reachable DataEntryTypeEnum value, asserted by test_dto_map_covers_every_reachable_data_entry_type.
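
The coverage check amounts to a set difference between the entry types reachable through MODULE_TYPE_TO_DATA_ENTRY_TYPES and the keys of DATA_ENTRY_TYPE_TO_DTO. The sketch below uses a hypothetical three-member enum and placeholder DTO classes to show the shape of that check; it is not the project's test.

```python
from enum import Enum


class DataEntryType(Enum):
    # Hypothetical subset; the real DataEntryTypeEnum has more members.
    PLANE = "plane"
    TRAIN = "train"
    BUILDING = "building"


class PlaneDTO: ...       # placeholder DTO classes
class TrainDTO: ...
class BuildingDTO: ...


MODULE_TYPE_TO_DATA_ENTRY_TYPES = {
    "professional_travel": [DataEntryType.PLANE, DataEntryType.TRAIN],
    "buildings": [DataEntryType.BUILDING],
}

DATA_ENTRY_TYPE_TO_DTO = {
    DataEntryType.PLANE: PlaneDTO,
    DataEntryType.TRAIN: TrainDTO,
    DataEntryType.BUILDING: BuildingDTO,
}


def missing_dto_entries(module_map, dto_map):
    """Return reachable entry types that have no DTO registered."""
    reachable = {t for types in module_map.values() for t in types}
    return reachable - set(dto_map)


uncovered = missing_dto_entries(MODULE_TYPE_TO_DATA_ENTRY_TYPES, DATA_ENTRY_TYPE_TO_DTO)
```

An empty result means a random pick from any module's entry types can always be resolved to a DTO, so the lookup cannot raise KeyError mid-batch.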

`data_entry_emissions` schema drift

Beyond the per-row payload drift, the emissions writer still wrote two columns that no longer exist on the table and omitted two that had been added:

| Column | Old seeder | Current `DataEntryEmissionBase` |
| --- | --- | --- |
| `subcategory` (TEXT) | written | removed (emission_type.path is the source of truth) |
| `formula_version` (TEXT) | written as top-level column | folded into `meta.formula_version` |
| `additional_value` (FLOAT) | not written | new nullable polymorphic quantity |
| `scope` (INT) | not written | scope id (1/2/3 or NULL on rollups) |

copy_insert_emissions now creates a tmp table whose columns match the live schema in order, and generate_emissions_for_entry emits an 8-tuple in the same order (entry_id, emission_type, primary_factor_id, kg_co2eq, additional_value, scope, meta, computed_at). formula_version is preserved inside meta so the seeded trace stays auditable.
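
A small sketch of the 8-tuple, using the field names exactly as listed above. The `EmissionRow` type and `generate_emission_row` helper are hypothetical names for illustration; the real function signature, the `meta` contents beyond `formula_version`, and the default version string are assumptions.

```python
import json
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class EmissionRow(NamedTuple):
    """One row, in the same order as the temporary table's columns."""
    entry_id: int
    emission_type: int
    primary_factor_id: Optional[int]
    kg_co2eq: float
    additional_value: Optional[float]
    scope: Optional[int]
    meta: str  # JSON text
    computed_at: datetime


def generate_emission_row(entry_id, emission_type, kg_co2eq, scope=None,
                          additional_value=None, formula_version="v1"):
    # formula_version lives inside meta rather than in its own column.
    meta = json.dumps({"formula_version": formula_version})
    return EmissionRow(
        entry_id=entry_id,
        emission_type=emission_type,
        primary_factor_id=None,  # the generator skips factor lookups
        kg_co2eq=kg_co2eq,
        additional_value=additional_value,
        scope=scope,
        meta=meta,
        computed_at=datetime.now(timezone.utc),
    )


row = generate_emission_row(entry_id=1, emission_type=7, kg_co2eq=12.5, scope=3)
```

Deriving the column list from `EmissionRow._fields` is one way to keep the tuple order and the temporary table definition from drifting apart again.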

This drift was the second crash uncovered while running the generator against a real DB; without the fix, seed-data-random would raise UndefinedColumnError on the very first emissions batch.

Enabled per-row validation

seed_data_entries.py:generate_data_entries_for_module previously commented out the Pydantic-DTO instantiation. With the drift fixes in place, the seeder now does:

dto_instance = dto_class(
    data_entry_type_id=data_entry_type.value,
    carbon_report_module_id=module_id,
    **payload_dict,
)
rows.append((..., json.dumps(dto_instance.data, default=str), ...))

DataEntryPayloadMixin.unflatten_payload wraps the flat builder dict into {"data": {...}}; we persist dto_instance.data so the JSONB column matches the API-write shape. Any future builder/schema drift now blows up at seed time with a Pydantic ValidationError, not silently as a stale JSON column.
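
To make the wrapping step concrete, here is a dependency-free stand-in for what the text describes. It is not the project's mixin: the real `unflatten_payload` is part of a Pydantic model and also validates field types, while `check_required` below is a hypothetical substitute for that validation.

```python
import json

# Envelope keys taken from the constructor call above; everything else is payload.
ENVELOPE_FIELDS = ("data_entry_type_id", "carbon_report_module_id")


def unflatten_payload(values):
    """Move every non-envelope key under "data"; leave pre-wrapped input alone."""
    if "data" in values:
        return dict(values)
    wrapped = {k: values[k] for k in ENVELOPE_FIELDS if k in values}
    wrapped["data"] = {k: v for k, v in values.items() if k not in ENVELOPE_FIELDS}
    return wrapped


def check_required(data, required):
    """Hypothetical stand-in for DTO validation: fail fast on missing fields."""
    missing = [name for name in required if data.get(name) is None]
    if missing:
        raise ValueError(f"missing required fields: {missing}")


flat = {
    "data_entry_type_id": 3,
    "carbon_report_module_id": 42,
    "name": "diesel",
    "quantity": 12.5,
}
wrapped = unflatten_payload(flat)
check_required(wrapped["data"], ["name", "quantity"])

# Only the inner dict is persisted, matching the API-write shape.
json_column = json.dumps(wrapped["data"], default=str)
```

If a builder stops emitting a required field, `check_required` raises before anything is appended to the batch, which mirrors the seed-time failure described above.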

Makefile change

The existing seed-data target keeps its behavior (it still runs the small CSV / locations / building-rooms / factors seeders); only its stale commented-out line was removed.

Added a dedicated target so contributors can opt in to the heavier random data without changing existing flows:

.PHONY: seed-data-random
seed-data-random: ## Seed ~800 random data_entry rows via Faker (issue #222)
    $(UV) run -m app.seed.random_generator.seed_all

seed_all.py itself was rewired:

Imports now point at app.seed.random_generator.* (were app.seed.*, which didn't resolve).
The previously-commented populate_units_and_users call is now active — the random orchestrator no longer depends on a hand-seeded users table.

User & lab seeding (existing, no net-new)

The pre-existing populate_units_and_users.py already handles labs (units) and users via asyncpg COPY, including admin-role grants. That code path covers the success criterion "Seed data are generated for user and labs" in issue #222; this change only resizes its constants.

Regression smoke test

New: backend/tests/unit/seed/test_random_generator_builders.py. It runs without a DB and would catch a regression of any drift fix above. 18 parametrized cases, all green:

For each (dto_class, builder) pair: 50 generated payloads must validate.
Every reachable DataEntryTypeEnum is in DATA_ENTRY_TYPE_TO_DTO.
Every DTO in the map has a registered builder.
The configured ENTRIES_PER_MODULE_MIN/MAX window keeps the expected row total in the 800 ±100 band.
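
The first check in the list, 50 payloads per (dto_class, builder) pair, can be sketched without pytest or Pydantic. Everything here is illustrative: the real test is parametrized and validates against the actual DTO classes, whereas this version pairs one sample builder with a hand-written validator.

```python
import random

SAMPLES_PER_BUILDER = 50


def build_energy_combustion(rng):
    """Sample builder: required name, non-negative quantity."""
    return {"name": f"fuel-{rng.randint(1, 99)}", "quantity": rng.uniform(0, 1000)}


def validate_energy_combustion(payload):
    """Hand-written stand-in for instantiating the DTO."""
    if not payload.get("name"):
        raise ValueError("name is required")
    if payload.get("quantity") is None or payload["quantity"] < 0:
        raise ValueError("quantity must be >= 0")


PAIRS = [(validate_energy_combustion, build_energy_combustion)]


def check_builders(pairs, samples=SAMPLES_PER_BUILDER, seed=0):
    """Run each builder `samples` times and collect validation failures."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    failures = []
    for validate, build in pairs:
        for _ in range(samples):
            try:
                validate(build(rng))
            except ValueError as exc:
                failures.append((build.__name__, str(exc)))
    return failures


failures = check_builders(PAIRS)
```

Running many samples per builder matters because optional or randomly chosen fields may only violate a constraint on some draws; a single sample could pass by luck.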

Files touched

backend/app/seed/random_generator/seed_data_entries.py — builders + DTO map rewritten, validation enabled, row count tuned.
backend/app/seed/random_generator/populate_units_and_users.py — NUM_UNITS=5, NUM_USERS=40.
backend/app/seed/random_generator/seed_all.py — fixed imports, wired in populate_units_and_users, dropped unused clean-data hook.
backend/Makefile — added seed-data-random target, removed stale typo'd comment from seed-data.
backend/tests/unit/seed/test_random_generator_builders.py — new regression net.

Follow-up: closing remaining success criteria

Landed after the initial PR was merged to dev:

Class/sub_class driven by factors — build_equipment() now draws (equipment_class, sub_class) from the live factors table (loaded once at seed_data_entries.main() startup into _EQUIPMENT_CLASS_POOL). Falls back to fake.word() only when the pool is empty (unit tests). Regression test test_build_equipment_uses_factor_pool_when_populated pins this.
User-guide doc — docs/src/backend/11-SEED-DATA.md covers the two seed pipelines, the login-test perimeter map, and the volume tuning constants.
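
The factor-pool behavior described for build_equipment() can be sketched as a pick with a fallback. The function name, the pool contents, and the fallback value are hypothetical; the real code draws from `_EQUIPMENT_CLASS_POOL` and falls back to `fake.word()`.

```python
import random
from typing import List, Optional, Tuple

ClassPair = Tuple[str, Optional[str]]


def pick_equipment_class(pool: List[ClassPair], rng=random) -> ClassPair:
    """Draw (equipment_class, sub_class) from the pool, or fabricate one if empty."""
    if pool:
        return rng.choice(pool)
    # Empty pool (e.g. unit tests without a DB): stand-in for fake.word().
    return (f"class-{rng.randint(1, 999)}", None)


# Example pool, as it might look after loading distinct pairs from factors.
example_pool = [("freezer", "minus_80"), ("freezer", "minus_20"), ("centrifuge", None)]

from_pool = pick_equipment_class(example_pool, random.Random(0))
fallback = pick_equipment_class([], random.Random(0))
```

Passing the pool as an argument keeps the sketch easy to test; the module-level pool described above trades that for loading the factors only once per seeding run.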

Out of scope

The Faker generator still skips data_entry_emissions factor lookups (random EmissionType + null primary_factor_id); aligning emissions to real factor rows is a follow-up if/when benchmarks need it.
data_entries.data rows reference user_institutional_id values that do not match the seeded users.institutional_id set — the seed flow has no FK between the JSON payload and the users table, so this is cosmetic. Closing the loop is a follow-up (would require sampling from unit_user_rows rather than randint).
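
For the second item, a possible shape of the follow-up is sketched below. It assumes `unit_user_rows` can be read as (unit_id, institutional_id) pairs, which this plan does not confirm; the helper name is hypothetical.

```python
import random
from collections import defaultdict


def make_institutional_id_sampler(unit_user_rows, rng=random):
    """Return a function that picks a seeded user's ID for a given unit."""
    ids_by_unit = defaultdict(list)
    for unit_id, institutional_id in unit_user_rows:  # assumed row shape
        ids_by_unit[unit_id].append(institutional_id)

    def sample(unit_id):
        candidates = ids_by_unit.get(unit_id)
        if not candidates:
            raise KeyError(f"no seeded users for unit {unit_id}")
        return rng.choice(candidates)

    return sample


rows = [(1, "100001"), (1, "100002"), (2, "200001")]
sample_id = make_institutional_id_sampler(rows, random.Random(0))
picked = sample_id(1)
```

With a sampler like this, each payload's user_institutional_id would refer to a user that exists in the same unit, instead of an unrelated random number.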