Plan: Refactor Seed Factors to Use Ingestion Machinery
TL;DR: seed_generic_factors.py manually duplicates CSV parsing, handler resolution, and factor creation logic that already exists in ModulePerYearFactorCSVProvider. The refactoring introduces a thin LocalFactorCSVProvider subclass that bypasses file-store/job-tracking (API-only concerns) and rewrites seed_factors() to delegate to it.
Phase 1: Add deletion hook to BaseFactorCSVProvider (small, non-breaking)
File: base_factor_csv_provider.py
- Extract the `data_entry_type_to_iterates` block in `process_csv_in_batches` into a new overridable method `_get_types_to_delete(listed_entry_types)`.
- This is needed because `purchases_common_factors.csv` covers 7 of 8 purchase types; module-type-based deletion would also erase `additional_purchases` factors seeded just before it. Scoping deletion to the explicit configured types preserves the current behavior.
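The hook described above could look roughly like the following. This is a minimal sketch, not the project's code: the enum members, the `module_entry_types` attribute, the `ScopedProvider` name, and the assumption that the default deletes every entry type of the module are all hypothetical stand-ins chosen so the example runs on its own.

```python
from enum import Enum


class DataEntryType(Enum):
    # Hypothetical stand-ins; the real enum has eight purchase types.
    ADDITIONAL_PURCHASES = 1
    SCIENTIFIC_EQUIPMENT = 2
    IT_EQUIPMENT = 3


class BaseFactorCSVProvider:
    """Stand-in for the real base class; only the hook is modelled."""

    # Assumed: every entry type that belongs to the provider's module.
    module_entry_types = list(DataEntryType)

    def _get_types_to_delete(self, listed_entry_types):
        # Assumed default: module-wide deletion, as in the production path.
        return list(self.module_entry_types)

    def process_csv_in_batches(self, listed_entry_types):
        # The real method also parses rows and creates factors;
        # here we only return the deletion scope chosen by the hook.
        return self._get_types_to_delete(listed_entry_types)


class ScopedProvider(BaseFactorCSVProvider):
    """Hypothetical subclass that narrows deletion to explicit types."""

    def __init__(self, explicit_entry_types=None):
        self.explicit_entry_types = explicit_entry_types

    def _get_types_to_delete(self, listed_entry_types):
        if self.explicit_entry_types:
            return list(self.explicit_entry_types)
        return super()._get_types_to_delete(listed_entry_types)
```

With this shape, a provider configured with only the types covered by the common CSV leaves `ADDITIONAL_PURCHASES` out of the deletion scope, while a provider without explicit types falls back to the base behavior.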
Phase 2: Create LocalFactorCSVProvider (new file)
New file: app/services/data_ingestion/csv_providers/local_seed.py
Extends ModulePerYearFactorCSVProvider. Config keys:
| Key | Value |
|---|---|
| `local_file_path` | absolute path to seed CSV on disk |
| `module_type_id` | derived via `get_module_type_for_data_entry_type(config.data_entry_types[0])` |
| `data_entry_type_id` | `data_entry_types[0].value` if no `data_entry_type_column`, else `None` |
| `year` | `2025` (same default) |
| `explicit_entry_type_ids` | `[det.value for det in config.data_entry_types]`; scopes deletion |
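As an illustration of how the keys in the table might be derived from a seed config, here is a hedged sketch. The `csv_path` field, the enum members, the stubbed body of `get_module_type_for_data_entry_type`, and the helper name `build_provider_config` are assumptions made for the example; only the key names and derivation rules come from the table.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional


class DataEntryType(Enum):
    # Hypothetical members for illustration only.
    SCIENTIFIC_EQUIPMENT = 2
    IT_EQUIPMENT = 3


@dataclass
class FactorSeedConfig:
    csv_path: Path  # hypothetical field name
    data_entry_types: list
    data_entry_type_column: Optional[str] = None


def get_module_type_for_data_entry_type(det):
    # Stub: the real helper lives in module_type.py; 7 is a made-up id.
    return 7


def build_provider_config(seed, year=2025):
    """Translate a FactorSeedConfig into the provider's config keys."""
    first = seed.data_entry_types[0]
    return {
        "local_file_path": str(seed.csv_path.resolve()),
        "module_type_id": get_module_type_for_data_entry_type(first),
        # A per-row type column means no single entry type applies.
        "data_entry_type_id": None if seed.data_entry_type_column else first.value,
        "year": year,
        "explicit_entry_type_ids": [det.value for det in seed.data_entry_types],
    }
```

Keeping the derivation in one small function would make the single-type versus per-row-column distinction easy to unit test separately from the CSV processing.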
Override 4 methods:
- `validate_connection()`: check local file exists (no file store)
- `_setup_and_validate()`: read CSV bytes from disk; skip file-store move + job DB updates; return `setup_result` with `csv_text` + handler info
- `_finalize_and_commit()`: skip file-store moves + job DB updates; just process last batch + flush session
- `_get_types_to_delete()`: return `explicit_entry_type_ids` when provided; otherwise defer to `super()`
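The four overrides might be sketched as below. The real method signatures are not given in this plan, so everything here is an assumption: the parent class is a stand-in, the methods are written as synchronous, `_finalize_and_commit(session, last_batch)` takes hypothetical parameters, the UTF-8 decoding is a guess, and the returned `setup_result` omits the handler info.

```python
from pathlib import Path


class ModulePerYearFactorCSVProvider:
    """Stand-in for the real parent class; only what the sketch needs."""

    def __init__(self, config):
        self.config = config

    def _get_types_to_delete(self, listed_entry_types):
        # Assumed default: whatever the production path would delete.
        return list(listed_entry_types)


class LocalFactorCSVProvider(ModulePerYearFactorCSVProvider):
    def _path(self):
        return Path(self.config["local_file_path"])

    def validate_connection(self):
        # No file store involved: the "connection" is the file on disk.
        return self._path().is_file()

    def _setup_and_validate(self):
        if not self.validate_connection():
            raise FileNotFoundError(self._path())
        # Read bytes from disk; "utf-8-sig" also strips a BOM if present.
        csv_text = self._path().read_bytes().decode("utf-8-sig")
        return {"csv_text": csv_text}  # handler info omitted in this sketch

    def _finalize_and_commit(self, session, last_batch):
        # No file-store moves or job updates: persist the tail and flush.
        session.add_all(last_batch)
        session.flush()
        return len(last_batch)

    def _get_types_to_delete(self, listed_entry_types):
        explicit = self.config.get("explicit_entry_type_ids")
        if explicit:
            return list(explicit)
        return super()._get_types_to_delete(listed_entry_types)
```

The `session` object only needs `add_all()` and `flush()` here, which matches the shape of a SQLAlchemy `Session`, though the plan does not state which session type the provider actually uses.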
Phase 3: Refactor seed_generic_factors.py
File: seed_generic_factors.py
- Replace `seed_factors()` body: derive `module_type_id` and `data_entry_type_id` from `FactorSeedConfig`, instantiate `LocalFactorCSVProvider`, call `provider.process_csv_in_batches()`, print stats.
- Remove: `get_float_str_or_none()`, manual classification/value extraction (both replaced by the provider's `_process_row` + `_convert_value`).
- Keep: `FactorSeedConfig` dataclass, `FACTOR_SEEDS` list, `main()`; no external interface changes.
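A rough sketch of the delegating `seed_factors()` follows. It assumes the provider configs have already been built, that `process_csv_in_batches()` is synchronous, and that it returns a mapping of counter names to integers; none of this is confirmed by the plan. `FakeProvider` and its counter names are hypothetical and exist only so the example runs without the real ingestion code.

```python
from collections import Counter


class FakeProvider:
    """Hypothetical stand-in for LocalFactorCSVProvider."""

    def __init__(self, config):
        self.config = config

    def process_csv_in_batches(self):
        # Assumed return shape: counter name -> integer.
        return {"rows_processed": 3, "factors_created": 3, "rows_skipped": 0}


def seed_factors(provider_configs, provider_cls=FakeProvider):
    """Run one provider per seed CSV and print per-file and total stats."""
    totals = Counter()
    for config in provider_configs:
        stats = provider_cls(config).process_csv_in_batches()
        print(f"{config['local_file_path']}: {stats}")
        totals.update(stats)  # adds counts key by key
    print(f"total: {dict(totals)}")
    return dict(totals)
```

Returning the aggregated stats, in addition to printing them, would give the verification step a value to compare against expected counts.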
Relevant Files
- seed_generic_factors.py: main file to refactor
- base_factor_csv_provider.py: add `_get_types_to_delete` hook
- factors.py: `ModulePerYearFactorCSVProvider` to subclass
- app/services/data_ingestion/csv_providers/local_seed.py: new file
- module_type.py: `get_module_type_for_data_entry_type` to derive `module_type_id`
Verification
- `uv run pytest tests/ -k "factor" -v`: existing tests pass
- `uv run python -m app.seed.seed_generic_factors`: run refactored seed; check output stats match expected counts
- `make lint && make type-check`
Decisions
- Approach chosen: thin subclass adapter (`LocalFactorCSVProvider`). It only adds a small hook to the base class and leaves the production ingestion path untouched.
- Alternative rejected: uploading local CSVs to file store + creating `DataIngestionJob` records. This adds DB/storage overhead and requires the file store to be operational during seeding.
- `LocalFactorCSVProvider` lives in `csv_providers/` alongside the other provider implementations.
- No `DataIngestionJob` creation; seeds have no job tracking.
- Year remains hardcoded `2025` in the seed.