Source Tracking Implementation Summary
Overview
Implemented source tracking for data entries to enable selective deletion based on upload method. This solves the problem of having three different CSV upload paths with different deletion behaviors:
- MODULE_UNIT_SPECIFIC: Add data entries (no deletion)
- MODULE_PER_YEAR: Replace/add ONLY data uploaded through module_per_year
- FACTORS: Drop all corresponding factors and insert
Changes Made
1. Database Migration
File: backend/alembic/versions/2026_03_18_1400-2026031801_add_source_tracking_to_data_entries.py
Added two columns to data_entries table:
- source (Integer, nullable) - DataEntrySourceEnum value
- created_by_id (Integer, nullable) - user.id or data_ingestion_job.id
Both columns are indexed for query performance.
2. Model Updates
File: backend/app/models/data_entry.py
Added DataEntrySourceEnum enum with values:
- USER_MANUAL = 0 - Manual entry via UI
- CSV_MODULE_PER_YEAR = 1 - CSV upload via module_per_year provider
- CSV_MODULE_UNIT_SPECIFIC = 2 - CSV upload via module_unit_specific provider
- API_MODULE_PER_YEAR = 3 - API upload for module per year
- API_MODULE_UNIT_SPECIFIC = 4 - API upload for unit specific module
- EXTERNAL_INTEGRATION = 5 - Third-party integration or import
Added fields to DataEntry model:
- source: Optional[DataEntrySourceEnum]
- created_by_id: Optional[int]
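As a rough illustration of the enum listed above, the sketch below models it as an `IntEnum`, which is an assumption: the summary only says the column is an Integer holding the enum value. The `parse_source` helper is hypothetical and exists only to show how a NULL column value can stay `None` instead of being mapped to a guessed origin.

```python
from enum import IntEnum
from typing import Optional


class DataEntrySourceEnum(IntEnum):
    """Origin of a data entry (values copied from the list above)."""

    USER_MANUAL = 0
    CSV_MODULE_PER_YEAR = 1
    CSV_MODULE_UNIT_SPECIFIC = 2
    API_MODULE_PER_YEAR = 3
    API_MODULE_UNIT_SPECIFIC = 4
    EXTERNAL_INTEGRATION = 5


def parse_source(raw: Optional[int]) -> Optional[DataEntrySourceEnum]:
    """Hypothetical helper: map a stored column value back to the enum.

    None (SQL NULL) is returned unchanged so legacy rows keep an unknown origin.
    """
    return None if raw is None else DataEntrySourceEnum(raw)
```

With an integer-backed enum, the stored value `1` round-trips to `DataEntrySourceEnum.CSV_MODULE_PER_YEAR`, and an unknown integer raises `ValueError` rather than being silently accepted.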
3. Repository Layer
File: backend/app/repositories/data_entry_repo.py
Added method:
async def bulk_delete_by_source(
    self,
    carbon_report_module_id: int,
    data_entry_type_id: DataEntryTypeEnum,
    source: DataEntrySourceEnum,
) -> None
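The summary gives only the signature, so the following is a minimal sketch of the filter such a method would plausibly apply. It is not the repository's actual code: it uses the standard-library `sqlite3` module in place of the project's async database layer, assumes a `data_entry_type_id` column named after the parameter, and returns the deleted row count for demonstration (the real method returns `None`).

```python
import sqlite3

CSV_MODULE_PER_YEAR = 1  # DataEntrySourceEnum value from the model section


def bulk_delete_by_source(conn, carbon_report_module_id, data_entry_type_id, source):
    """Delete rows matching module, entry type AND source; return rows removed."""
    cur = conn.execute(
        "DELETE FROM data_entries "
        "WHERE carbon_report_module_id = ? "
        "AND data_entry_type_id = ? "
        "AND source = ?",
        (carbon_report_module_id, data_entry_type_id, source),
    )
    return cur.rowcount


# In-memory table with an assumed, simplified schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_entries ("
    "id INTEGER PRIMARY KEY, carbon_report_module_id INTEGER, "
    "data_entry_type_id INTEGER, source INTEGER, created_by_id INTEGER)"
)
conn.executemany(
    "INSERT INTO data_entries "
    "(carbon_report_module_id, data_entry_type_id, source, created_by_id) "
    "VALUES (?, ?, ?, ?)",
    [
        (456, 7, 0, 10),       # manual entry
        (456, 7, 1, 123),      # module_per_year CSV
        (456, 7, 1, 123),      # module_per_year CSV
        (456, 7, 2, 124),      # module_unit_specific CSV
        (456, 7, None, None),  # legacy row, unknown source
        (999, 7, 1, 125),      # same source, different module
    ],
)

deleted = bulk_delete_by_source(conn, 456, 7, CSV_MODULE_PER_YEAR)
remaining = conn.execute("SELECT COUNT(*) FROM data_entries").fetchone()[0]
```

Because the predicate is `source = ?`, rows with a NULL source never match, so legacy rows survive the delete along with manual and unit-specific entries.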
4. Service Layer
File: backend/app/services/data_entry_service.py
Updated bulk_create() to accept:
- source: Optional[DataEntrySourceEnum]
- created_by_id: Optional[int]
Added new method:
async def bulk_delete_by_source(
    self,
    carbon_report_module_id: int,
    data_entry_type_id: DataEntryTypeEnum,
    source: DataEntrySourceEnum,
    user: Optional[UserRead] = None,
    request_context: Optional[dict] = None,
    background_tasks: Optional[BackgroundTasks] = None,
) -> None
5. CSV Provider Base Class
File: backend/app/services/data_ingestion/base_csv_provider.py
Added source tracking to _process_batch():
- Automatically determines source from entity_type
- Passes source and created_by_id to bulk_create()
Added deletion logic for MODULE_PER_YEAR:
- _delete_existing_entries_for_module_per_year() method
- Called before processing new CSV data
- Deletes only entries with source = CSV_MODULE_PER_YEAR
- Preserves manual entries and unit-specific uploads
Added helper method:
def _get_source_from_entity_type(self) -> DataEntrySourceEnum | None
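One plausible shape for this helper is a lookup table from entity type to source. The sketch below assumes that; the `EntityType` enum, its string values, and the standalone function form are hypothetical stand-ins, since the summary does not show how entity_type is represented on the provider.

```python
from enum import Enum, IntEnum
from typing import Optional


class EntityType(Enum):
    """Hypothetical stand-in for the provider's entity_type values."""

    MODULE_PER_YEAR = "module_per_year"
    MODULE_UNIT_SPECIFIC = "module_unit_specific"
    FACTORS = "factors"


class DataEntrySourceEnum(IntEnum):
    CSV_MODULE_PER_YEAR = 1
    CSV_MODULE_UNIT_SPECIFIC = 2


# FACTORS is deliberately absent: factor uploads carry no source tracking.
_SOURCE_BY_ENTITY_TYPE = {
    EntityType.MODULE_PER_YEAR: DataEntrySourceEnum.CSV_MODULE_PER_YEAR,
    EntityType.MODULE_UNIT_SPECIFIC: DataEntrySourceEnum.CSV_MODULE_UNIT_SPECIFIC,
}


def get_source_from_entity_type(entity_type: EntityType) -> Optional[DataEntrySourceEnum]:
    """Return the CSV source for an entity type, or None when untracked."""
    return _SOURCE_BY_ENTITY_TYPE.get(entity_type)
```

Returning `None` for unmapped types matches the `DataEntrySourceEnum | None` return annotation and leaves the caller free to skip source tracking entirely.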
Behavior
MODULE_PER_YEAR CSV Upload
- Before processing: Delete all existing entries where source = CSV_MODULE_PER_YEAR for affected modules
- During processing: Set source = CSV_MODULE_PER_YEAR, created_by_id = job.id
- Result: Replaces only previous CSV uploads, preserves manual entries
MODULE_UNIT_SPECIFIC CSV Upload
- No deletion: Entries are added without removing existing data
- During processing: Set source = CSV_MODULE_UNIT_SPECIFIC, created_by_id = job.id
- Result: Cumulative additions
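The difference between the two upload paths can be illustrated with a small in-memory simulation. This is a sketch only: the `Entry` dataclass, the list-based store, and both function names are hypothetical and stand in for the real providers and database.

```python
from dataclasses import dataclass
from typing import List, Optional

USER_MANUAL, CSV_MODULE_PER_YEAR, CSV_MODULE_UNIT_SPECIFIC = 0, 1, 2


@dataclass
class Entry:
    module_id: int
    value: float
    source: Optional[int]
    created_by_id: Optional[int]


def upload_module_per_year(store: List[Entry], module_id: int,
                           values: List[float], job_id: int) -> List[Entry]:
    # Step 1: drop earlier module_per_year CSV rows for this module only.
    kept = [
        e for e in store
        if not (e.module_id == module_id and e.source == CSV_MODULE_PER_YEAR)
    ]
    # Step 2: insert the new rows, tagged with source and ingestion job id.
    kept.extend(Entry(module_id, v, CSV_MODULE_PER_YEAR, job_id) for v in values)
    return kept


def upload_module_unit_specific(store: List[Entry], module_id: int,
                                values: List[float], job_id: int) -> List[Entry]:
    # No deletion step: rows are appended, so repeated uploads accumulate.
    return store + [Entry(module_id, v, CSV_MODULE_UNIT_SPECIFIC, job_id) for v in values]


store = [
    Entry(456, 1.0, USER_MANUAL, 10),           # manual entry
    Entry(456, 2.0, CSV_MODULE_PER_YEAR, 100),  # earlier CSV upload
]
store = upload_module_per_year(store, 456, [3.0, 4.0], job_id=101)
store = upload_module_unit_specific(store, 456, [5.0], job_id=102)
store = upload_module_unit_specific(store, 456, [5.0], job_id=103)
```

After these calls the manual entry is still present, the row from job 100 has been replaced by the two rows from job 101, and the same unit-specific value appears twice because that path never deletes.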
FACTORS CSV Upload
- Unchanged: Still uses bulk_delete_by_data_entry_type() to drop all factors
- No source tracking needed: Factors are reference data, not user data
Query Examples
-- Get all entries from a specific CSV job
SELECT * FROM data_entries WHERE created_by_id = 123;
-- Get all module_per_year CSV entries for a module
SELECT * FROM data_entries
WHERE carbon_report_module_id = 456
AND source = 1; -- CSV_MODULE_PER_YEAR
-- Count by source
SELECT source, COUNT(*)
FROM data_entries
GROUP BY source;
-- Get entries with unknown source (legacy data)
SELECT * FROM data_entries WHERE source IS NULL;
Testing
Migration applied successfully:
make db-migrate
# Output: Running upgrade 0367f025e8d8 -> 2026031801, add source tracking to data_entries
Model fields verified:
'source' in DataEntry.model_fields # True
'created_by_id' in DataEntry.model_fields # True
Code formatting passed:
make format
# Output: 244 files left unchanged, All checks passed!
Migration Notes
- Existing data has NULL for both source and created_by_id
- This is intentional - we don't guess the origin of legacy data
- New uploads will have proper source tracking
- Queries should handle NULL values for backward compatibility
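The NULL point above matters most for "everything except source X" filters, because under SQL three-valued logic a comparison with NULL is never true. The sketch below demonstrates this with the standard-library `sqlite3` module and a simplified, assumed table; the `count` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_entries (id INTEGER PRIMARY KEY, source INTEGER)")
conn.executemany(
    "INSERT INTO data_entries (source) VALUES (?)",
    [(None,), (0,), (1,), (2,)],  # one legacy row plus three tracked rows
)


def count(where_clause: str) -> int:
    """Hypothetical helper: count rows matching a fixed WHERE clause."""
    sql = f"SELECT COUNT(*) FROM data_entries WHERE {where_clause}"
    return conn.execute(sql).fetchone()[0]


# Naive filter: the legacy row is silently dropped (NULL != 1 is not true).
naive = count("source != 1")

# NULL-aware filter: legacy rows are included explicitly.
null_safe = count("source IS NULL OR source != 1")
```

Here `naive` counts 2 rows while `null_safe` counts 3. On PostgreSQL the same intent can also be written as `source IS DISTINCT FROM 1`.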
Future Enhancements
- API Upload Tracking: Update API endpoints to set source = API_* and created_by_id = user.id
- UI Exposure: Add source fields to admin/backoffice API responses
- Audit Enhancement: Include source in audit trail snapshots
- Data Migration: Optionally backfill source for existing data if audit logs provide clues
Files Modified
- backend/alembic/versions/2026_03_18_1400-2026031801_add_source_tracking_to_data_entries.py (NEW)
- backend/app/models/data_entry.py
- backend/app/repositories/data_entry_repo.py
- backend/app/services/data_entry_service.py
- backend/app/services/data_ingestion/base_csv_provider.py
Success Criteria
✅ Migration runs successfully
✅ Model fields added and accessible
✅ Repository method for source-based deletion
✅ Service layer supports source tracking
✅ CSV providers set source automatically
✅ MODULE_PER_YEAR deletes only same-source entries
✅ MODULE_UNIT_SPECIFIC adds without deletion
✅ Code passes formatting checks
✅ All lint errors resolved