310-f Ingestion per-row efficiency¶
Status: proposed. Companion to [[310-e-async-loop-and-pipeline-worker]]. 310-e makes the loop yield (so liveness survives) and report progress; 310-f makes it faster — but only after measuring.
What the first cut got wrong¶
The original 310-f listed line-level "suspects" and an attempt was made to fix them. It moved the needle by ~0s. Two mistakes, both from assuming instead of measuring:
- Purchase rows aren't
memberrows. The batched-uniqueness win (#2) only replacescheck_institutional_id_unique, which fires only fordata_entry_type_id == member. Purchase common data isother_purchases/consumable_accessories/scientific_equipment/… — that path never executed. Dead optimization for this file. - The yield fix was never a speed fix. Yielding every 100 rows (310-e) stops the liveness restart; it does not make the job faster. "Still 34s" is the correct outcome of that change, not a failure.
Lesson: profile before optimizing. This rewrite makes that the first step.
Step 0 (shipped): a row-loop profiler¶
base_csv_provider now logs one diagnostic line at the end of the parse phase, on real data / real DB / real handlers. The actual line for the 9555-row purchase upload (dev, 2026-06-16):
Row-loop profile: 9555 rows in 34.5s (3.61 ms/row) | row=34.2s
[resolve=33.4 validate=0.2 enrich=0.0 populate=0.0 row_other=0.6]
loop_overhead=0.3s
Buckets (wall time, accumulated per row via time.perf_counter):
| Bucket | Covers |
|---|---|
resolve | _resolve_handler_and_validate (handler + type resolution) |
validate | handler.validate_create (Pydantic DTO) |
enrich | handler.enrich_csv_row (per-handler hook; DB for some) |
populate | ModuleHandlerService(...) + populate_defaults |
row_other | rest of _process_row: filter, factor-key build, DataEntry() |
loop_overhead | outside _process_row: CSV parse, blank skips, progress |
A DB-bound segment (hidden lazy-load N+1) shows as a large bucket dominated by await time; a CPU-bound segment shows as large under no concurrency. The breakdown tells which, without guessing.
Result: resolve was 33.4s of 34.5s — and the fix¶
The trace ended the guessing. Not Pydantic (validate=0.2), not DB (populate/enrich≈0): 97% of the time was resolve, in lookup_data_entry_type_by_kind (seed_helper.py). For every row it linearly scans all ~23k factor keys to infer the row's data_entry_type from its kind:
keys = [k for k in factors_map.keys() if k.__contains__(f":{kind}:")] # per type, per row
That's O(rows × factors) ≈ 9555 × 23k ≈ 220M substring checks ≈ 33s — so a CPU bump barely helps; it's an algorithm problem. (The original "23k factors" hunch was right, just via a per-row rescan, not memory.)
Fix (shipped): the inference is a pure function of (kind, subkind), which repeat heavily across rows, so memoise it per file on setup_result — O(rows × factors) → O(distinct kinds × factors). Same result, computed once per distinct kind. Mirrors the existing #1415 memoisation of the per-type split. Regression test: test_type_inference_is_memoised_per_kind.
Re-measured (2026-06-16): 34.5s → 1.7s — a 20× drop on the same 9555-row file (resolve 33.4s → 1.0s):
Row-loop profile: 9555 rows in 1.7s (0.17 ms/row) | row=1.6s
[resolve=1.0 validate=0.2 enrich=0.0 populate=0.0 row_other=0.5] loop_overhead=0.0s
The memo is sufficient at the current data shape — distinct kinds are few, so the per-distinct-kind scan that remains (the 1.0s) is cheap. Possible follow-up, not currently needed: if a file ever has near-unique kinds per row (memo ineffective), build a one-time kind → types index where the factor pools are computed and make each lookup O(1). lookup_data_entry_type_by_kind has exactly one caller, so that refactor is self-contained. Deferred — 1.7s is fine.
The method below stays on record for future buckets.
Step 1: read the trace, then pick the target¶
Run one real import (the purchase file that took 34s) and read the line. Then, and only then, optimize the dominant bucket. Candidate fixes per bucket — none are committed; the trace decides which (if any) apply:
| If dominant | Likely cause | Candidate fix |
|---|---|---|
validate | Pydantic re-validates every row | Lighter validation (model_construct for trusted fields), or validate-by-column-signature memo |
populate | ModuleHandlerService built per row (base_csv_provider.py:1298); FactorService in its ctor | Hoist the service out of the loop (build once, reuse) |
resolve | per-row handler/type lookup | Cache resolved handler by type for the file |
enrich | a handler doing a per-row DB lookup | Batch the lookup (pre-load once) — note: purchase handlers are likely no-op here |
row_other | filtered_row dict-comp + factor-key f-strings + DataEntry() per row | Hoist invariants; build keys with tuples/join |
loop_overhead | CSV re-parse / progress writes | Parse once; confirm progress is throttled |
Static reading (2026-06-15) found no explicit per-row DB query in the purchase path — factor lookup is in-memory by design, module id from an in-memory map, year from _year_cache, and populate_defaults reads an in-memory factor.values. So the prior is CPU-bound (Pydantic + object construction). The trace must confirm before any change — static reading cannot see a SQLAlchemy lazy-load N+1.
The dev CPU confound — measure on real headroom¶
The 34s was measured on a dev pod requesting 250m CPU. A 0.9 ms/row job at one core reads as ~3.6 ms/row at a throttled quarter-core. Before concluding "the code is slow," reproduce the trace where CPU isn't throttled (local, or a right-sized pod). Part of the win may be CPU allocation / the dedicated worker ([[310-e-async-loop-and-pipeline-worker]]), not constant-factor code work.
Success criteria¶
- A
Row-loop profileline captured before and after any change, same file, comparable CPU. - Target: dominant bucket ≥2× faster, ingest results byte-identical on a fixture (row counts, errors, emissions).
- No regression to the 310-e yields (loop stays responsive).
Out of scope¶
The structural move (worker deployment, threadpool offload) is in [[310-e-async-loop-and-pipeline-worker]]. This plan is constant-factor work on the in-loop path, gated on the trace.