CSV Bulk Import Automation

Multi-unit restaurant operators live and die by the spreadsheet. Culinary teams distribute weekly recipe updates, ingredient substitutions, and seasonal menu changes as CSV exports from distributor portals, recipe-management platforms, and hand-maintained pricing sheets — and every one of those files has to reach the costing engine in a shape it can trust. This page sits inside the broader Data Ingestion & Recipe Parsing Workflows domain and scopes down to one specific failure surface: turning heterogeneous, human-authored CSV files into a validated, unit-normalized, atomically committed cost dataset without introducing silent drift. Get this boundary wrong and every downstream margin number is mathematically unsound; the failure stays invisible until a monthly reconciliation surfaces a variance nobody can explain.

The problem is not “read a CSV.” It is enforcing a data contract on files that were never designed to satisfy one — inconsistent decimal separators, locale-specific number formatting, missing yields, ambiguous units, and legacy SKU formats — while keeping the pipeline fast enough to process tens of thousands of rows during an off-peak window. This is a deterministic ingestion problem, and it is solved with a strict schema gate, a canonical unit layer, streaming execution, and transactional persistence, in that order.

Data Contract: Inputs, Constraints, and the Output Guarantee

Before any code runs, the contract has to be explicit. The input is a CSV whose rows each describe one recipe line item: a location_id, a sku, an ingredient_name, a raw purchase quantity and its unit_of_measure, a vendor_cost, and an optional yield_pct. The constraints are that location_id matches a known site pattern, sku is within a fixed length band, raw_quantity is a positive decimal, vendor_cost is a non-negative decimal, units belong to a closed vocabulary, and yield is a percentage in (0, 100]. The output guarantee is the load-bearing part: every row that leaves this subsystem is either committed to recipe_cost_staging in canonical base units with a computed cost_per_unit, or it is quarantined with a machine-readable reason code. Nothing is silently coerced, dropped, or zero-filled.

That output contract is what lets the rest of the estate build on this layer. The cost roll-up in core architecture and cost mapping systems consumes only validated staging rows, and the higher-frequency weekly CSV menu update workflow reuses the same schema gate rather than re-implementing validation per feed. A CSV that fails the contract never reaches a margin calculation.

Architecture Decision: Why a Hard Schema Gate and a Canonical Unit Layer

Two decisions define this pipeline, and both are deliberate choices against a more permissive alternative.

The first is reject-at-the-boundary over coerce-and-continue. It is tempting to let pandas infer dtypes and fill blanks, because it always “works” on the happy path. But implicit coercion is exactly how a European 1,50 becomes 1 instead of 1.50, how a blank yield becomes NaN and poisons a division, and how a stray header row becomes a phantom SKU. A hard validation contract trades a slightly noisier ingest — some rows get bounced back to the culinary team — for the guarantee that whatever passes is arithmetically sound. On a financial dataset that trade is not close.

The second is canonicalize units at ingestion, not at calculation. The same ingredient arrives as lbs, kg, oz, or case depending on which distributor exported the sheet. If the pipeline stores those raw units and defers conversion, every downstream consumer has to re-implement the unit matrix, and they will inevitably diverge. Converting once, at the boundary, to a base unit (grams / milliliters) means the calculation engine only ever sees one unit system. This is the same unit canonicalization discipline the cost architecture depends on, and it removes an entire class of double-conversion bugs from the hot path.

A third, smaller decision worth naming: Polars streaming over row-by-row pandas iteration. A file of 50,000 rows or more that is fully materialized and then looped over in Python is slow and can exhaust memory on constrained hosts. Lazy scanning with a column-vectorized transform pushes the work into the query engine and keeps the Python-side footprint small. For feeds large enough to need concurrency across many files, the import is handed off to the async batch processing workflow rather than run inline.
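The bounded-memory idea can be illustrated without Polars at all. The sketch below uses only the standard library, so it is a stand-in for the technique rather than the pipeline's actual reader; the function name iter_csv_chunks and the sample columns are illustrative assumptions.

```python
import csv
import io
from itertools import islice
from typing import Iterator, TextIO


def iter_csv_chunks(handle: TextIO, chunk_size: int = 10_000) -> Iterator[list[dict[str, str]]]:
    """Yield lists of at most chunk_size rows; only one chunk is held at a time."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Tiny in-memory file standing in for a distributor export
sample = io.StringIO("sku,raw_quantity\nA100,1\nA101,2\nA102,3\n")
chunk_sizes = [len(chunk) for chunk in iter_csv_chunks(sample, chunk_size=2)]
```

Peak memory here is bounded by chunk_size rather than by file length, which is the property the streaming decision is after.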

Phase 1 — Schema Enforcement and Validation Contracts

Raw CSV exports rarely align with the internal costing database. Missing headers, mixed decimal separators, locale number formatting, and unstandardized yields introduce silent errors that compound across thousands of SKUs. The ingestion layer opens with a Pydantic v2 validation contract that rejects malformed rows before they enter the transformation stage. Monetary and quantity fields parse through Decimal so no binary-float rounding drift enters at the door.

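from decimal import Decimal, InvalidOperation
from typing import Optional

from pydantic import BaseModel, Field, field_validator

class RecipeLineItem(BaseModel):
    location_id: str = Field(..., pattern=r"^[A-Z]{3}-\d{4}$")
    sku: str = Field(..., min_length=4, max_length=20)
    ingredient_name: str
    raw_quantity: Decimal = Field(..., gt=0)
    unit_of_measure: str = Field(..., pattern=r"^(lbs|kg|oz|g|ml|l|case|ea)$")
    vendor_cost: Decimal = Field(..., ge=0)
    # Yield must lie in (0, 100]: a zero yield would divide by zero downstream
    yield_pct: Optional[Decimal] = Field(default=Decimal("100"), gt=0, le=100)

    @field_validator("vendor_cost", "raw_quantity", mode="before")
    @classmethod
    def normalize_decimals(cls, v: str | float) -> Decimal:
        # Accept a decimal comma ("1,50"); thousands separators are not handled here
        if isinstance(v, str):
            v = v.strip().replace(",", ".")
        try:
            return Decimal(str(v))
        except InvalidOperation as exc:
            # Raise ValueError so Pydantic reports a validation error for the row
            raise ValueError(f"not a decimal number: {v!r}") from exc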

This layer is a hard gate. Rows failing type coercion, regex constraints, or logical bounds are quarantined to a structured error log rather than silently coerced. The unit_of_measure pattern doubles as a closed vocabulary — an unexpected unit is a validation failure, not a guess. Deterministic rejection prevents downstream food cost inflation and gives culinary teams precise line-item feedback for remediation instead of a mystery variance three weeks later.

Phase 2 — Deterministic Unit Normalization and Error Routing

Once a row passes schema validation, the pipeline resolves unit discrepancies against a centralized, version-controlled conversion matrix. A single ingredient might appear as lbs, kg, oz, or case; the costing engine expects a base unit. case is deliberately not in the static matrix because a case count depends on pack size, so it requires an explicit multiplier and raises rather than assuming a default — a missing multiplier is a routed error, not a silent zero.

from decimal import Decimal
from typing import Optional

# Centralized conversion matrix (grams/ml base), held as exact Decimals
UNIT_CONVERSIONS: dict[str, Decimal] = {
    "g": Decimal("1"), "kg": Decimal("1000"), "oz": Decimal("28.3495"), "lbs": Decimal("453.592"),
    "ml": Decimal("1"), "l": Decimal("1000"), "ea": Decimal("1"),  # 'case' requires dynamic lookup
}

def normalize_to_base_unit(quantity: Decimal, uom: str, case_multiplier: Optional[int] = None) -> Decimal:
    uom = uom.lower()
    if uom == "case":
        # Pack size varies per SKU, so a missing multiplier is an error, never a default
        if case_multiplier is None or case_multiplier <= 0:
            raise ValueError("Case UOM requires valid item multiplier")
        return quantity * Decimal(case_multiplier)
    factor = UNIT_CONVERSIONS.get(uom)
    if factor is None:
        raise ValueError(f"Unsupported unit: {uom}")
    return quantity * factor

Error routing is the other half of this phase. A validation or normalization failure does not abort the batch — it peels the offending row into a structured quarantine table carrying the raw payload, the failing field, a reason code, and the source file hash. That quarantine record is what a culinary manager reviews and re-submits, and its existence is what keeps a single bad row from blocking an otherwise-clean weekly import. The conversion matrix itself is version-controlled and audited quarterly, because distributor packaging weights and regional measurement standards drift, and an un-audited matrix silently mis-costs every SKU it touches. When reconciling spreadsheet units against OCR-derived measurements from the PDF recipe extraction pipelines, the same matrix provides the shared semantic target so both ingestion vectors converge on identical base units.
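The routing behaviour described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual quarantine writer: QuarantineRecord, RowError, route_rows, and demo_validate are hypothetical names, the in-memory lists stand in for the quarantine table, and the sketch assumes the validator signals failure by raising ValueError or a subclass of it.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Optional


class RowError(ValueError):
    """Hypothetical validation error that names the failing field and a reason code."""

    def __init__(self, field: str, code: str, message: str) -> None:
        super().__init__(message)
        self.field = field
        self.code = code


@dataclass(frozen=True)
class QuarantineRecord:
    file_hash: str
    row_number: int
    field: Optional[str]
    reason_code: str
    detail: str
    raw_payload: str  # original row, serialized so it can be reviewed and re-submitted


def route_rows(
    rows: list[dict[str, Any]],
    validate: Callable[[dict[str, Any]], Any],
    file_hash: str,
) -> tuple[list[Any], list[QuarantineRecord]]:
    """Validate every row; a failure is quarantined and the batch carries on."""
    accepted: list[Any] = []
    quarantined: list[QuarantineRecord] = []
    for row_number, row in enumerate(rows, start=1):
        try:
            accepted.append(validate(row))
        except ValueError as exc:
            quarantined.append(QuarantineRecord(
                file_hash=file_hash,
                row_number=row_number,
                field=getattr(exc, "field", None),
                reason_code=getattr(exc, "code", "VALIDATION_FAILED"),
                detail=str(exc),
                raw_payload=json.dumps(row, sort_keys=True, default=str),
            ))
    return accepted, quarantined


def demo_validate(row: dict[str, Any]) -> dict[str, Any]:
    # Stand-in for the schema gate: only the closed unit vocabulary is checked
    if row.get("unit_of_measure") not in {"lbs", "kg", "oz", "g", "ml", "l", "case", "ea"}:
        raise RowError("unit_of_measure", "UNKNOWN_UNIT", f"unsupported unit {row.get('unit_of_measure')!r}")
    return row


demo_rows = [
    {"sku": "A100", "unit_of_measure": "kg"},
    {"sku": "A101", "unit_of_measure": "bushel"},
]
accepted, quarantined = route_rows(demo_rows, demo_validate, file_hash="demo-hash")
```

The one bad row lands in quarantined with its field and reason code while the clean row is accepted, so a single failure does not block the rest of the import.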

Phase 3 — Streaming Transformation and Downstream Handoff

The bulk processing routine hands large files downstream in chunks rather than as one monolithic frame. Polars lazy evaluation builds a query plan for casting and bound filtering, and the collected result is cut into deterministic slices. Each chunk is then deduplicated, yield-adjusted, and given a cost_per_unit before being staged.

import polars as pl
from typing import AsyncIterator

# `persist_to_staging` is defined in the next block.

async def process_csv_stream(file_path: str, chunk_size: int = 10_000) -> AsyncIterator[pl.DataFrame]:
    # Lazy scan: nothing is read until the plan is collected
    lazy_df = pl.scan_csv(file_path)
    # Schema casting and bound filtering run inside the query plan
    normalized = lazy_df.with_columns([
        pl.col("raw_quantity").cast(pl.Float64),
        pl.col("vendor_cost").cast(pl.Float64),
        # Cast first, then default blanks to 100 so the fill value matches the dtype
        pl.col("yield_pct").cast(pl.Float64).fill_null(100.0),
    ]).filter(
        (pl.col("raw_quantity") > 0) & (pl.col("vendor_cost") >= 0)
    )

    # Note: collect() materializes the filtered frame; only the handoff below is chunked
    df = normalized.collect()
    for start_idx in range(0, len(df), chunk_size):
        yield df.slice(start_idx, chunk_size)

async def batch_transform_chunks(file_path: str):
    async for chunk in process_csv_stream(file_path):
        # Deduplicate on the natural key (last occurrence wins), then apply yield and derive cost
        transformed = chunk.unique(
            subset=["location_id", "sku"], keep="last", maintain_order=True
        ).with_columns([
            (pl.col("raw_quantity") * (pl.col("yield_pct") / 100.0)).alias("net_quantity"),
        ]).with_columns([
            (pl.col("vendor_cost") / pl.col("net_quantity")).alias("cost_per_unit")
        ])
        await persist_to_staging(transformed)

The net_quantity derivation applies the submitted yield to translate a purchased weight into the usable portion the recipe actually consumes — the same principle formalized in the yield factor calculation frameworks. Streaming execution aligns with the async patterns behind implementing Celery for async menu syncs, enabling concurrent I/O for vendor lookups and database writes without blocking the event loop.
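A worked example makes the yield arithmetic concrete. The figures are hypothetical (a 10 lb purchase at 45.00 with an 80% yield), and the sketch assumes the purchased quantity is converted to grams before the yield is applied; it uses Decimal throughout so the intermediate values are exact.

```python
from decimal import Decimal, ROUND_HALF_UP

GRAMS_PER_LB = Decimal("453.592")  # same factor as the conversion matrix

# Hypothetical line item
purchased_lbs = Decimal("10")
vendor_cost = Decimal("45.00")
yield_pct = Decimal("80")

base_grams = purchased_lbs * GRAMS_PER_LB               # 4535.92 g purchased
net_grams = base_grams * (yield_pct / Decimal("100"))   # 3628.736 g usable
cost_per_gram = vendor_cost / net_grams                 # full precision, kept unrounded

# Rounding is applied only for display, never before storage
rounded_for_display = cost_per_gram.quantize(Decimal("0.000001"), rounding=ROUND_HALF_UP)
print(rounded_for_display)  # 0.012401
```

Ignoring the yield would divide by 4535.92 g instead and understate the cost per usable gram by a fifth, which is the kind of drift the yield adjustment exists to prevent.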

The handoff itself is transactional. Each call to persist_to_staging writes its rows to the menu-engineering database inside one atomic transaction that rolls back entirely if any row breaches a constraint. Because batch_transform_chunks calls it once per chunk, the all-or-nothing guarantee covers a chunk as written; extending it to a whole file means holding one transaction open across every chunk of that file. This staging discipline is what lets live sales telemetry from the POS API polling strategies domain reference updated cost baselines without ever reading a partial write.

import asyncpg
import polars as pl
from contextlib import asynccontextmanager
from decimal import Decimal

@asynccontextmanager
async def atomic_db_transaction(dsn: str):
    # All-or-nothing wrapper: commit on a clean exit, roll back on any exception
    conn = await asyncpg.connect(dsn)
    try:
        await conn.execute("BEGIN")
        yield conn
        await conn.execute("COMMIT")
    except Exception:
        await conn.execute("ROLLBACK")
        raise
    finally:
        await conn.close()

async def persist_to_staging(df: pl.DataFrame):
    # Convert transform-stage floats to Decimal so the NUMERIC columns receive base-10 values
    rows = [
        (r["location_id"], r["sku"], Decimal(str(r["net_quantity"])), Decimal(str(r["cost_per_unit"])))
        for r in df.to_dicts()
    ]
    # Placeholder DSN; real credentials belong in configuration, not in source
    async with atomic_db_transaction("postgresql://user:pass@host/menu_engineering") as conn:
        await conn.executemany(
            """
            INSERT INTO recipe_cost_staging (location_id, sku, net_quantity, cost_per_unit, import_ts)
            VALUES ($1, $2, $3, $4, NOW())
            ON CONFLICT (location_id, sku) DO UPDATE SET
                net_quantity = EXCLUDED.net_quantity,
                cost_per_unit = EXCLUDED.cost_per_unit,
                import_ts = NOW()
            """,
            rows,
        )

The ON CONFLICT ... DO UPDATE upsert makes the whole import idempotent: re-running the same file overwrites stale costs on the natural key (location_id, sku) without creating duplicates or triggering premature margin recalculation. The staging column is declared NUMERIC so the value the roll-up reads is exact base-10, and rounding happens only at the reporting boundary, never mid-pipeline.

Production Hardening

An import that is correct on a clean file but brittle under real feed conditions will still erode trust. The following controls keep bulk ingestion dependable at estate scale.

Idempotency keys. Beyond the natural-key upsert, hash each source file (SHA-256 over the raw bytes) and record it before processing. A re-submitted identical file is a no-op; a changed file with the same name is caught because its hash differs. This is the same payload-hash discipline used across the ingestion domain.
Deduplication within a batch. Distributor exports frequently repeat a SKU across multiple sheet sections. Deduplicate on (location_id, sku) inside the chunk — keeping the last occurrence — before the upsert, so intra-file duplicates do not fight each other row-by-row inside a single transaction.
Memory discipline. Cap concurrent chunk workers by available RAM, keep the Polars scan lazy, and cast wide string columns to categorical dtype. A single nightly run should never approach an OOM kill on legacy infrastructure.
Bounded retries. Wrap external vendor lookups and DB commits with exponential backoff, and retry only idempotent-safe failures (HTTP 429/503, transient connection drops). Permanent failures — constraint violations, schema breaches — route straight to quarantine and never retry blindly.
Auditability. Log row counts, validation-failure rates, the file hash, and the final committed SKU set per run, and hold raw imports for 90 days to support financial reconciliation. Structured JSON logs carrying a batch_id let a quarantined row be traced from ingest to commit without grepping free text.
Scheduling. Run imports during off-peak windows when POS traffic is minimal, coordinated with the weekly CSV menu update schedule so cost baselines settle before the next sales day opens.
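The bounded-retries control in the list above can be sketched as a small helper. This is one possible shape under stated assumptions: TransientError is a hypothetical marker the caller raises for retry-safe failures (a 429/503 response or a dropped connection), and the attempt count and delays are illustrative defaults, not tuned values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def retry_with_backoff(
    operation: Callable[[], T],
    *,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry only TransientError, doubling the delay each time; other errors propagate at once."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    attempt = 1
    while True:
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted, surface the failure
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads out retries from workers that failed together
            sleep(delay + random.uniform(0, delay / 2))
            attempt += 1


# Usage: an operation that fails twice with a transient error, then succeeds
calls = {"n": 0}
recorded_delays: list[float] = []


def flaky_lookup() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("HTTP 503")
    return "ok"


result = retry_with_backoff(flaky_lookup, sleep=recorded_delays.append)
```

Constraint violations and schema breaches are not TransientError, so they bypass the loop entirely and can be routed straight to quarantine.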

Failure Modes and Troubleshooting

Most CSV import failures are silent — the job reports success while quietly corrupting cost. These are the patterns to detect deliberately.

Divide-by-zero on yield. A yield_pct of 0 (or a blank that fills to 0 instead of 100) makes net_quantity zero and cost_per_unit explode to infinity or NaN. The schema gate’s gt=0 / le=100 bound is the primary defense; treat any post-transform null or inf in cost_per_unit as a quarantine trigger, never a committable row.
Decimal-separator drift. A locale that writes 1.234,56 will be mis-parsed as 1.234 unless normalized. The normalize_decimals validator handles the common ,→. case for values such as 1,50, but it does not strip thousands separators: 1.234,56 becomes 1.234.56, which fails to parse as a Decimal. Thousands separators therefore need explicit handling per feed, and any vendor_cost that lands more than an order of magnitude off its historical value should be flagged.
Silent float coercion. The Polars transform casts to Float64 for speed, which is fine for filtering and yield math but must not be the type that reaches a financial report. Persist to NUMERIC and let the downstream roll-up own exact arithmetic; keep floats confined to the transform stage.
Phantom SKUs from stray rows. Header repetitions, footer totals, and merged-cell artifacts sneak in as rows that pass loose parsing. The sku length band and location_id pattern reject most; surface unmapped SKUs against the master registry before commit rather than after, mirroring the indicator=True merge discipline used in the parent pipeline.
Partial-write races. Without the atomic transaction, a mid-batch failure leaves the staging table half-updated and a concurrent reader sees inconsistent costs. The all-or-nothing BEGIN/COMMIT/ROLLBACK wrapper is what prevents this; a failed batch must leave zero committed rows.
Case-pack ambiguity. A case unit with no pack multiplier is the most common normalization failure. It raises by design — resolve it by joining the SKU to a pack-size lookup before ingest, not by defaulting the multiplier to 1.
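For the decimal-separator item in the list above, per-feed handling might look like the following sketch. The function parse_decimal and its keyword arguments are illustrative assumptions; the idea is that each feed declares its own separators instead of the parser guessing.

```python
from decimal import Decimal, InvalidOperation


def parse_decimal(text: str, *, decimal_sep: str = ".", thousands_sep: str = ",") -> Decimal:
    """Parse a number written with the feed's declared separators into an exact Decimal."""
    if decimal_sep == thousands_sep:
        raise ValueError("decimal and thousands separators must differ")
    # Strip grouping first, then map the feed's decimal mark to "."
    cleaned = text.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    try:
        value = Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"not a decimal number: {text!r}") from exc
    if not value.is_finite():
        raise ValueError(f"not a finite number: {text!r}")
    return value


# A European-locale feed declares "," as the decimal mark and "." for grouping
european = parse_decimal("1.234,56", decimal_sep=",", thousands_sep=".")
short_form = parse_decimal("1,50", decimal_sep=",", thousands_sep=".")
us_style = parse_decimal("1,234.56")
```

Anything that still fails to parse raises ValueError, so it can be quarantined like any other contract breach instead of being coerced.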

Once staged, these cost baselines feed the variance mapping methodologies that separate genuine operational drift from ingest artifacts — which is only possible when the import layer guarantees that a variance is real and not a parsing bug.

Frequently Asked Questions

Why validate with Pydantic instead of letting pandas infer types on read?

Type inference optimizes for “the read never fails,” which is the opposite of what a financial pipeline needs. Inference will happily turn 1,50 into 1, a blank yield into NaN, and a footer total into a data row — all without an error. A Pydantic contract makes every constraint explicit and turns a bad row into a routed, reviewable quarantine record instead of a silent miscalculation. You trade a noisier ingest for arithmetic you can defend in an audit.

Should unit conversion happen at import or at cost-calculation time?

At import. Converting once, at the boundary, means the calculation engine only ever sees one unit system, which removes double-conversion and region-alias bugs from the hot path. Deferring conversion forces every downstream consumer to re-implement the same matrix, and they inevitably diverge. The canonical base-unit dataset is also what makes costs directly comparable across recipes and locations.

Why Polars streaming rather than reading the whole file with pandas?

Files above roughly 50,000 rows can exhaust memory on constrained hosts if fully materialized and iterated row by row in Python. Polars lazy scanning builds a vectorized query plan and hands the result downstream in deterministic slices; memory stays flat only as long as the plan is executed in batches or with the streaming engine rather than collected whole. Row-by-row pandas iteration is both slower and memory-heavier, and it invites the exact per-row bugs the vectorized transform avoids. For many files at once, hand the work to the async batch layer.

What makes the import idempotent so a re-run does not double-count?

Two things. The staging upsert keys on the natural (location_id, sku) pair with ON CONFLICT ... DO UPDATE, so a repeated row overwrites rather than duplicates. And each source file is hashed before processing, so an identical re-submission is a no-op. Together they mean a retried or double-fired import converges to the same state instead of inflating cost.
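The file-hash half of that answer can be sketched with the standard library. The in-memory set below is a stand-in for whatever durable ledger records processed hashes, and sha256_of_file and should_process are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash the raw bytes in fixed-size blocks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def should_process(path: Path, seen_hashes: set[str]) -> bool:
    """Record the hash before processing; an identical re-submission returns False."""
    file_hash = sha256_of_file(path)
    if file_hash in seen_hashes:
        return False
    seen_hashes.add(file_hash)
    return True


with tempfile.TemporaryDirectory() as tmp:
    feed = Path(tmp) / "weekly.csv"
    seen: set[str] = set()

    feed.write_bytes(b"sku,vendor_cost\nA100,4.50\n")
    first_run = should_process(feed, seen)      # new content
    resubmitted = should_process(feed, seen)    # identical bytes

    feed.write_bytes(b"sku,vendor_cost\nA100,4.75\n")
    changed_same_name = should_process(feed, seen)  # same name, different bytes
```

Keying on content rather than filename is what lets a changed file with an unchanged name be processed while a byte-identical duplicate is skipped.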

How is a divide-by-zero from a zero yield prevented?

The schema gate bounds yield_pct to (0, 100], and blanks default to 100 rather than 0, so net_quantity can never be zero on a validated row. As a second line of defense, any post-transform cost_per_unit that comes back null or inf is treated as a quarantine trigger and never committed. A zero-yield row is a data error to be reviewed, not a value to be stored.
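The second line of defense can be sketched as a post-transform guard. The helper name split_committable and the NON_FINITE_COST reason code are illustrative assumptions; rows are plain dictionaries here for brevity.

```python
import math
from typing import Any


def split_committable(rows: list[dict[str, Any]]) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Separate rows with a finite cost_per_unit from rows that must be quarantined."""
    committable: list[dict[str, Any]] = []
    quarantined: list[dict[str, Any]] = []
    for row in rows:
        cost = row.get("cost_per_unit")
        # None, NaN, and +/-inf are all signs of a zero or missing net quantity
        if cost is None or not math.isfinite(cost):
            quarantined.append({**row, "reason_code": "NON_FINITE_COST"})
        else:
            committable.append(row)
    return committable, quarantined


transformed_rows = [
    {"sku": "A100", "cost_per_unit": 0.0124},
    {"sku": "A101", "cost_per_unit": float("inf")},   # vendor_cost / 0
    {"sku": "A102", "cost_per_unit": float("nan")},   # 0 / 0
    {"sku": "A103", "cost_per_unit": None},
]
good_rows, bad_rows = split_committable(transformed_rows)
```

Only the row with a finite cost reaches the commit path; the other three carry a reason code for review.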

Automating Weekly CSV Menu Updates — the high-frequency delta workflow built on this import layer.
Async Batch Processing Workflows — concurrent execution for high-volume, many-file imports.
PDF Recipe Extraction Pipelines — the other ingestion vector that converges on the same canonical schema.
Yield Factor Calculation Frameworks — translating raw purchase weights into usable edible portions.
Variance Mapping Methodologies — the downstream analytics that these staged costs feed.

For deeper implementation reference, consult the official Pydantic documentation on validators and strict types, and the Polars documentation on lazy evaluation and streaming.