Calculating Trim and Yield Factors for Produce

This page walks a food-tech developer or culinary data engineer through the exact log schema, pandas transforms, and validation gates that turn raw prep-scale readings into a single, defensible yield factor per produce SKU per location. It is the hands-on implementation companion to the broader yield factor calculation frameworks; read that first for the architectural rationale on why as-purchased (AP) and edible-portion (EP) weights must reconcile to the same canonical unit, then follow the numbered steps here to stand up a working produce yield pipeline you can run against this week’s prep logs.

The task this solves is narrow and high-leverage: a head of romaine arrives at 620 g on the invoice but plates at 430 g after coring, trimming, and outer-leaf loss. That 190 g gap is real cost that never reaches a guest, and if the recipe BOM cost roll-up uses the AP weight it will understate food cost on every salad. The yield factor closes that gap deterministically instead of by chef intuition.
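
The arithmetic behind that gap can be sketched in a few lines. The two weights are the romaine figures above; the invoice cost per head is a hypothetical value chosen only to make the effect visible.

```python
from decimal import Decimal

ap_weight_g = Decimal("620")         # as-purchased weight from the example above
ep_weight_g = Decimal("430")         # edible-portion weight after trimming
ap_cost_per_head = Decimal("1.86")   # hypothetical invoice cost, illustration only

yield_factor = ep_weight_g / ap_weight_g       # share of the head that is usable
trim_loss_factor = 1 - yield_factor            # share lost on the board

cost_per_g_ap = ap_cost_per_head / ap_weight_g  # what the invoice implies per gram
cost_per_g_ep = cost_per_g_ap / yield_factor    # what a usable gram actually costs

print(f"yield factor:      {yield_factor:.4f}")
print(f"trim loss factor:  {trim_loss_factor:.4f}")
print(f"cost per g (AP):   {cost_per_g_ap:.5f}")
print(f"cost per g (EP):   {cost_per_g_ep:.5f}")
```

Costing the salad at the AP rate prices each usable gram at about 69% of what it really costs, which is the understatement the yield factor corrects.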

Prerequisites and Data Contract

Every step below is written against the following environment and structural assumptions. If they drift, the transforms will silently produce a plausible-but-wrong factor rather than erroring.

Python 3.11+, pandas 2.x, and numpy 1.26+.
PostgreSQL 13+ to hold the append-only prep yield log. Weights are stored as NUMERIC, never binary floats.
One weighing discipline per SKU: the same physical stage boundary (post-trim, pre-cook) is captured every time, so the ratio measures trim loss and not thermal shrink. Cook-loss yield belongs to a separate factor and a separate pipeline.
All weights normalized to a single base unit (grams) at ingestion. Regional aliases (oz, lb, each) are canonicalized upstream by the unit canonicalization layer, never pattern-matched here.

The data contract is a single append-only table. Each row is one prep event: a scale reading of what went onto the board (ap_weight_g) and what came off it as usable product (ep_weight_g), tagged by SKU, location, and date. Nothing is aggregated at write time — the log stays a raw, auditable record and every factor is derived on read.

CREATE TABLE prep_yield_log (
    event_id      BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    produce_sku   TEXT        NOT NULL,
    location_id   TEXT        NOT NULL,
    prep_date     DATE        NOT NULL,
    ap_weight_g   NUMERIC(10,2) NOT NULL CHECK (ap_weight_g > 0),
    ep_weight_g   NUMERIC(10,2) NOT NULL CHECK (ep_weight_g >= 0),
    prep_method   TEXT        NOT NULL DEFAULT 'standard',
    logged_by     TEXT        NOT NULL,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_yield_cohort ON prep_yield_log (produce_sku, location_id, prep_date);

Step-by-Step Implementation

The yield factor is not a static constant. It is a time-bound, location-specific metric that must survive seasonal moisture variance, supplier substitutions, and prep-method drift. The steps below build that resilience one gate at a time. Each block is self-contained and runs against a DataFrame loaded from prep_yield_log.

Step 1 — Load the log and pin types

Read the raw events into a typed frame and sort within each cohort so the later rolling window is temporally coherent. Sorting once here means no step downstream has to re-sort.

import pandas as pd
import numpy as np

COHORT = ["produce_sku", "location_id"]

def load_yield_log(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["prep_date"] = pd.to_datetime(df["prep_date"])
    df["ap_weight_g"] = pd.to_numeric(df["ap_weight_g"], errors="coerce")
    df["ep_weight_g"] = pd.to_numeric(df["ep_weight_g"], errors="coerce")
    return df.sort_values([*COHORT, "prep_date"]).reset_index(drop=True)

Step 2 — Apply the physical-constraint gate

This is the first and most important gate. Any row where AP weight is non-positive, EP is negative, or EP exceeds AP is physically impossible — it signals a scale calibration fault or a transposed reading. Such rows get a NaN raw yield so they never poison the aggregate, while valid rows get the deterministic ratio.

def compute_raw_yield(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    invalid = (
        (df["ap_weight_g"] <= 0)
        | (df["ep_weight_g"] < 0)
        | (df["ep_weight_g"] > df["ap_weight_g"])
    )
    df["raw_yield"] = np.where(
        invalid, np.nan, df["ep_weight_g"] / df["ap_weight_g"]
    )
    df["is_quarantined"] = invalid
    return df
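
To see what the gate's mask does and does not catch, here is a small sketch on hypothetical readings. It reuses the same three comparisons as the step above. One detail worth knowing: comparisons against NaN evaluate to False, so a weight that failed numeric coercion in Step 1 is not flagged by the mask, although its raw yield still comes out as NaN because the division propagates it.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: a valid prep, a transposed reading (EP > AP),
# a zero AP weight, and an AP weight that was coerced to NaN upstream.
readings = pd.DataFrame({
    "ap_weight_g": [620.0, 430.0, 0.0, np.nan],
    "ep_weight_g": [430.0, 620.0, 0.0, 410.0],
})

# Same physical-constraint mask as in the step above.
impossible = (
    (readings["ap_weight_g"] <= 0)
    | (readings["ep_weight_g"] < 0)
    | (readings["ep_weight_g"] > readings["ap_weight_g"])
)
# Rows with a missing weight need their own check if you want to count them.
missing = readings[["ap_weight_g", "ep_weight_g"]].isna().any(axis=1)

raw_yield = np.where(
    impossible, np.nan, readings["ep_weight_g"] / readings["ap_weight_g"]
)

print(impossible.tolist())  # [False, True, True, False]
print(missing.tolist())     # [False, False, False, True]
print(raw_yield)
```

If unparseable weights should be tracked separately from scale faults, a second flag column built from a mask like `missing` is one option; that column name is an assumption, not part of the schema above.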

Step 3 — Cap outliers per SKU-location cohort

A single mis-portioned batch should not swing the factor. Clip each cohort’s raw yields to ±2.5 standard deviations, but only once the cohort has enough observations to make a spread meaningful. Cohorts with fewer than three valid points pass through untouched — you cannot detect an outlier in noise.

SIGMA_CAP = 2.5

def cap_cohort_outliers(s: pd.Series) -> pd.Series:
    valid = s.dropna()
    if valid.shape[0] < 3:
        return s
    mu, sigma = valid.mean(), valid.std()
    if sigma == 0 or pd.isna(sigma):
        return s
    return s.clip(lower=mu - SIGMA_CAP * sigma, upper=mu + SIGMA_CAP * sigma)

def cap_outliers(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["capped_yield"] = df.groupby(COHORT)["raw_yield"].transform(cap_cohort_outliers)
    return df
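
One property of this cap is worth checking numerically. With the sample standard deviation that pandas computes by default (ddof=1), a single point can sit at most (n - 1)/sqrt(n) standard deviations from its cohort mean. That bound stays under 2.5 until n reaches 9, so the clip cannot change any value in a smaller cohort. A minimal sketch with hypothetical yields:

```python
import math

import pandas as pd

SIGMA_CAP = 2.5  # same threshold as the step above


def max_z_score(n: int) -> float:
    """Largest z-score a single outlier reaches in a cohort of n readings."""
    # Hypothetical cohort: n - 1 typical yields plus one extreme low reading.
    s = pd.Series([0.70] * (n - 1) + [0.10])
    return float((s - s.mean()).abs().max() / s.std())  # std() defaults to ddof=1


for n in range(3, 12):
    z = max_z_score(n)
    bound = (n - 1) / math.sqrt(n)
    print(f"n={n:2d}  max z={z:.3f}  bound={bound:.3f}  can clip: {z > SIGMA_CAP}")

# Smallest cohort size at which the 2.5-sigma clip can alter a value.
first_n_that_can_clip = next(
    n for n in range(3, 100) if max_z_score(n) > SIGMA_CAP
)
print(first_n_that_can_clip)
```

In practice this means the three-observation floor only guards against a degenerate spread; for cohorts of three to eight valid points the rolling median in the next step carries the outlier protection.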

Step 4 — Smooth with a rolling median

A median over a trailing window absorbs the day-to-day skew that a mean would amplify. A 30-day window with a five-observation floor lets a supplier switch (whole-case to pre-cut) or a seasonal moisture shift bleed in gradually instead of snapping. Where the window is not yet populated, fall back to the most recent capped value rather than extrapolating from sparse data.

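ROLLING_WINDOW = "30D"  # calendar-day window; an integer here would count rows, not days
MIN_OBS = 5

def roll_yield_factor(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    def _rolling_median(capped: pd.Series) -> pd.Series:
        # Re-index this cohort's yields by prep_date so the window is measured
        # in days. Relies on the per-cohort date sort from Step 1.
        dates = pd.DatetimeIndex(df.loc[capped.index, "prep_date"])
        by_date = pd.Series(capped.to_numpy(), index=dates)
        rolled = by_date.rolling(ROLLING_WINDOW, min_periods=MIN_OBS).median()
        # Hand the values back on the original row index for alignment.
        return pd.Series(rolled.to_numpy(), index=capped.index)

    df["rolling_median_yield"] = df.groupby(COHORT)["capped_yield"].transform(
        _rolling_median
    )
    # Fall back to the row's own capped value while the window is underpopulated.
    df["final_yield_factor"] = np.where(
        df["rolling_median_yield"].notna(),
        df["rolling_median_yield"],
        df["capped_yield"],
    )
    return df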
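
The difference between counting rows and counting days matters for SKUs that are prepped irregularly. The sketch below uses one hypothetical cohort with a five-month gap in its log. An integer window counts the most recent rows regardless of age, while an offset string such as "30D" counts calendar time and requires a DatetimeIndex.

```python
import pandas as pd

# Hypothetical cohort: five January preps, then three June preps after a
# supplier change raised the yield.
dates = pd.to_datetime([
    "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05",
    "2024-06-01", "2024-06-02", "2024-06-03",
])
yields = pd.Series([0.60] * 5 + [0.72] * 3, index=dates)

# Row-count window: the last 30 rows, however old they are.
by_rows = yields.rolling(30, min_periods=5).median()
# Time window: only rows from the trailing 30 calendar days.
by_days = yields.rolling("30D", min_periods=5).median()

# Fallback used in this step: the row's own value while the window is sparse.
final = by_days.fillna(yields)

print(pd.DataFrame({"by_rows": by_rows, "by_days": by_days, "final": final}))
```

On the last row the row-count window still reports the January median of 0.60, whereas the 30-day window has only three readings, returns NaN, and the fallback supplies the current 0.72.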

Step 5 — Derive trim loss and select the current factor

Trim loss is simply the complement of the yield factor. Take the last row per cohort as the live factor that the costing engine reads. This is the value that adjusts every recipe drawing on that SKU.

def finalize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["trim_loss_factor"] = 1.0 - df["final_yield_factor"]
    current = (
        df.dropna(subset=["final_yield_factor"])
        .groupby(COHORT, as_index=False)
        .last()
    )
    return current[[*COHORT, "prep_date", "final_yield_factor", "trim_loss_factor"]]

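# raw_events is a DataFrame read from prep_yield_log, for example with pd.read_sql.
pipeline = load_yield_log(raw_events)
pipeline = compute_raw_yield(pipeline)
pipeline = cap_outliers(pipeline)
pipeline = roll_yield_factor(pipeline)
current_factors = finalize(pipeline)
# to_string() avoids the optional tabulate dependency that to_markdown() requires.
print(current_factors.to_string(index=False))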

The whole chain stays vectorized — groupby and broadcasting keep it at O(n log n) even across millions of scale transactions, and no step mutates external state, so the run is idempotent and safe to schedule nightly.

Verification and Validation

Confirm each gate before trusting a factor downstream.

Impossible rows are quarantined, not dropped. Assert the constraint gate caught what it should: assert pipeline.loc[pipeline["is_quarantined"], "raw_yield"].isna().all(). A quarantined count that trends upward for one SKU points at a drifting scale, not bad produce.
Factors land in a sane band. Every final_yield_factor must satisfy 0 < yf <= 1. For a leafy green, expect roughly 0.60–0.75; a value near 1.0 means someone logged EP into the AP column. Spot-check with current_factors.query("final_yield_factor > 0.98").
Sparse cohorts fall back correctly. For a SKU with fewer than five observations, final_yield_factor should equal its last capped_yield, not NaN. This proves the pipeline degrades gracefully on a new item instead of emitting nothing.
Idempotency. Run the full chain twice on the same snapshot and compare with pd.testing.assert_frame_equal(current_factors, current_factors_rerun). Identical input must yield identical output.
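
The first two checks, plus the per-cohort quarantine rate mentioned above, can be sketched as follows. The two frames are hypothetical stand-ins carrying only the columns the checks touch; in a real run they would be the `pipeline` and `current_factors` objects from Step 5.

```python
import numpy as np
import pandas as pd

COHORT = ["produce_sku", "location_id"]

# Hypothetical pipeline output, reduced to the relevant columns.
pipeline = pd.DataFrame({
    "produce_sku": ["romaine", "romaine", "romaine", "carrot"],
    "location_id": ["store_01"] * 4,
    "raw_yield": [0.69, np.nan, 0.71, 0.82],
    "is_quarantined": [False, True, False, False],
})
current_factors = pd.DataFrame({
    "produce_sku": ["romaine", "carrot"],
    "location_id": ["store_01", "store_01"],
    "final_yield_factor": [0.70, 0.82],
})

# Gate check: every quarantined row must carry a NaN raw yield.
assert pipeline.loc[pipeline["is_quarantined"], "raw_yield"].isna().all()

# Band check: 0 < yf <= 1 for every published factor.
in_band = current_factors["final_yield_factor"].between(0, 1, inclusive="right")
assert in_band.all()

# Share of quarantined events per cohort; watch this for upward drift.
quarantine_rate = pipeline.groupby(COHORT)["is_quarantined"].mean()

# Factors suspiciously close to 1.0 deserve a manual look.
suspicious = current_factors.query("final_yield_factor > 0.98")

print(quarantine_rate)
print(suspicious)
```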

Gotchas and Edge Cases

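yield_factor = 0 divide-by-zero downstream. Effective EP cost is ap_cost / yield_factor, so a zero factor makes that division fail. The physical-constraint gate rejects negative EP readings, but an EP of exactly zero passes both the table's CHECK constraint and the gate, and produces a raw yield of 0. In a sparse cohort that zero skips the outlier cap and the rolling median and lands in final_yield_factor unchanged. Any consumer should therefore guard the division and treat a non-positive factor as an error: never trust an upstream invariant you can cheaply re-check.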
IEEE-754 drift on the money path. The weight ratios here are floats, which is fine for yields. The moment final_yield_factor is multiplied into a unit_cost, switch to decimal.Decimal or PostgreSQL NUMERIC so sub-cent error does not accumulate across a month of covers. That conversion happens in the variance mapping methodologies layer, not here.
Conflating trim loss with waste. This factor measures the yield of a correct prep. Over-trimming, spoilage, and dropped batches are a different signal that belongs in the waste tracking and routing systems. Folding them into the yield factor hides recoverable loss inside a “normal” number and removes the incentive to fix technique.
Cross-location contamination. Never aggregate a factor across locations. A dull knife or a colder walk-in in one store is a real, location-specific cost that cohort isolation surfaces; averaging it away is exactly the margin blindness the multi-location cost center architecture exists to prevent.
Prep-method mixing. A hand-peeled carrot and a machine-peeled carrot are different SKUs for yield purposes even if they share an invoice line. If prep_method varies within a cohort, split it before Step 3 or the median will straddle two real distributions.
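
The first two gotchas meet at the point where a factor is applied to money. A minimal consumer-side sketch follows, assuming a hypothetical helper named `ep_unit_cost` and four-decimal rounding; neither is prescribed by this pipeline.

```python
import math
from decimal import ROUND_HALF_UP, Decimal


def ep_unit_cost(ap_unit_cost: Decimal, yield_factor: float) -> Decimal:
    """Convert an AP unit cost to an EP unit cost, refusing unusable factors.

    Hypothetical helper: the name and rounding precision are illustrative.
    """
    yf = float(yield_factor)
    # Re-check the invariant locally instead of trusting the upstream gate.
    if math.isnan(yf) or not (0 < yf <= 1):
        raise ValueError(f"yield factor out of range: {yf!r}")
    # Go through str() so the Decimal holds the float's short decimal form
    # rather than its full binary expansion.
    factor = Decimal(str(yf))
    return (ap_unit_cost / factor).quantize(
        Decimal("0.0001"), rounding=ROUND_HALF_UP
    )


# Hypothetical AP cost of 3.00 per kg with a 0.6935 yield factor.
print(ep_unit_cost(Decimal("3.00"), 0.6935))

for bad in (0.0, float("nan"), 1.2):
    try:
        ep_unit_cost(Decimal("3.00"), bad)
    except ValueError as exc:
        print("rejected:", exc)
```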

Frequently Asked Questions

Why use a rolling median instead of a simple mean of all history?

A mean weights every historical point equally and is dragged by outliers, so a single mis-portioned batch or a stale reading from six months ago still moves today’s factor. A trailing median over a 30-day window is robust to skew and tracks real change — a seasonal moisture shift or a supplier switch — without letting one bad day dominate. It is the difference between a factor that reflects current reality and one that reflects an average of conditions that no longer exist.

How many observations before I trust an automated factor?

The pipeline needs at least five valid readings in the window before the rolling median engages; below that it falls back to the most recent capped value. In practice, treat a factor as provisional until a SKU-location cohort has accumulated ten to fifteen clean preps. Until then, cross-check against the culinary team’s expected band and flag anything outside it for manual review rather than shipping it straight into costing.
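
One way to make "provisional" explicit is to count clean preps per cohort and flag anything under a threshold. The sketch below assumes a threshold of ten, the lower end of the range above, and uses a hypothetical frame with only the columns it needs.

```python
import numpy as np
import pandas as pd

COHORT = ["produce_sku", "location_id"]
PROVISIONAL_BELOW = 10  # assumption: lower end of the ten-to-fifteen guidance

# Hypothetical pipeline output: 12 romaine events (one quarantined, so NaN)
# and 4 fennel events.
pipeline = pd.DataFrame({
    "produce_sku": ["romaine"] * 12 + ["fennel"] * 4,
    "location_id": ["store_01"] * 16,
    "raw_yield": [0.69] * 11 + [np.nan] + [0.55] * 4,
})

# count() skips NaN, so quarantined rows do not count as clean preps.
status = (
    pipeline.groupby(COHORT)["raw_yield"]
    .count()
    .rename("clean_preps")
    .reset_index()
)
status["is_provisional"] = status["clean_preps"] < PROVISIONAL_BELOW

print(status)
```

Cohorts flagged here are the ones to route to manual review before their factor reaches costing.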

Should yield be applied here or during cost roll-up?

Derive the factor here and store it; apply it at the point where AP cost becomes EP cost inside the recipe roll-up. Keeping derivation and application separate means one canonical factor feeds every recipe that uses the SKU, and a recipe change never silently alters the measured yield. This is the same decoupling the recipe BOM database design uses to keep structure and cost independent.

What about thermal shrink from cooking?

This pipeline measures trim yield only — the AP-to-EP transition at the prep board. Cook loss is a separate transformation with its own drivers (temperature, time, method) and deserves its own factor and its own log. Multiplying a trim yield by a cook yield gives the true plate cost, but conflating them into one number destroys your ability to diagnose which stage is leaking margin.
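
As a worked illustration of keeping the two factors separate and combining them only at costing time, here is a short sketch. All three numbers are hypothetical.

```python
from decimal import Decimal

# Hypothetical values for one cooked vegetable SKU.
trim_yield = Decimal("0.80")        # from this pipeline: EP weight / AP weight
cook_yield = Decimal("0.75")        # from a separate cook-loss pipeline
ap_cost_per_kg = Decimal("3.00")    # invoice cost

# The combined yield is the product of the stage yields.
combined_yield = trim_yield * cook_yield

# Cost per kg at each stage, so a change can be traced to the stage that caused it.
cost_after_trim = ap_cost_per_kg / trim_yield
cost_on_plate = ap_cost_per_kg / combined_yield

print(f"combined yield:       {combined_yield}")
print(f"cost/kg after trim:   {cost_after_trim}")
print(f"cost/kg on the plate: {cost_on_plate}")
```

Storing `trim_yield` and `cook_yield` as separate factors means a rise in plate cost can be attributed to the prep board or the cook line instead of to an opaque combined number.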

Yield Factor Calculation Frameworks — the parent framework and AP-to-EP normalization rationale this produce pipeline implements.
Designing Recipe BOM Databases — the bill-of-materials schema that consumes final_yield_factor to compute theoretical cost.
Standardizing Portion Sizes Across Locations — how EP weights become enforced plate portions.
Variance Mapping Methodologies — the theoretical-vs-actual layer that multiplies these factors by Decimal costs.
Waste Tracking & Routing Systems — where over-trim and spoilage are attributed instead of hidden in the yield number.
Core Architecture & Cost Mapping Systems — the wider system this produce pipeline anchors into.

For deeper implementation reference, consult the official pandas documentation on GroupBy operations for optimizing memory during cohort transformations.