Mapping POS Taxonomies to Ingredients

The operational disconnect between point-of-sale transaction logs and actual ingredient consumption is the primary bottleneck in multi-unit food-cost control. A POS reports revenue by menu item, modifier category, or promotional bundle; the kitchen tracks inventory by raw-material SKU. Nobody can compute theoretical usage until those two vocabularies are reconciled, and every heuristic shortcut that papers over the gap injects silent cost drift that surfaces months later as unexplained margin leakage. This guide, part of the Core Architecture & Cost Mapping Systems framework, isolates one sub-problem: how to translate a POS sales taxonomy into standardized ingredient identifiers deterministically, so that every sold unit traces back to an auditable bill of materials before any cost allocation runs. A vendor-specific instance of this pattern is worked end-to-end in the companion walkthrough on mapping Toast POS categories to ingredient SKUs; here we define the data contract, the design decisions, and the four-phase transformation that turns noisy sales exports into a clean consumption ledger.

Concept Definition and Data Contract

The discrete unit of work is a stateless transformation layer we will call the POS-to-ingredient normalization pipeline. It ingests a daily sales export, applies a hierarchical mapping dictionary, and emits a structured ingredient-consumption ledger. It holds no state between runs: the same input snapshot must always produce the same output, which is what makes the numbers auditable.

The input contract is deliberately narrow. Each POS export row must supply:

A raw_category string — the menu-item or line-item label exactly as the POS emitted it, including any promotional tags or modifier text.
A units_sold integer — the quantity of that line sold in the period.
A location_id (UUID) and a business date, so consumption can be attributed to the correct cost center and day.

The exports themselves arrive through the ingestion side of the platform — typically an overnight pull governed by the POS API polling strategies that page vendor endpoints without tripping rate limits. This pipeline assumes those rows have already landed; its job is translation, not fetching.

The mapping contract is a version-controlled table that links each canonical POS category to one or more ingredient SKUs. Every row carries a pos_category, an ingredient_sku, a base_qty in canonical base units (NUMERIC, never a float), and a version tag. The table is authored by culinary managers and treated as immutable during a run.
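To see why base_qty is held as NUMERIC rather than a float, consider the sketch below. The mapping rows, SKU names, version tag, and the per-gram cost are made-up values for illustration; the point is only that quantities parsed from text as `Decimal` sum exactly, while the same values as binary floats do not.

```python
from decimal import ROUND_HALF_UP, Decimal

# Hypothetical mapping rows: (pos_category, ingredient_sku, base_qty in grams, version).
MAPPING_ROWS = [
    ("saffron rice", "SKU-SAFFRON", Decimal("0.1"), "v12"),
    ("paella", "SKU-SAFFRON", Decimal("0.2"), "v12"),
]

# Saffron needed for one of each dish, summed exactly.
decimal_total = sum((row[2] for row in MAPPING_ROWS), Decimal("0"))

# The same two quantities as binary floats pick up representation error.
float_total = 0.1 + 0.2

# Quantize once, at the reporting boundary, rather than after every step.
cost_per_gram = Decimal("8.45")  # illustrative price
reported_cost = (decimal_total * cost_per_gram).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP
)
```

Here `decimal_total` equals exactly `Decimal("0.3")`, whereas `float_total` is not equal to `0.3`. Across thousands of lines per day that residue is what shows up as fractions of a cent of unexplained drift.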

The output contract is a single ledger with one row per (date, location_id, ingredient_sku): an adjusted_qty of the ingredient consumed and its unit_cost as NUMERIC. That ledger is exactly what the sales-side reconciliation joins against the theoretical cost table produced by designing recipe BOM databases, and it is the actual-usage input to downstream variance mapping methodologies. Any consumer reads the ledger without ever re-parsing a POS string.

Architecture Decision Rationale

The central decision is deterministic rule resolution versus fuzzy or model-driven matching. It is tempting to throw approximate string matching (or an LLM) at the messy category labels and let it guess the ingredient. We reject that as the primary path for one reason: food-cost math is money math, and a guess that is 95% accurate quietly misattributes 1 line in 20 forever. A single POS category resolving to the wrong SKU does not error — it produces a plausible-but-wrong consumption number that inflates or deflates margin with no stack trace to follow.

So the pipeline resolves every category through explicit, version-controlled rules and fails loud on anything it cannot map. Fuzzy matching is not banned outright, but it is demoted: it may only run against a bounded confidence threshold and its output is routed to a human review queue, never written straight to the ledger. This mirrors the same discipline applied at ingestion time by CSV bulk import automation, where unresolved units are quarantined rather than coerced.
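The demoted role of fuzzy matching can be sketched with the standard library's `difflib`. This is one possible shape, not the platform's implementation: the `ReviewItem` type, the `route_unmapped` function, and the 0.85 threshold are all assumptions chosen for illustration. The function only ever produces a suggestion for a reviewer; nothing in it writes to the ledger.

```python
from __future__ import annotations

from dataclasses import dataclass
from difflib import SequenceMatcher, get_close_matches


@dataclass(frozen=True)
class ReviewItem:
    """A suggestion for a human reviewer, never a ledger write (hypothetical type)."""

    label: str
    suggestion: str | None
    score: float


def route_unmapped(
    label: str, known_categories: list[str], threshold: float = 0.85
) -> ReviewItem:
    """Attach at most one bounded-confidence suggestion to an unmapped label."""
    hits = get_close_matches(label, known_categories, n=1, cutoff=threshold)
    if not hits:
        # Below the threshold: queue the label with no suggestion at all.
        return ReviewItem(label, None, 0.0)
    score = SequenceMatcher(None, hits[0], label).ratio()
    return ReviewItem(label, hits[0], score)
```

A misspelled label such as `chedar burger` would come back with `cheddar burger` as its suggestion, while an unrelated label comes back with `suggestion=None`. Either way the item lands in the review queue and the deterministic mapping table remains the only source of ledger writes.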

The second decision is where the mapping table lives. It is deliberately kept out of application code as a versioned data artifact (Git or a config service), so a culinary edit to a recipe mapping is an auditable, revertible commit rather than a code deploy. Runtime treats it as read-only and immutable; the pipeline validates its schema before trusting a single row.

The third decision is batch, not inline. Consumption is computed once per location per day against a frozen snapshot, not recomputed on every dashboard read. High-volume portfolios dispatch the run through the async batch processing workflow so a slow location never blocks the others, and a retried job reproduces the identical ledger.
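One way to make "frozen snapshot" checkable is to fingerprint the inputs of a run, so that a retry can confirm it is processing the same data. The sketch below is an assumption about how that might be done, not something the pipeline described here prescribes; `snapshot_fingerprint` is a hypothetical helper.

```python
import hashlib
import json


def snapshot_fingerprint(rows: list[dict[str, str]], mapping_version: str) -> str:
    """SHA-256 over a canonical encoding of the frozen inputs.

    Rows are sorted and keys are ordered, so the digest depends on content
    only, not on the order in which the export happened to arrive.
    """
    canonical_rows = sorted(json.dumps(r, sort_keys=True) for r in rows)
    payload = json.dumps(
        {"mapping_version": mapping_version, "rows": canonical_rows},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Storing the digest alongside the ledger run gives a cheap equality test: same fingerprint, same expected ledger.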

Phase 1 Implementation — Ingestion and String Normalization

POS exports contain inconsistent category strings, location-specific naming, and modifier tags that obscure true ingredient usage. The first step sanitizes those strings deterministically — controlled case folding, delimiter standardization, and removal of promotional artifacts — so that every downstream join sees a canonical key. This is pure vectorized pandas; no row-by-row iteration.

from __future__ import annotations

import re

import pandas as pd

# Promotional brackets, "@location" tags, and combo/bundle noise, plus stray punctuation.
# Keywords are word-bounded so "promo" does not also eat the start of "promotion".
_POS_ARTIFACT = re.compile(r"(?i)\[.*?\]|@\S+|\b(?:promo|bundle|combo)\b|[^\w\s\-]+")


def normalize_pos_strings(raw_categories: pd.Series) -> pd.Series:
    """Deterministic normalization of POS category labels.

    Strips promotional/location artifacts, folds case, and collapses whitespace.
    Vectorized end-to-end; no per-row Python calls.
    """
    return (
        raw_categories.astype("string")
        .str.strip()
        .str.lower()
        # Replace each artifact with a space, not "", so neighbouring words are
        # never fused ("chicken combo meal" must not become "chickenmeal").
        .str.replace(_POS_ARTIFACT, " ", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )

Normalization establishes one ingestion protocol regardless of POS vendor. Fixing the string shape here — before any relational join — is what stops taxonomy drift across locations from fragmenting the mapping table into hundreds of near-duplicate keys.
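A quick way to sanity-check the normalization rules is to run them over a handful of representative labels. The sketch below re-implements the same steps on plain strings with the standard `re` module so it runs without pandas; `normalize_label` and the sample labels are illustrative. It substitutes a single space for each artifact so that adjacent words are not fused together.

```python
import re

# Same artifact classes as the Phase 1 pattern: bracket tags, "@location" tags,
# promo/bundle/combo keywords (word-bounded), and stray punctuation.
_ARTIFACT = re.compile(
    r"\[.*?\]|@\S+|\b(?:promo|bundle|combo)\b|[^\w\s\-]+", re.IGNORECASE
)


def normalize_label(label: str) -> str:
    """Scalar stand-in for the vectorized normalizer (illustrative helper)."""
    text = _ARTIFACT.sub(" ", label.strip().lower())  # a space, so words never fuse
    return re.sub(r"\s+", " ", text).strip()


samples = [
    "[LTO] Chicken Combo @Downtown",
    "Caesar Salad, No Croutons",
    "Mac&Cheese PROMO",
]
normalized = [normalize_label(s) for s in samples]
```

The three samples reduce to `chicken`, `caesar salad no croutons`, and `mac cheese`. Fixtures like these are worth keeping as regression tests, since a change to the artifact pattern silently changes every join key downstream.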

Phase 2 Implementation — Composite Decomposition and Modifier Resolution

A sanitized line is not yet an ingredient. A label such as avocado toast add bacon or caesar salad no croutons carries a base item and additive or subtractive modifiers, and those modifiers move real ingredient quantities. Decomposition splits each line into a base item and a structured modifier list so that modifiers become discrete ingredient vectors rather than opaque text. The split is driven by a fixed keyword set. The per-line tokenization lives in a small pure helper applied with Series.map; that is a per-row Python call rather than a vectorized operation, but the helper mutates no shared state, so the result stays deterministic.

from __future__ import annotations

from dataclasses import dataclass

import pandas as pd

_MODIFIER_KEYWORDS: frozenset[str] = frozenset(
    {"add", "extra", "no", "sub", "replace", "side"}
)


@dataclass(frozen=True, slots=True)
class DecomposedLine:
    base_item: str
    modifiers: tuple[str, ...]


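def _parse_line(text: str) -> DecomposedLine:
    base: list[str] = []
    mods: list[str] = []
    pending: str | None = None  # keyword waiting for its single-token target
    for token in text.split():
        if token in _MODIFIER_KEYWORDS:
            pending = token
            continue
        if pending is None:
            base.append(token)
        else:
            # Keep the keyword with its target ("add bacon", "no croutons") so
            # cost allocation can still tell an addition from a removal.
            # Only the token right after a keyword is treated as the target.
            mods.append(f"{pending} {token}")
            pending = None
    return DecomposedLine(" ".join(base), tuple(mods))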


def decompose_modifiers(normalized: pd.Series) -> pd.DataFrame:
    """Split base items from modifier tokens into a structured frame."""
    parsed = normalized.map(_parse_line)
    return pd.DataFrame(
        {
            "base_item": parsed.map(lambda d: d.base_item),
            "modifiers": parsed.map(lambda d: d.modifiers),
        },
        index=normalized.index,
    )

Modifiers resolve to positive or negative ingredient quantities during cost allocation: add bacon adds a rasher’s weight, no croutons subtracts one. Keeping the two apart at this stage is what prevents double-counting a promotional add-on and what lets dietary substitutions flow through to the ledger accurately instead of vanishing into an unparsed string.
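How a modifier turns into a signed quantity can be sketched as follows. This assumes each modifier is available as a (keyword, target) pair and that a separate lookup gives the SKU and per-modifier quantity for each target; the `_SIGN` table, the `modifier_deltas` function, and the SKU names are hypothetical. Substitution keywords such as sub and replace involve two ingredients, so this sketch leaves them unresolved rather than guessing.

```python
from __future__ import annotations

from decimal import Decimal

# Assumed direction for each keyword; "sub"/"replace" are deliberately absent.
_SIGN: dict[str, int] = {"add": 1, "extra": 1, "side": 1, "no": -1}


def modifier_deltas(
    modifiers: list[tuple[str, str]],
    modifier_qty: dict[str, tuple[str, Decimal]],
) -> tuple[dict[str, Decimal], list[tuple[str, str]]]:
    """Return (signed quantity per SKU, modifiers that could not be resolved)."""
    deltas: dict[str, Decimal] = {}
    unresolved: list[tuple[str, str]] = []
    for keyword, target in modifiers:
        sign = _SIGN.get(keyword)
        entry = modifier_qty.get(target)
        if sign is None or entry is None:
            unresolved.append((keyword, target))  # route to review, never drop
            continue
        sku, qty = entry
        deltas[sku] = deltas.get(sku, Decimal("0")) + sign * qty
    return deltas, unresolved
```

With a lookup of 15 g per bacon add-on and 10 g per crouton portion, `[("add", "bacon"), ("no", "croutons")]` yields +15 for the bacon SKU and -10 for the crouton SKU, which then net against the base recipe during allocation.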

Phase 3 Implementation — SKU Resolution and Error Routing

Each decomposed base item is matched against the master mapping table via a deterministic left join. The mapping frame is schema-validated first (a Pydantic v2 model gate), and anything the join cannot resolve is quarantined rather than defaulted — an unmapped category costed at zero is the single most dangerous silent corruption in this whole pipeline, because it looks like free food and inflates apparent margin.

from __future__ import annotations

from decimal import Decimal

import pandas as pd
from pydantic import BaseModel, ConfigDict, field_validator


class MappingRow(BaseModel):
    """One row of the version-controlled POS->SKU mapping table."""

    model_config = ConfigDict(frozen=True)

    pos_category: str
    ingredient_sku: str
    base_qty: Decimal          # canonical base units, never a float
    version: str

    @field_validator("base_qty")
    @classmethod
    def _non_negative(cls, v: Decimal) -> Decimal:
        if v < 0:
            raise ValueError("base_qty must be >= 0")
        return v


def validate_mapping(mapping_df: pd.DataFrame) -> pd.DataFrame:
    """Reject a malformed mapping table before it is trusted for a run."""
    rows = [MappingRow(**rec) for rec in mapping_df.to_dict("records")]
    # One pos_category may legitimately fan out to several SKUs, so uniqueness
    # is enforced on the (pos_category, ingredient_sku) pair, not the category.
    key = ["pos_category", "ingredient_sku"]
    dupes = mapping_df.duplicated(subset=key)
    if dupes.any():
        raise ValueError(
            f"duplicate mapping keys: {mapping_df.loc[dupes, key].values.tolist()}"
        )
    return pd.DataFrame([r.model_dump() for r in rows])


def resolve_skus(
    decomposed: pd.DataFrame, mapping_df: pd.DataFrame
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Deterministic left join to SKUs. Returns (resolved, quarantined)."""
    mapping = validate_mapping(mapping_df)

    merged = decomposed.merge(
        mapping[["pos_category", "ingredient_sku", "base_qty"]],
        left_on="base_item",
        right_on="pos_category",
        how="left",
    )

    unmapped = merged["ingredient_sku"].isna()
    quarantined = merged.loc[unmapped, ["base_item"]].drop_duplicates()
    resolved = merged.loc[~unmapped].drop(columns=["pos_category"])
    return resolved, quarantined

A single POS category frequently maps to multiple ingredient SKUs — a composite dish explodes into all of its raw materials — so this is a one-to-many join that fans one sales line across several ledger rows. That fan-out is why the mapping table must align directly with the recipe graph defined in the BOM schema: the same ingredient_sku keys, the same base units. Rather than failing the whole run on the first unmapped label, resolve_skus returns the quarantine set separately so operational continuity holds while a culinary manager fills the taxonomy gap.
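The fan-out and quarantine behaviour can be illustrated without pandas. In this sketch the `MAPPING` rows, quantities, and SKU names are invented for the example, and `fan_out` is a hypothetical stand-in for the join: each sold base item expands into one row per constituent SKU, and anything absent from the mapping is collected separately instead of being costed at zero.

```python
from __future__ import annotations

from decimal import Decimal

# Hypothetical mapping rows: one row per (pos_category, ingredient_sku), grams.
MAPPING = [
    ("cheeseburger", "SKU-BEEF-PATTY", Decimal("150")),
    ("cheeseburger", "SKU-CHEDDAR", Decimal("20")),
    ("cheeseburger", "SKU-BUN", Decimal("60")),
]


def fan_out(
    base_items: list[str],
) -> tuple[list[tuple[str, str, Decimal]], list[str]]:
    """Return (resolved rows, quarantined labels) for a list of base items."""
    by_category: dict[str, list[tuple[str, Decimal]]] = {}
    for category, sku, qty in MAPPING:
        by_category.setdefault(category, []).append((sku, qty))

    resolved: list[tuple[str, str, Decimal]] = []
    quarantined: list[str] = []
    for item in base_items:
        if item not in by_category:
            if item not in quarantined:
                quarantined.append(item)  # never costed at zero, never guessed
            continue
        resolved.extend((item, sku, qty) for sku, qty in by_category[item])
    return resolved, quarantined
```

Calling `fan_out(["cheeseburger", "truffle fries"])` returns three resolved rows for the burger and a quarantine list containing only `truffle fries`.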

Phase 4 Implementation — Volume Distribution and Ledger Handoff

With SKUs resolved, the pipeline distributes sales volume across ingredients by multiplying units_sold by base_qty, then divides by a yield factor in (0, 1] so that theoretical usage matches what actually had to be purchased. Trim loss, moisture reduction, and portion drift all live in the yield factor; sourcing it correctly is the job of the yield factor calculation frameworks, and the same factors feed portion size standardization so a plate and a purchase order agree. All arithmetic stays vectorized, and monetary values stay in Decimal/NUMERIC end-to-end.

from __future__ import annotations

from decimal import Decimal

import pandas as pd


def generate_consumption_ledger(
    sales_df: pd.DataFrame,       # date, location_id, units_sold, unit_cost, base_item
    resolved_df: pd.DataFrame,    # base_item, ingredient_sku, base_qty
    yield_factors: pd.Series,     # index=ingredient_sku, value=Decimal in (0, 1]
) -> pd.DataFrame:
    """Distribute sales volume across SKUs and apply yield correction.

    Vectorized merge/assign — no DataFrame.apply over rows.
    """
    # resolved_df may carry one row per sold line; reduce it to the distinct recipe
    # rows first, otherwise a base_item sold on N lines is multiplied N times.
    recipe = resolved_df[["base_item", "ingredient_sku", "base_qty"]].drop_duplicates()
    ledger = recipe.merge(sales_df, on="base_item", how="inner")

    # Theoretical quantity = units sold * per-unit recipe weight.
    ledger["theoretical_qty"] = ledger["units_sold"] * ledger["base_qty"]

    # Yield-adjust by SKU; a missing factor defaults to 1 (no correction), never 0.
    yf = ledger["ingredient_sku"].map(yield_factors).fillna(Decimal("1"))
    out_of_range = (yf <= 0) | (yf > 1)
    if out_of_range.any():
        bad = ledger.loc[out_of_range, "ingredient_sku"].unique().tolist()
        raise ValueError(f"yield_factor must be in (0, 1]; offending SKUs: {bad}")
    ledger["adjusted_qty"] = ledger["theoretical_qty"] / yf

    return ledger.groupby(
        ["date", "location_id", "ingredient_sku"], as_index=False
    ).agg(adjusted_qty=("adjusted_qty", "sum"), unit_cost=("unit_cost", "first"))

The groupby collapse is what makes the ledger consumer-ready: one authoritative row per ingredient per location per day, with volume already summed across every dish that used it. That row is the actual-usage half of the theoretical-versus-actual equation, and it hands off cleanly to variance analysis and to waste tracking and routing systems that reconcile the residual gap.
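A worked example makes the distribution arithmetic concrete. The numbers below are illustrative only: two dishes that consume the same beef SKU, and an assumed yield factor of 0.80 (20% lost to trim and cooking).

```python
from decimal import Decimal

# (units_sold, base_qty in grams per unit) for one SKU on one day at one location.
lines = [
    (40, Decimal("150")),  # cheeseburger
    (10, Decimal("300")),  # double cheeseburger
]
yield_factor = Decimal("0.80")  # assumed for the example

# Theoretical usage: sum of units_sold * base_qty across every dish using the SKU.
theoretical_g = sum((units * qty for units, qty in lines), Decimal("0"))

# Yield-adjusted usage: the raw weight that must be purchased to plate that much.
adjusted_g = theoretical_g / yield_factor
```

The plated weight is 40 × 150 + 10 × 300 = 9000 g, and dividing by 0.80 gives 11250 g of raw product. Dividing rather than multiplying is the step most often reversed: a yield below 1 must increase the purchased quantity, not shrink it.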

Production Hardening

Moving from a working transform to a dependable nightly job across dozens of locations comes down to a handful of disciplines:

Idempotent execution. A re-run against the same input snapshot must produce a byte-identical ledger. Key the write on (date, location_id, ingredient_sku) with an upsert so a retried job overwrites exactly the rows it owns instead of appending duplicates. Deterministic sorting before write removes any ordering nondeterminism.
Schema enforcement at the boundary. Validate both the export and the mapping table with Pydantic v2 (or pandera) before the mapping layer sees them. Reject malformed exports up front rather than letting a stray column type corrupt a join silently.
Version control and audit trails. Store the mapping dictionary as a versioned artifact and stamp every ledger run with the version it consumed. That tag lets a reader detect a stale mapping and lets an auditor reproduce any historical number exactly.
Quarantine, don’t fail. Route unmapped categories and out-of-range yields to a quarantine table with a reason code; exclude only the affected rows from the ledger. The run completes, the gaps are visible, and no dish is ever costed at zero.
Unit normalization upstream. Canonicalize units at ingestion, not in Phase 4. By the time a base_qty reaches this pipeline it is already in grams or millilitres, so the distribution math never touches conversion — the same discipline enforced across the multi-location cost center architecture, where regional aliases must resolve to one base unit.
Performance and memory bounds. For exports over ~500k rows, partition by location and date, cast join keys to category dtype, and pre-index the mapping table. The mapping table is shared across locations; load it once and iterate locations, swapping only the sales frame, to keep the join sub-second and memory flat.
RBAC boundaries. Culinary managers hold write access to the mapping table and nothing else; the pipeline runs as a service account with read-only access to exports and the mapping artifact and write access to the ledger alone. Financial output stays isolated from manual taxonomy edits.
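The idempotent write and the version stamp can be sketched together. This example uses an in-memory SQLite database purely as a stand-in for whatever store holds the real ledger; the table layout, the `write_ledger` helper, the `loc-a` placeholder, and the `v12` tag are assumptions for illustration. Quantities are stored as text so the database never coerces a Decimal to a float.

```python
import sqlite3
from decimal import Decimal

DDL = """
CREATE TABLE ledger (
    date TEXT NOT NULL,
    location_id TEXT NOT NULL,
    ingredient_sku TEXT NOT NULL,
    adjusted_qty TEXT NOT NULL,
    mapping_version TEXT NOT NULL,
    PRIMARY KEY (date, location_id, ingredient_sku)
)
"""


def write_ledger(
    conn: sqlite3.Connection,
    rows: list[tuple[str, str, str, Decimal]],
    version: str,
) -> None:
    """Overwrite exactly the keys this run owns; a retry adds no rows."""
    ordered = sorted(rows, key=lambda r: r[:3])  # deterministic write order
    conn.executemany(
        "INSERT OR REPLACE INTO ledger VALUES (?, ?, ?, ?, ?)",
        [(d, loc, sku, str(qty), version) for d, loc, sku, qty in ordered],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
run = [("2024-05-01", "loc-a", "SKU-BEEF-PATTY", Decimal("11250"))]
write_ledger(conn, run, "v12")
write_ledger(conn, run, "v12")  # simulated retry of the same job
row_count = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
```

After the simulated retry the table still holds a single row, because the primary key on (date, location_id, ingredient_sku) turns the second write into an overwrite.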

Failure Modes and Troubleshooting

Symptom	Likely cause	Detection / fix
A dish contributes zero consumption	Unmapped category defaulted to zero	Quarantine unmapped rows; never left-join-and-coerce. Alert on any nonzero quarantine count.
Consumption doubles for one ingredient	Composite item mapped twice, or a duplicate mapping key	Enforce unique (`pos_category`, `ingredient_sku`) pairs; `validate_mapping` raises on duplicates before the run.
Theoretical usage understated everywhere	`yield_factor` defaulting to `1.0` where trim loss is real	Source yields from the yield frameworks; the `(0, 1]` guard rejects `0` and negatives.
Costs drift by fractions of a cent	Float arithmetic on money instead of `Decimal`/`NUMERIC`	Keep `base_qty`, `unit_cost`, and costs in `Decimal`; quantize once at the boundary.
Same label maps five different ways	Un-normalized strings (`Combo`, `combo`, `[LTO] Combo`)	Fix in Phase 1; assert the normalized key set matches the mapping key set.
A modifier silently vanishes	Modifier keyword not in the keyword set	Log unrecognized leading tokens; extend `_MODIFIER_KEYWORDS` and re-run.
Historical ledger changes retroactively	Mapping edited in place without a version bump	Version the mapping artifact; stamp and pin the `version` per run.

The through-line of every failure above is the same: a category must be explicitly resolved, validated, and versioned before it becomes a number — never guessed, never defaulted to zero, never computed in floating point.
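Several of the checks in the table reduce to comparing two sets of keys before the run starts. A minimal sketch of such a pre-flight check follows; `coverage_report` and the sample keys are hypothetical.

```python
from __future__ import annotations


def coverage_report(
    normalized_keys: set[str], mapping_keys: set[str]
) -> dict[str, list[str]]:
    """Compare the day's normalized categories against the mapping table's keys."""
    return {
        "unmapped": sorted(normalized_keys - mapping_keys),  # must be quarantined
        "unused": sorted(mapping_keys - normalized_keys),    # possibly stale entries
    }


report = coverage_report(
    {"cheeseburger", "caesar salad", "truffle fries"},
    {"cheeseburger", "caesar salad", "club sandwich"},
)
should_alert = bool(report["unmapped"])  # alert on any nonzero quarantine count
```

In this example `truffle fries` is reported as unmapped and `club sandwich` as unused. The first list should trigger an alert; the second is informational, since an item that simply did not sell today is not an error.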

FAQ

Should I use fuzzy matching or an LLM to map POS categories?

Not as the primary path. Approximate matching produces plausible-but-wrong attributions that never error, so they poison the ledger indefinitely. Resolve through explicit version-controlled rules and fail loud; if you use fuzzy matching at all, bound it by a confidence threshold and route its output to a human review queue rather than writing it straight to the ledger.

How do I map a composite dish that uses several ingredients?

Model it as a one-to-many mapping: one pos_category row per constituent ingredient_sku, each with its own base_qty. The Phase 3 join fans the sold volume across every raw material, and the Phase 4 groupby re-aggregates by ingredient. Keep those SKUs identical to the leaves in your BOM graph so the two systems reconcile.

What happens to a category with no mapping entry?

It is quarantined — written to a separate table with a reason code — and excluded from the ledger, while every mapped row for that run still processes. The job completes, the gap is visible for a culinary manager to fill, and no dish is ever costed at zero.

Why keep the mapping table out of application code?

Because a recipe mapping change is a business edit, not a code change. Storing the table as a versioned data artifact makes each edit an auditable, revertible commit, lets you pin the exact version a historical run used, and keeps culinary managers out of the deploy pipeline.