
Implementing Celery for Async Menu Syncs

This page walks a food-tech developer through wiring up a Celery worker fleet that pushes a central menu change — a new vendor price, an edited recipe BOM, a remapped POS item — out to every location without a single blocking HTTP call. It is the concrete implementation companion to the broader async batch processing workflows; read that first for the decoupling rationale, then follow the numbered steps here to stand up a queue you can point at this week’s menu drop.

The task this solves is narrow and operationally sharp: when a culinary director updates ingredient pricing centrally and fifty locations must reconcile their theoretical food cost before the dinner service, a synchronous fan-out stalls on the slowest store and cascades timeouts back through the whole request. Moving each location’s reconciliation onto a queue turns a fragile all-or-nothing broadcast into fifty independent, retryable jobs — and an idempotency key keeps a retried job from double-counting an ingredient.

Prerequisites and Data Contract

Every step below assumes the following environment. If a version or shape drifts, the transforms will still run but can silently emit a plausible-but-wrong cost.

Python 3.11+, celery 5.3+, redis-py 5.x, pandas 2.x, and numpy 1.26+, plus SQLAlchemy 2.x with the psycopg (version 3) driver for the Step 5 sink.
Redis 7+ as both broker and result backend (separate logical DBs), and PostgreSQL 13+ as the cost-analytics sink, where every monetary column is NUMERIC, never a binary float.
Ingredient identifiers already reconciled against the canonical POS taxonomy mappings upstream — this pipeline joins on stable IDs, it does not resolve aliases.
A menu_version string (a content hash of the published menu) generated once at push time, so every location reconciles against the same immutable snapshot.
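One way to derive such a content hash is sketched below. It assumes the published menu is available as a JSON-serializable dict; the helper name compute_menu_version is illustrative and not part of the pipeline above.

```python
import hashlib
import json


def compute_menu_version(menu: dict) -> str:
    """Return a stable content hash for a published menu (hypothetical helper)."""
    # sort_keys plus fixed separators yield one canonical byte string per content,
    # regardless of the key order the publisher happened to emit.
    canonical = json.dumps(menu, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content, different key order: both produce the same version string.
a = {"items": [{"pos_item_id": "P1"}], "ingredient_pricing": []}
b = {"ingredient_pricing": [], "items": [{"pos_item_id": "P1"}]}
version_a = compute_menu_version(a)
version_b = compute_menu_version(b)
```

List order is still significant in this sketch, so a publisher that reorders records would need to sort them before hashing.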

The payload each task receives is a single JSON document with three record lists: the menu items, the BOM mappings that tie each POS item to its ingredients, and the ingredient pricing. The lists are joined by ID, not by position. Quantities and costs travel as strings, not floats, so no precision is lost in transit before they reach NUMERIC.

from typing import TypedDict

class MenuItem(TypedDict):
    pos_item_id: str

class BOMMapping(TypedDict):
    source_pos_id: str
    ingredient_sku: str
    qty_used: str        # decimal string, base recipe unit

class IngredientPrice(TypedDict):
    sku: str
    unit_cost: str       # decimal string, per base unit
    yield_factor: str    # decimal string in (0, 1]

class MenuPayload(TypedDict):
    items: list[MenuItem]
    bom_mappings: list[BOMMapping]
    ingredient_pricing: list[IngredientPrice]
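
TypedDict documents the shape but does not enforce it at runtime. A minimal validation sketch, assuming you want to reject a payload before it is enqueued, could look like the following; parse_decimal and validate_payload are hypothetical helpers, and the non-negativity checks are an added assumption.

```python
from decimal import Decimal, InvalidOperation


def parse_decimal(value: object, field: str) -> Decimal:
    """Parse a decimal string, rejecting floats, garbage, and non-finite values."""
    if not isinstance(value, str):
        raise ValueError(f"{field} must be a decimal string, got {type(value).__name__}")
    try:
        parsed = Decimal(value)
    except InvalidOperation as exc:
        raise ValueError(f"{field} is not a valid decimal: {value!r}") from exc
    if not parsed.is_finite():
        raise ValueError(f"{field} must be finite: {value!r}")
    return parsed


def validate_payload(payload: dict) -> list:
    """Return human-readable problems; an empty list means the payload passed."""
    problems = []
    for i, row in enumerate(payload.get("bom_mappings", [])):
        try:
            if parse_decimal(row.get("qty_used"), "qty_used") < 0:
                problems.append(f"bom_mappings[{i}]: qty_used is negative")
        except ValueError as exc:
            problems.append(f"bom_mappings[{i}]: {exc}")
    for i, row in enumerate(payload.get("ingredient_pricing", [])):
        try:
            if parse_decimal(row.get("unit_cost"), "unit_cost") < 0:
                problems.append(f"ingredient_pricing[{i}]: unit_cost is negative")
            factor = parse_decimal(row.get("yield_factor"), "yield_factor")
            # The data contract above requires yield_factor in (0, 1].
            if not (Decimal(0) < factor <= Decimal(1)):
                problems.append(f"ingredient_pricing[{i}]: yield_factor {factor} outside (0, 1]")
        except ValueError as exc:
            problems.append(f"ingredient_pricing[{i}]: {exc}")
    return problems


good = {
    "items": [{"pos_item_id": "P1"}],
    "bom_mappings": [{"source_pos_id": "P1", "ingredient_sku": "S1", "qty_used": "0.150"}],
    "ingredient_pricing": [{"sku": "S1", "unit_cost": "12.40", "yield_factor": "0.75"}],
}
problems = validate_payload(good)
```

Collecting every problem instead of raising on the first one gives the person fixing a bad price file the full picture in a single pass.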

Step-by-Step Implementation

Each block is self-contained. Build the pipeline one gate at a time; do not collapse them into a single mega-task, because the retry and locking boundaries are exactly what make the fan-out safe.

Step 1 — Configure the app, broker, and queue routing

Late acknowledgement is the single most important setting for a financial pipeline: a worker that dies mid-task must return its message to the broker, not silently drop a location’s update. worker_prefetch_multiplier=1 stops a fast worker from hoarding jobs while heterogeneous location payloads sit idle behind it.

from celery import Celery

app = Celery("menu_sync")
app.conf.update(
    broker_url="redis://redis-broker:6379/0",
    result_backend="redis://redis-backend:6379/1",
    task_acks_late=True,                 # requeue on worker loss, never drop
    task_reject_on_worker_lost=True,
    task_serializer="json",
    accept_content=["json"],
    result_serializer="json",
    worker_prefetch_multiplier=1,        # fair dispatch across uneven payloads
    task_default_queue="menu_sync",
    task_routes={
        "menu_sync.tasks.sync_location_menu": {"queue": "menu_sync"},
    },
)

Step 2 — Derive a deterministic idempotency key and lock

The key is a hash of location, menu version, and a five-minute time window. Two rapid pushes of the same version collapse to one job; a genuine new version gets its own key. A Redis SET ... NX EX is an atomic check-and-set: the first task to claim the key wins, every duplicate short-circuits.

import hashlib
import time
import redis

redis_client = redis.Redis(
    host="redis-broker", port=6379, db=0, decode_responses=True
)

def generate_idempotency_key(location_id: str, menu_version: str) -> str:
    window = int(time.time() // 300)  # 5-minute collision window
    payload = f"{location_id}:{menu_version}:{window}"
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

def acquire_lock(idem_key: str, ttl_s: int = 300) -> bool:
    return bool(redis_client.set(f"menu_sync:lock:{idem_key}", "1", nx=True, ex=ttl_s))
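
The window arithmetic is easier to reason about with the clock passed in explicitly. The variant below is a test-oriented sketch, not the production function: now_s and window_s are added parameters so that the boundary behaviour can be checked without waiting five minutes.

```python
import hashlib


def idempotency_key(location_id: str, menu_version: str,
                    now_s: float, window_s: int = 300) -> str:
    """Same hashing scheme as above, with an injectable clock for testing."""
    window = int(now_s // window_s)  # bucket index; 900..1199 all map to bucket 3
    payload = f"{location_id}:{menu_version}:{window}"
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


same_bucket = idempotency_key("loc-1", "v1", 1000.0) == idempotency_key("loc-1", "v1", 1199.9)
next_bucket = idempotency_key("loc-1", "v1", 1199.9) == idempotency_key("loc-1", "v1", 1200.0)
new_version = idempotency_key("loc-1", "v1", 1000.0) == idempotency_key("loc-1", "v2", 1000.0)
```

Here same_bucket is True while next_bucket and new_version are False: buckets are aligned to the epoch, so two pushes one second apart can still land in different buckets.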

Step 3 — Define the sync task with bounded retries

The task acquires the lock, delegates the heavy work, and lets Celery’s retry machinery handle transient failures with exponential backoff. If the lock is already held, it returns skipped — that is the idempotency guarantee doing its job, not an error.

from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)

@app.task(
    bind=True, max_retries=3, default_retry_delay=30,
    acks_late=True, name="menu_sync.tasks.sync_location_menu",
)
def sync_location_menu(self, location_id: str, menu_payload: dict, menu_version: str) -> dict:
    idem_key = generate_idempotency_key(location_id, menu_version)
    if not acquire_lock(idem_key):
        logger.info("Duplicate sync for %s, skipping", location_id)
        return {"status": "skipped", "location": location_id, "key": idem_key}
    try:
        reconciled = reconcile_menu_chunks(menu_payload)  # Step 4
        write_theoretical_cost(location_id, menu_version, reconciled)  # Step 5
        return {"status": "success", "location": location_id, "key": idem_key}
    except Exception as exc:
        # Release the lock before retrying. Otherwise the retry would find the key
        # still held and return "skipped" without redoing the failed work.
        redis_client.delete(f"menu_sync:lock:{idem_key}")
        countdown = 30 * (2 ** self.request.retries)  # 30s, 60s, 120s
        logger.warning("Sync failed for %s, retry in %ss: %s", location_id, countdown, exc)
        raise self.retry(exc=exc, countdown=countdown)
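
The backoff arithmetic in the except branch can be tabulated ahead of time to see how long a failing location stays in flight. This is a small sketch; backoff_schedule and its cap_s parameter are illustrative additions.

```python
from typing import Optional


def backoff_schedule(base_s: int = 30, max_retries: int = 3,
                     cap_s: Optional[int] = None) -> list:
    """Delays before each retry, doubling from base_s, optionally capped."""
    delays = [base_s * (2 ** attempt) for attempt in range(max_retries)]
    if cap_s is not None:
        delays = [min(delay, cap_s) for delay in delays]
    return delays


delays = backoff_schedule()       # the schedule used by sync_location_menu
total_wait_s = sum(delays)        # time spent waiting before the final attempt
```

With the defaults the delays are 30, 60, and 120 seconds, 210 seconds in total. That is close to the 300-second key window, so a later retry can compute a different idempotency key than the first attempt did.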

Step 4 — Reconcile BOM against POS in vectorized chunks

Large payloads (thousands of SKUs with nested sub-recipes) can blow past worker memory if joined in one pass, because the merged frame holds one row per item-ingredient pair. Slice the menu frame into deterministic chunks so each intermediate join stays bounded, and use a vectorized merge, never a row-wise apply, to join each menu item to its BOM lines and pricing. This step does structural reconciliation only; it moves the decimal-string cost columns through untouched so no float ever touches the money path.

import pandas as pd

CHUNK_SIZE = 500

def reconcile_menu_chunks(payload: dict) -> pd.DataFrame:
    df_menu = pd.json_normalize(payload["items"])
    df_bom = pd.json_normalize(payload["bom_mappings"])
    df_price = pd.json_normalize(payload["ingredient_pricing"])

    columns = ["pos_item_id", "ingredient_sku",
               "qty_used", "unit_cost", "yield_factor", "is_unmapped"]
    if df_menu.empty:
        # pd.concat raises on an empty list, so return an empty, correctly shaped frame.
        return pd.DataFrame(columns=columns).astype({"is_unmapped": bool})

    out: list[pd.DataFrame] = []
    for start in range(0, len(df_menu), CHUNK_SIZE):
        chunk = df_menu.iloc[start:start + CHUNK_SIZE]
        joined = (
            chunk.merge(df_bom, left_on="pos_item_id",
                        right_on="source_pos_id", how="left")
                 .merge(df_price, left_on="ingredient_sku",
                        right_on="sku", how="left")
        )
        # Unmatched joins surface as nulls: quarantine, do not silently cost as zero.
        joined["is_unmapped"] = joined["ingredient_sku"].isna() | joined["unit_cost"].isna()
        out.append(joined[columns])
    return pd.concat(out, ignore_index=True)

Step 5 — Compute theoretical cost in NUMERIC at the sink

Cost math lives in PostgreSQL, where NUMERIC gives exact decimal arithmetic. Stage the reconciled rows, then let the database evaluate (qty_used / yield_factor) * unit_cost. Rows flagged is_unmapped are excluded from the roll-up and logged, so an unmapped ingredient can never masquerade as free. This mirrors the exact-arithmetic contract the recipe BOM cost roll-up depends on.
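
For a quick cross-check outside the database, the same expression can be evaluated with decimal.Decimal. This is a sketch for spot-checking individual lines, not a replacement for the SQL roll-up; line_cost is a hypothetical helper, and for non-terminating quotients its trailing digits can differ from what NUMERIC produces.

```python
from decimal import Decimal
from typing import Optional


def line_cost(qty_used: str, unit_cost: str, yield_factor: str) -> Optional[Decimal]:
    """Evaluate (qty_used / yield_factor) * unit_cost on decimal strings."""
    factor = Decimal(yield_factor)
    if factor == 0:
        return None  # stands in for the NULL that NULLIF(yield_factor, 0) produces
    return (Decimal(qty_used) / factor) * Decimal(unit_cost)


# 0.150 units at a 0.75 yield means buying 0.2 units; at 12.40 per unit that is 2.48.
cost = line_cost("0.150", "12.40", "0.75")

# The reason floats stay off the money path: binary floats cannot represent 0.1 exactly.
float_is_exact = (0.1 + 0.2 == 0.3)
decimal_is_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
```

Here cost equals Decimal("2.48"), float_is_exact is False, and decimal_is_exact is True.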

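import logging

import pandas as pd
from sqlalchemy import create_engine, text

logger = logging.getLogger(__name__)
engine = create_engine("postgresql+psycopg://analytics@warehouse/foodcost")

def write_theoretical_cost(location_id: str, version: str, df: pd.DataFrame) -> None:
    unmapped = int(df["is_unmapped"].sum())
    if unmapped:
        logger.warning("%s unmapped rows excluded for %s", unmapped, location_id)
    priced = df.loc[~df["is_unmapped"]].copy()
    priced["location_id"] = location_id
    priced["menu_version"] = version
    params = {"v": version, "loc": location_id}
    # One transaction: clear earlier staging rows for this (location, version) so a
    # re-run cannot double the SUM, then stage and roll up atomically.
    # Assumes stg_menu_reconcile already exists.
    with engine.begin() as conn:
        conn.execute(text(
            "DELETE FROM stg_menu_reconcile "
            "WHERE menu_version = :v AND location_id = :loc"), params)
        priced.to_sql("stg_menu_reconcile", conn, if_exists="append", index=False)
        conn.execute(text("""
            INSERT INTO theoretical_cost (location_id, menu_version, pos_item_id, cost)
            SELECT location_id, menu_version, pos_item_id,
                   SUM((qty_used::NUMERIC / NULLIF(yield_factor::NUMERIC, 0))
                       * unit_cost::NUMERIC)
            FROM stg_menu_reconcile
            WHERE menu_version = :v AND location_id = :loc
            GROUP BY location_id, menu_version, pos_item_id
            ON CONFLICT (location_id, menu_version, pos_item_id)
            DO UPDATE SET cost = EXCLUDED.cost
        """), params)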

Step 6 — Classify errors and route permanent failures to a DLQ

Not every failure deserves a retry. A DNS blip or a 503 from the POS API polling layer is transient and should back off; a schema mismatch or an invalid POS mapping is permanent and must land in a dead-letter queue for a human, not spin through three doomed retries.

def classify_error(exc: Exception) -> str:
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return "transient"
    if "invalid_schema" in str(exc).lower():
        return "permanent"
    return "unknown"

@app.task(bind=True, max_retries=3, name="menu_sync.tasks.resilient_chunk")
def resilient_chunk(self, chunk_data: dict) -> dict:
    try:
        execute_chunk_sync(chunk_data)
        return {"status": "ok", "chunk_id": chunk_data["chunk_id"]}
    except Exception as exc:
        kind = classify_error(exc)
        if kind == "permanent":
            route_to_dlq(chunk_data, reason=str(exc))
            logger.error("Permanent failure, chunk %s -> DLQ", chunk_data["chunk_id"])
            return {"status": "failed_permanent", "chunk_id": chunk_data["chunk_id"]}
        raise self.retry(exc=exc, countdown=15 * (2 ** self.request.retries))

Verification and Validation

Confirm each guarantee before trusting the fan-out in production.

Duplicates actually skip. Enqueue the same (location_id, menu_version) twice within the window and assert exactly one returns success and the other skipped. The Redis key menu_sync:lock:<key> should exist with a TTL between 0 and 300.
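Late acks requeue on crash. Kill a worker mid-task (kill -9) and watch the message reappear on menu_sync in the broker. With task_reject_on_worker_lost=True the job must be redelivered, not lost. Two caveats when reading the result: if the whole worker process dies rather than one pool child, the Redis transport restores the unacknowledged message only after its visibility timeout (one hour by default), and a redelivery that lands while the dead attempt's lock is still alive returns skipped without doing the work, so check theoretical_cost for the location rather than trusting the task status.
No float on the money path. Assert the cost columns of the mapped rows are still strings: mapped = df.loc[~df["is_unmapped"]]; assert mapped["unit_cost"].map(type).eq(str).all(). Unmapped rows are left out of the check because a failed join fills them with NaN, which is a float. The first NUMERIC cast happens in SQL, nowhere earlier.
Unmapped rows are excluded, not zeroed. After a run, SELECT count(*) FROM stg_menu_reconcile WHERE ... plus the is_unmapped count logged by the worker should equal the row count of the reconciled frame. A rising unmapped count for one location points at drift in its POS taxonomy, not at the queue.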
Costs match a baseline. Diff the new theoretical_cost against the prior version before promoting it, exactly as a variance mapping job would — a swing beyond your tolerance band means a bad price row, not a real menu change.
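
The baseline diff in the last check can be sketched as a relative-change filter. The function name, the 10% default tolerance, and the dict-of-decimal-strings input shape are all assumptions for illustration.

```python
from decimal import Decimal


def cost_swings(prior: dict, new: dict, tolerance: str = "0.10") -> dict:
    """Return {pos_item_id: relative change} for items that moved beyond tolerance."""
    limit = Decimal(tolerance)
    flagged = {}
    for item_id, new_cost in new.items():
        if item_id not in prior:
            continue  # brand-new items have no baseline; review them separately
        old = Decimal(prior[item_id])
        if old == 0:
            continue  # relative change is undefined against a zero baseline
        change = (Decimal(new_cost) - old) / old
        if abs(change) > limit:
            flagged[item_id] = change
    return flagged


prior_costs = {"P1": "4.00", "P2": "10.00"}
new_costs = {"P1": "4.20", "P2": "12.50", "P3": "1.00"}
flagged = cost_swings(prior_costs, new_costs)
```

In this example only P2 is flagged, with a relative change of 0.25; P1 moved 5% and stays inside the band, and P3 has no baseline to compare against.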

Gotchas and Edge Cases

yield_factor = 0 divide-by-zero. The cost formula divides by yield, so a zero factor from a bad pricing row would error the whole INSERT. The NULLIF(yield_factor::NUMERIC, 0) guard turns that line's cost into NULL instead, but SUM ignores NULLs, so the item's total is silently understated unless you also check the staged rows for a zero or null factor. The real fix is validating factors upstream in the yield factor calculation frameworks so a zero never reaches the queue.
Lock TTL shorter than the job. If a chunked reconciliation runs longer than the 300-second lock, the key expires mid-flight and a concurrent retry can start a second copy. Size the TTL to comfortably exceed your p99 task duration, or renew the lock with a heartbeat for very large payloads.
IEEE-754 drift from a stray float. The moment any cost is read into a pandas float column — even for a quick “sanity check” sum — sub-cent error starts accumulating across thousands of covers. Keep costs as strings until the NUMERIC cast; use decimal.Decimal if you must compute in Python.
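Window-boundary duplicate. Two pushes of the same version that straddle the five-minute window boundary hash to different keys and both run. A genuinely new version always gets its own key because menu_version is part of the hash; the window exists so the same version can be re-synced later. That makes the idempotency window a coalescing convenience, not a global uniqueness guarantee. The sink's ON CONFLICT upsert is the true last line of defence, provided the staging rows for that location and version are replaced rather than appended a second time.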
Cross-location bleed. Never route two locations’ payloads through one shared task or lock. A slow or failing store must fail in isolation; folding them together reintroduces exactly the cascade this design removes and undermines the multi-location cost center architecture that keeps each store’s P&L independent.
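
The heartbeat idea from the lock-TTL gotcha can be sketched with an owner token, so that only the task holding the lock may renew or release it. OwnedLock below is an in-memory stand-in written for illustration, not a Redis client; against Redis, the compare-then-renew and compare-then-delete steps would each need to be atomic, for example inside a server-side script.

```python
import uuid
from typing import Optional


class OwnedLock:
    """In-memory model of a TTL lock whose holder is identified by a token."""

    def __init__(self) -> None:
        self._store = {}  # key -> (owner token, expiry time in seconds)

    def acquire(self, key: str, now_s: float, ttl_s: int) -> Optional[str]:
        held = self._store.get(key)
        if held is not None and held[1] > now_s:
            return None  # someone else holds a live lock
        token = uuid.uuid4().hex
        self._store[key] = (token, now_s + ttl_s)
        return token

    def renew(self, key: str, token: str, now_s: float, ttl_s: int) -> bool:
        held = self._store.get(key)
        if held is None or held[0] != token or held[1] <= now_s:
            return False  # lost or expired: the caller must not assume ownership
        self._store[key] = (token, now_s + ttl_s)
        return True

    def release(self, key: str, token: str) -> bool:
        held = self._store.get(key)
        if held is None or held[0] != token:
            return False  # never delete a lock that belongs to another task
        del self._store[key]
        return True


lock = OwnedLock()
token = lock.acquire("menu_sync:lock:abc", now_s=0, ttl_s=300)
renewed = lock.renew("menu_sync:lock:abc", token, now_s=250, ttl_s=300)   # heartbeat
blocked = lock.acquire("menu_sync:lock:abc", now_s=400, ttl_s=300)        # still held
```

Without the heartbeat at 250 seconds, the second acquire at 400 seconds would have succeeded and started a concurrent copy of the job.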

Frequently Asked Questions

Why Redis for both the broker and the distributed lock instead of a database queue?

Redis gives you an atomic SET NX EX for the idempotency lock and low-latency message delivery from the same infrastructure, which keeps the moving parts to a minimum for a menu-sync workload measured in thousands — not billions — of tasks. If you already run RabbitMQ you can use it as the broker and keep Redis solely for the lock; the task logic is unchanged. Reach for a database-backed queue only when you need transactional enqueue-with-write semantics that Redis cannot give you.

Should the theoretical cost be computed in pandas or in PostgreSQL?

Do the structural reconciliation — joining BOM lines to POS items and pricing — in vectorized pandas, because that is where merge and chunking shine. Do the monetary arithmetic in PostgreSQL NUMERIC, because exact decimal math on money is precisely what a binary float cannot promise. Splitting the two keeps the fast path fast and the money path exact, and it means the same cost expression is used everywhere the BOM cost roll-up reads it.

How do I scale this across regions without cross-region latency?

Deploy a queue per geographic cluster (menu_sync_us, menu_sync_eu) and route each task on location metadata, either with a router function registered in task_routes or by passing queue= to apply_async, so a push in one region never waits behind another. A plain name-to-queue mapping in task_routes is not enough here, because every location shares the same task name. Run a dedicated worker pool per queue sized to that region’s location count. This keeps a synchronized menu drop from serializing across the Atlantic and lets you tune concurrency and retry budgets independently per region.
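
A router function is one way to do this. The sketch below follows Celery's router signature (name, args, kwargs, options, task=None, **kw); the LOCATION_REGION lookup and the location IDs in it are hypothetical and would come from your own location metadata.

```python
REGION_QUEUES = {"us": "menu_sync_us", "eu": "menu_sync_eu"}
LOCATION_REGION = {"loc-nyc-01": "us", "loc-ber-07": "eu"}  # hypothetical lookup table


def route_by_region(name, args, kwargs, options, task=None, **kw):
    """Pick a regional queue from the task's location_id argument."""
    if name != "menu_sync.tasks.sync_location_menu":
        return None  # not ours: let other routers or the default queue decide
    # location_id is the first positional argument of sync_location_menu.
    location_id = kwargs.get("location_id") or (args[0] if args else None)
    region = LOCATION_REGION.get(location_id)
    # Unknown locations fall back to the default queue instead of failing dispatch.
    return {"queue": REGION_QUEUES.get(region, "menu_sync")}


# Registered on the app with: app.conf.task_routes = (route_by_region,)
eu_route = route_by_region("menu_sync.tasks.sync_location_menu",
                           ("loc-ber-07", {}, "v1"), {}, {})
```

Falling back to menu_sync for an unmapped location keeps a newly opened store syncing, at the cost of possibly crossing regions until the lookup is updated.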

What belongs in the dead-letter queue versus a retry?

Retry anything transient and self-healing — timeouts, connection resets, rate-limit 429s, transient 5xx from the POS. Dead-letter anything a retry cannot fix: schema mismatches, invalid POS mappings, or a payload that fails validation. The rule of thumb is whether waiting and trying again could plausibly succeed; if not, a human needs to see it, and burning three retries first just delays the alert and wastes worker capacity.
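
For failures that arrive as HTTP status codes, the rule of thumb can be written down as a small classifier. This is a sketch under the assumption that the POS client exposes the response status; classify_status is a hypothetical companion to classify_error, and treating every 5xx as transient is a simplification you may want to narrow.

```python
RETRYABLE_4XX = {408, 429}  # request timeout and rate limiting can succeed later


def classify_status(status_code: int) -> str:
    """Map an HTTP status to 'transient', 'permanent', or 'unknown'."""
    if status_code in RETRYABLE_4XX or 500 <= status_code < 600:
        return "transient"
    if 400 <= status_code < 500:
        return "permanent"  # the request itself is wrong; waiting will not fix it
    return "unknown"


labels = {code: classify_status(code) for code in (429, 503, 422, 404, 200)}
```

Here 429 and 503 come back as transient, 422 and 404 as permanent, and 200 as unknown, since a success code should never reach an error classifier.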

Async Batch Processing Workflows — the parent workflow and decoupling rationale this Celery pipeline implements.
POS API Polling Strategies — the upstream source whose transient failures feed this task’s retry classifier.
Mapping POS Taxonomies to Ingredients — the canonical ID resolution this sync joins against.
Designing Recipe BOM Databases — the bill-of-materials schema whose cost roll-up this pipeline writes.
Variance Mapping Methodologies — the theoretical-vs-actual layer that diffs each synced cost against a baseline.
Threshold Tuning for Alerts — where a cost swing surfaced by this sync becomes an actionable alert.


For deeper implementation reference, consult the official Celery task execution documentation for retry, acknowledgement, and routing semantics.