POS API Polling Strategies

Every theoretical-versus-actual food cost number a multi-unit operator trusts is only as fresh as the sales feed underneath it. When point-of-sale (POS) transactions arrive once a night as a batch export — or worse, as a manual spreadsheet pull — variance compounds for hours before anyone can act on it, and a portioning drift that started at Friday lunch is invisible until Monday. This page sits inside the broader Data Ingestion & Recipe Parsing Workflows domain and scopes down to one specific problem: pulling itemized POS transactions into the costing engine continuously, deterministically, and without gaps or duplicates, so that actual consumption can be reconciled against theory in near real time.

The problem is not “call the sales endpoint.” It is maintaining a stateful, idempotent synchronization loop against a rate-limited vendor API — Toast, Square, Clover, Lightspeed — that returns pages of transactions keyed by an opaque cursor or an updated_after timestamp, while surviving 429s, partial pages, clock skew, and mid-batch crashes without ever losing a line item or counting one twice. Solve that boundary correctly and the sales feed becomes a trustworthy input; get it wrong and every downstream margin figure inherits a data-integrity bug that looks exactly like an operational one.

Concept and Data Contract

A polling engine is a long-running loop that repeatedly asks the POS vendor “what has changed since the last time I asked?” and advances a persisted marker so the next iteration resumes exactly where the previous one stopped. The two moving parts are the request contract (what you send) and the response contract (what you must be able to trust in what comes back).

The input to each cycle is a cursor position — either an opaque next_cursor string the vendor issued on the previous page, or a monotonic updated_after epoch timestamp — plus a bounded limit. The output contract, the load-bearing part, is that every transaction line that leaves this subsystem is a fully validated record carrying a stable transaction_id, a location_id, a POS menu_item_id, a quantity, a Decimal gross_price, and an updated_at timestamp — or it is quarantined with a machine-readable reason code. Nothing is silently coerced, and no line item is emitted twice across restarts.

That guarantee is what lets the rest of the estate build on the feed. Once validated, each menu_item_id is resolved against the POS taxonomy mappings that translate a sold item into its ingredient components, then costed through the recipe BOM cost roll-up. The polling layer’s only job is to deliver a complete, de-duplicated, correctly typed stream; the moment it starts making costing decisions it becomes impossible to reason about.

Architecture Decision: Cursor-Based Incremental Polling over Full Snapshots

Three decisions define this engine, and each is a deliberate choice against a more obvious alternative.

The first is incremental delta sync over full-snapshot polling. It is simpler to re-request the entire day’s transactions every cycle and diff locally, and on a single low-volume site it even works. But a flagship location doing 800+ covers pushes tens of thousands of line items a day; re-pulling all of them every minute burns the vendor’s rate budget, saturates your egress, and forces a full local diff on each pass. A cursor-based delta — “give me only what changed after this marker” — makes each cycle O(new rows) instead of O(all rows), which is the only shape that scales across a fleet. The trade is that you must persist and correctly advance the cursor; a full snapshot is stateless but unaffordable, a delta is cheap but stateful, and for a continuous feed the state is worth managing.
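
To put rough numbers on that trade, the sketch below counts rows transferred per day under each strategy. The figures (20,000 line items, one poll per minute) and the uniform-arrival assumption are illustrative only, not measurements from any vendor.

```python
def snapshot_rows_fetched(rows_per_day: int, polls_per_day: int) -> int:
    """Rows transferred in a day if every poll re-pulls the whole day so far.

    Assumes sales arrive at a uniform rate, so poll k of P sees k/P of the
    day's rows. Real traffic is peakier, but the growth shape is the same.
    """
    return sum(rows_per_day * k // polls_per_day for k in range(1, polls_per_day + 1))


def delta_rows_fetched(rows_per_day: int) -> int:
    """Rows transferred in a day if each poll fetches only rows it has not seen."""
    return rows_per_day


rows, polls = 20_000, 1_440  # hypothetical: 20k line items, one poll per minute
print("snapshot:", snapshot_rows_fetched(rows, polls))
print("delta:   ", delta_rows_fetched(rows))
```

Under those assumptions the snapshot strategy moves roughly 14.4 million rows to keep 20,000 rows current, which is why the delta's extra state is the cheaper cost to carry.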

The second is cursor pagination over offset pagination. Offset pagination (?page=3) silently drops or double-counts rows whenever the underlying set changes mid-scan — and a live POS set changes constantly. An opaque cursor encodes a stable position in the result stream, so a row inserted during pagination cannot shift the window under you. Where a vendor offers both, the cursor is the only correct choice for a feed that must be complete and duplicate-free.
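
The drift is easy to reproduce in memory. The sketch below is a toy model, not any vendor's API: it assumes a newest-first result set of integer ids, and `page_by_offset` and `page_by_cursor` are hypothetical helpers standing in for the two pagination styles.

```python
from __future__ import annotations


def page_by_offset(rows: list[int], offset: int, limit: int) -> list[int]:
    """Offset pagination: position is counted from the front of the live set."""
    return rows[offset : offset + limit]


def page_by_cursor(rows: list[int], after_id: int | None, limit: int) -> list[int]:
    """Cursor pagination: position is 'everything older than the last id seen'."""
    remaining = rows if after_id is None else [r for r in rows if r < after_id]
    return remaining[:limit]


feed = [5, 4, 3, 2, 1]  # newest first

first_offset = page_by_offset(feed, 0, 2)
first_cursor = page_by_cursor(feed, None, 2)

feed.insert(0, 6)  # a new sale lands while the scan is in progress

second_offset = page_by_offset(feed, 2, 2)
second_cursor = page_by_cursor(feed, first_cursor[-1], 2)

print("offset pages:", first_offset, second_offset)
print("cursor pages:", first_cursor, second_cursor)
```

The offset scan returns id 4 on both pages because the insert shifted every row down by one. The cursor scan continues from the last id it saw and returns no repeats; the new row 6 is not part of this scan, and in a delta feed it would be picked up by the next poll.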

A third decision is quieter but just as important: poll, do not depend on webhooks alone. Vendor webhooks are a fine low-latency nudge, but they are best-effort — they drop during outages and replay out of order. Polling with a persisted high-water mark is the source of truth that guarantees eventual completeness; webhooks, where available, only shorten the interval. For onboarding a new location or recovering after an extended outage, seed the state from a bulk pull through CSV bulk import automation before switching the cursor to the live stream, so the marker starts inside the vendor’s retention window rather than before it.
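
One way to combine the two, sketched under the assumption that the webhook receiver and the poll loop share a process: the webhook handler only sets a `threading.Event`, and the loop treats it as permission to poll early. `wait_for_next_poll` is a hypothetical helper, and the webhook payload itself is never trusted as data.

```python
import threading


def wait_for_next_poll(nudge: threading.Event, interval_s: float) -> bool:
    """Block until the poll interval elapses or a webhook nudge arrives.

    Returns True if woken early by a nudge, False on a normal timeout.
    Either way the caller runs an ordinary poll cycle next, so a dropped
    or out-of-order webhook costs latency, never completeness.
    """
    woken = nudge.wait(timeout=interval_s)
    nudge.clear()  # one nudge buys one early poll
    return woken


nudge = threading.Event()
nudge.set()  # what a webhook handler would do on receipt
print(wait_for_next_poll(nudge, 30.0))  # returns immediately because the flag is set
```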

Phase 1 — Typed State and the Delta Request

The engine begins with a strictly typed state record and a validated request layer. State is deliberately minimal: the last cursor and the highest updated_at epoch successfully committed. Everything else — retry counts, timings — is transient. Monetary fields are Decimal from the moment they are parsed; a POS gross_price never touches a binary float.

from __future__ import annotations

from decimal import Decimal
from pathlib import Path
from typing import Any

from pydantic import BaseModel, Field, field_validator


class PollState(BaseModel):
    """Durable, minimal resume marker for one location's feed."""

    location_id: str
    last_cursor: str | None = None
    last_high_water: int = 0  # epoch seconds of newest committed row

    @classmethod
    def load(cls, path: Path, location_id: str) -> "PollState":
        if path.exists():
            return cls.model_validate_json(path.read_text(encoding="utf-8"))
        return cls(location_id=location_id)

    def save(self, path: Path) -> None:
        tmp = path.with_suffix(".tmp")
        tmp.write_text(self.model_dump_json(), encoding="utf-8")
        tmp.replace(path)  # atomic swap: a crash never leaves a torn state file


class PosLineItem(BaseModel):
    """The response contract every emitted row must satisfy."""

    transaction_id: str = Field(min_length=1)
    location_id: str = Field(min_length=1)
    menu_item_id: str = Field(min_length=1)
    quantity: int = Field(gt=0)
    gross_price: Decimal = Field(ge=0)
    updated_at: int = Field(gt=0)  # epoch seconds

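    @field_validator("gross_price", mode="before")
    @classmethod
    def _coerce_money(cls, v: Any) -> Decimal:
        # Parse from string to avoid binary-float rounding on cents.
        # Decimal signals bad input (null, empty, non-numeric) with
        # InvalidOperation, an ArithmeticError that Pydantic does not convert.
        # Re-raise as ValueError so the row is quarantined, not the page crashed.
        try:
            return Decimal(str(v))
        except ArithmeticError as exc:
            raise ValueError(f"gross_price is not a valid decimal: {v!r}") from exc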

The PollState.save method writes to a temp file and atomically renames it. That single detail is what makes the marker crash-safe: a process killed mid-write leaves the previous valid state intact rather than a half-written JSON file that would corrupt the next resume. The request layer builds parameters from that state — cursor when present, high-water timestamp on a cold start.

import httpx


class DeltaClient:
    def __init__(self, base_url: str, api_key: str, *, page_limit: int = 500) -> None:
        self._base_url = base_url.rstrip("/")
        self._page_limit = page_limit
        self._client = httpx.Client(
            headers={"Authorization": f"Bearer {api_key}", "Accept": "application/json"},
            timeout=httpx.Timeout(15.0),
        )

    def fetch_page(self, state: PollState) -> dict[str, Any]:
        params: dict[str, Any] = {"limit": self._page_limit}
        if state.last_cursor:
            params["cursor"] = state.last_cursor
        else:
            params["updated_after"] = state.last_high_water
        resp = self._client.get(f"{self._base_url}/v1/transactions/delta", params=params)
        resp.raise_for_status()
        return resp.json()

Phase 2 — Validation and Error Routing

A raw page from the vendor is untrusted input. Some rows will be missing a menu_item_id, carry a null price on a comped item, or arrive with an updated_at of zero. Validating each row against PosLineItem turns those into routed, reviewable quarantine records instead of silent poison in the cost feed. The parse is per-row so that one bad line never rejects an otherwise good page.

import logging
from collections.abc import Iterable

from pydantic import ValidationError

logger = logging.getLogger("pos.poller")


def validate_page(
    rows: Iterable[dict[str, Any]],
    *,
    batch_id: str,
) -> tuple[list[PosLineItem], list[dict[str, Any]]]:
    valid: list[PosLineItem] = []
    quarantined: list[dict[str, Any]] = []
    for raw in rows:
        try:
            valid.append(PosLineItem.model_validate(raw))
        except ValidationError as exc:
            quarantined.append(
                {
                    "batch_id": batch_id,
                    "raw": raw,
                    "reason": exc.errors(include_url=False),
                }
            )
    if quarantined:
        logger.warning(
            "quarantined_rows",
            extra={"batch_id": batch_id, "count": len(quarantined)},
        )
    return valid, quarantined
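
The `extra=` fields on that warning only reach the log output if the handler's formatter emits them; the standard library's default formatter does not. A minimal sketch of a JSON formatter follows. The `JsonFormatter` class and its `FIELDS` tuple are assumptions for illustration, and the batch id value is made up.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including selected extra= fields."""

    # Hypothetical allow-list: the correlation fields used by the poller.
    FIELDS = ("batch_id", "count")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Keys passed via extra= become attributes on the LogRecord.
        for name in self.FIELDS:
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
poller_log = logging.getLogger("pos.poller")
poller_log.addHandler(handler)
poller_log.warning("quarantined_rows", extra={"batch_id": "loc42-0001", "count": 3})
```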

Quarantined rows are written to a durable side table and never dropped, so a culinary manager or data owner can review why a line failed and, if it was a genuine transaction, replay it after the mapping is fixed. The batch_id threads through every log line so a single quarantined row can be traced back to the exact poll cycle that produced it without grepping free text. Structured JSON logging with that correlation id is what makes an intermittent feed debuggable at estate scale.
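
As one possible shape for that side table, the sketch below uses SQLite from the standard library as a stand-in for the production database. The table name, columns, and the explicit `conn` parameter are assumptions; a one-argument form can be obtained by binding the connection with `functools.partial`.

```python
import json
import sqlite3
from typing import Any


def open_quarantine(db_path: str) -> sqlite3.Connection:
    """Open (and if needed create) a hypothetical quarantine side table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pos_quarantine (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            batch_id TEXT NOT NULL,
            raw_json TEXT NOT NULL,
            reason_json TEXT NOT NULL
        )
        """
    )
    return conn


def persist_quarantine(conn: sqlite3.Connection, quarantined: list[dict[str, Any]]) -> int:
    """Insert every quarantined row in one transaction; returns the row count."""
    rows = [
        # default=str keeps non-JSON values (e.g. exception objects) storable.
        (q["batch_id"], json.dumps(q["raw"], default=str), json.dumps(q["reason"], default=str))
        for q in quarantined
    ]
    with conn:  # commits on success, rolls back if any insert fails
        conn.executemany(
            "INSERT INTO pos_quarantine (batch_id, raw_json, reason_json) VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)
```

Storing the raw payload alongside the reason is what makes replay possible once the underlying problem is fixed.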

The most common validation failure is not malformed data — it is an unmapped menu_item_id, a new menu item the POS started selling before anyone added it to the recipe database. That is a mapping gap, not a parse error, so it is surfaced against the POS taxonomy mappings registry as a distinct reason code rather than being silently costed at zero, which mirrors the vendor-specific work in mapping Toast POS categories to ingredient SKUs.
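
A schema check cannot detect a mapping gap, because an unmapped id is still a well-formed string. A minimal sketch of the separate registry check follows, assuming the set of known ids has already been loaded from the mapping registry; `split_unmapped` and the `UNMAPPED_MENU_ITEM` code are hypothetical names.

```python
from collections.abc import Iterable
from typing import Any

UNMAPPED_MENU_ITEM = "UNMAPPED_MENU_ITEM"  # hypothetical reason code


def split_unmapped(
    rows: Iterable[dict[str, Any]],
    known_item_ids: set[str],
    *,
    batch_id: str,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Separate rows whose menu_item_id has a recipe mapping from those without."""
    mapped: list[dict[str, Any]] = []
    gaps: list[dict[str, Any]] = []
    for row in rows:
        if row["menu_item_id"] in known_item_ids:
            mapped.append(row)
        else:
            # Same record shape as a validation quarantine, distinct reason.
            gaps.append({"batch_id": batch_id, "raw": row, "reason": UNMAPPED_MENU_ITEM})
    return mapped, gaps


mapped, gaps = split_unmapped(
    [{"menu_item_id": "burger"}, {"menu_item_id": "new-special"}],
    {"burger", "fries"},
    batch_id="loc42-0001",
)
print(len(mapped), len(gaps))
```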

Phase 3 — Serialization and Downstream Handoff

Once a page is validated, the engine does three things in a strict order that guarantees idempotency: it hands the valid rows to the downstream consumer, then computes the new high-water mark, and only then commits state. Committing state before the handoff succeeds would let a crash skip rows forever; committing afterwards gives at-least-once delivery, and the downstream de-duplicates any replayed rows on transaction_id.

import time
from collections.abc import Callable

Sink = Callable[[list[PosLineItem]], None]


def run_cycle(
    client: DeltaClient,
    state: PollState,
    state_path: Path,
    sink: Sink,
    *,
    batch_id: str,
) -> bool:
    """Process one page. Returns True if rows were committed, False if idle."""
    payload = client.fetch_page(state)
    rows: list[dict[str, Any]] = payload.get("data", [])
    if not rows:
        return False

    valid, quarantined = validate_page(rows, batch_id=batch_id)
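    # persist_quarantine is an external helper (not defined in this listing)
    # that writes to the durable side table; quarantined rows are never dropped.
    persist_quarantine(quarantined)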

    # 1. Hand off BEFORE advancing state — at-least-once, deduped downstream.
    sink(valid)

    # 2. Advance the marker from the newest row we actually saw.
    high_water = max((r.updated_at for r in valid), default=state.last_high_water)
    next_cursor = payload.get("next_cursor")

    # 3. Commit state last, atomically.
    state.last_cursor = next_cursor
    state.last_high_water = max(state.last_high_water, high_water)
    state.save(state_path)
    return True

The sink is intentionally an injected callable rather than hard-wired reconciliation logic. In production it publishes to a durable queue that the costing workers drain — the same async handoff pattern documented in async batch processing workflows and its Celery menu-sync implementation. Decoupling the feed from the compute means a slow reconciliation cannot stall the poll loop, and a poll outage cannot lose already-published work. Each published line carries its transaction_id as the idempotency key so a replayed page is absorbed, not double-counted.
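
To make the "absorbed, not double-counted" property concrete, here is an in-memory stand-in for the consumer side. `DedupingSink` is hypothetical and takes plain dicts for brevity; a production consumer would more likely enforce the same rule with a unique constraint on the idempotency key in its own store.

```python
from typing import Any


class DedupingSink:
    """Accept each transaction_id once; silently absorb replays."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self.accepted: list[dict[str, Any]] = []

    def __call__(self, rows: list[dict[str, Any]]) -> int:
        """Returns how many rows in this page were new."""
        fresh = 0
        for row in rows:
            key = row["transaction_id"]
            if key in self._seen:
                continue  # replayed after a crash or retry: already counted
            self._seen.add(key)
            self.accepted.append(row)
            fresh += 1
        return fresh


sink = DedupingSink()
page = [{"transaction_id": "t1"}, {"transaction_id": "t2"}]
print(sink(page))  # first delivery
print(sink(page))  # same page replayed after a crash before commit
```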

The scheduler that drives run_cycle is volume-aware. Rather than using a fixed interval, it shortens the interval as sales velocity rises and lengthens it when a location is quiet, trading vendor quota against freshness per location.

def next_interval(covers_last_hour: int, *, floor_s: int = 30, ceil_s: int = 600) -> int:
    """Faster polling when sales velocity is high, slower when it is quiet."""
    if covers_last_hour >= 120:
        return floor_s
    if covers_last_hour <= 10:
        return ceil_s
    # Linear interpolation between the two bounds.
    span = ceil_s - floor_s
    scaled = span * (120 - covers_last_hour) / (120 - 10)
    return int(floor_s + scaled)

Production Hardening

A loop that is correct on the happy path but brittle under real vendor conditions will still erode trust in the numbers. The following controls keep the feed dependable across a fleet.

Idempotency keys everywhere. The transaction_id is the natural key downstream; the atomic state file is the resume key upstream. Together they mean a retried page, a double-fired schedule, or a crash-and-restart converges to the same committed set instead of inflating actual consumption.
Backoff with jitter on 429/503. Aggressive intervals during peak service trigger rate limits, and a synchronized fleet all retrying at once creates a thundering herd. Exponential backoff plus randomized jitter spreads retries out; the throttling contract, quota accounting, and circuit-breaker specifics live in the dedicated rate limiting strategies for POS APIs page.
Clock-skew guard on the high-water mark. Advance the marker only from the updated_at of rows you actually validated, never from your own wall clock. If the vendor’s clock runs behind yours, a timestamp-based updated_after can skip the last few seconds of a window; overlap the query by a small epsilon and lean on transaction_id de-duplication to absorb the re-reads.
Bounded retries, then quarantine. Retry only idempotent-safe transient failures (429, 503, connection resets), and cap the number of attempts so a persistent outage surfaces as an alert instead of an endless loop. Permanent failures, such as a 400 on a malformed cursor or a 401 on an expired key, must break the loop loudly and alert, not retry blindly against a wall.
Memory discipline. Hold one page in memory at a time and let the queue absorb backlog; never accumulate a day of transactions in a list. A single location’s cycle should have a flat footprint regardless of how far behind it fell.
Decimal to the boundary. Keep gross_price as Decimal through the entire handoff and persist to a PostgreSQL NUMERIC column. Rounding happens only at the reporting edge, never mid-pipeline, so the actual-cost figure the variance engine reads is exact base-10.

Failure Modes and Troubleshooting

Most polling failures are silent — the job reports success while quietly losing or duplicating sales. These are the patterns to detect deliberately.

State committed before handoff. If you advance the cursor before the sink confirms, a crash in that window skips those rows permanently and the actual-cost feed shows a phantom drop in consumption. The fix is ordering: publish, then compute the mark, then commit state — never the reverse.
Offset pagination drift. A vendor endpoint paged by offset/page will double-count or skip rows whenever the live set changes mid-scan. The symptom is a variance that jitters run to run on identical data. Switch to the cursor endpoint; if only offset exists, snapshot a stable window with a frozen updated_before bound.
Cursor expiry after an outage. Opaque cursors expire outside the vendor’s retention window. Resuming a stale cursor after a long outage either returns an empty page with no error or fails with a 400. Detect the gap between last_high_water and now; if it exceeds retention, fall back to a bounded timestamp backfill rather than trusting the dead cursor.
Zero-priced comps swallowing signal. Comped and voided line items arrive with gross_price of zero or null. Costed naively they read as free food and understate actual consumption. Validate them through the same contract but tag them so the variance layer treats a comp as a known cost, not a missing one.
Timezone-naive high-water marks. Mixing a local timestamp into an epoch marker skips or replays an hour twice a year at DST boundaries. Keep every timestamp in UTC epoch seconds end to end; convert to local only for human-facing reports.
Unmapped items costed at zero. A new menu item the POS sells before it exists in the recipe database resolves to no BOM and silently contributes zero to theoretical cost, inflating apparent variance. Surface unmapped menu_item_ids against the mapping registry as an explicit reason code before they reach the reconciliation, not after.
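
The cursor-expiry check can be expressed as a small pure function that runs before each resume. The sketch assumes the vendor's retention window is known in seconds and that a few seconds of overlap are acceptable because downstream de-duplicates on transaction_id; `ResumePlan` and `plan_resume` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ResumePlan:
    use_cursor: bool
    updated_after: int | None  # epoch seconds; set only for a timestamp backfill


def plan_resume(
    last_cursor: str | None,
    last_high_water: int,
    now: int,
    *,
    retention_s: int,
    overlap_s: int = 5,
) -> ResumePlan:
    """Decide whether the stored cursor can be trusted. All times are UTC epoch seconds."""
    gap = now - last_high_water
    if last_cursor is not None and gap <= retention_s:
        return ResumePlan(use_cursor=True, updated_after=None)
    # Cursor missing or presumed expired: backfill by timestamp, re-reading a
    # small overlap, but never asking for data older than the retention window.
    start = max(last_high_water - overlap_s, now - retention_s, 0)
    return ResumePlan(use_cursor=False, updated_after=start)


print(plan_resume("abc", 1_000, 1_500, retention_s=3_600))
print(plan_resume("abc", 1_000, 10_000, retention_s=3_600))
```

When the backfill start is clamped to the retention boundary, the rows between last_high_water and that boundary are no longer available from the API and have to come from a bulk import instead.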

Once the feed is complete and typed, actual consumption is reconciled against theory through the variance mapping methodologies that separate genuine operational drift from ingest artifacts — which is only meaningful when the polling layer guarantees that a variance is real and not a lost or duplicated transaction.

Frequently Asked Questions

Why poll incrementally instead of re-pulling the full day of transactions each cycle?

Incremental delta sync makes each cycle O(new rows) instead of O(all rows). A flagship location pushes tens of thousands of line items a day; re-pulling all of them every minute exhausts the vendor’s rate budget, saturates egress, and forces a full local diff each pass. A cursor-based delta fetches only what changed, which is the only shape that scales across a multi-location fleet. The cost is that you must persist and correctly advance the cursor.

Cursor pagination or offset pagination for a live POS feed?

Cursor, always, when the vendor offers it. Offset pagination silently drops or double-counts rows whenever the underlying result set changes mid-scan, and a live POS set changes constantly. An opaque cursor encodes a stable position in the stream, so a row inserted during pagination cannot shift the window under you. If only offset exists, freeze a stable window with an updated_before bound before paging.

How do I make the loop safe to restart after a crash?

Order the writes and make the state file atomic. Publish validated rows to the queue first, then compute the high-water mark, then commit state last — so a crash before commit only replays already-published rows, which the downstream de-duplicates on transaction_id. Writing the state file to a temp path and renaming it means a process killed mid-write leaves the previous valid marker intact instead of a torn JSON file.
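
A slightly more defensive variant of the temp-file-and-rename step is sketched below. It is an assumption-laden illustration, not the page's PollState.save: it adds an fsync before the rename and a uniquely named temp file, and `atomic_write_text` is a hypothetical helper.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write text so a reader sees either the old file or the new one, never a partial."""
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)  # swaps the new file in over the old one
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The fsync narrows the window in which a power loss could leave a renamed but empty file; whether that matters depends on the filesystem and on how costly a cold-start backfill would be.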

Why keep prices as Decimal instead of float?

Binary floats cannot represent decimal cents exactly, and the error compounds across thousands of line items until the actual-cost total drifts from what the POS actually charged. Parsing gross_price from a string into Decimal and persisting to a PostgreSQL NUMERIC column keeps every intermediate value exact; rounding happens only at the reporting boundary, never mid-pipeline.
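
A short demonstration of the difference, using ten line items priced at 0.10 as an arbitrary example:

```python
from decimal import Decimal

# Ten line items at 0.10 each, summed as binary floats and as Decimals.
float_total = sum([0.10] * 10)
decimal_total = sum([Decimal("0.10")] * 10, Decimal("0"))

print(float_total == 1.0)                 # False: the float sum falls just short of 1.0
print(decimal_total == Decimal("1.00"))   # True: base-10 arithmetic is exact here

# Constructing from a float captures the float's error; constructing from a string does not.
print(Decimal(0.1) == Decimal("0.1"))     # False
```

The last line is why the validator parses gross_price through str() rather than passing a float straight to Decimal.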

What happens to a menu item the POS sells before it is in the recipe database?

It resolves to no BOM and, if handled naively, contributes zero to theoretical cost while still appearing in actual sales — which inflates apparent variance and looks like an operational problem. Surface the unmapped menu_item_id against the POS taxonomy registry as an explicit reason code so it is flagged for mapping rather than silently costed at zero.

Do I still need polling if the vendor supports webhooks?

Yes. Webhooks are a low-latency nudge but they are best-effort — they drop during outages and can replay out of order. Polling with a persisted high-water mark is the source of truth that guarantees eventual completeness; webhooks only shorten the interval. For onboarding or outage recovery, seed state from a bulk import before switching to the live cursor.

Rate Limiting Strategies for POS APIs — backoff, jitter, quota accounting, and circuit breakers for the same loop.
CSV Bulk Import Automation — the bulk vector used to seed state before switching to live polling.
Async Batch Processing Workflows — the durable queue and worker layer the poll loop hands off to.
Mapping POS Taxonomies to Ingredients — resolving each sold menu item into its recipe components.
Variance Mapping Methodologies — the downstream analytics this feed makes trustworthy.


For deeper implementation reference, consult the official Pydantic documentation on validators, strict types, and model_validate_json.