PDF Recipe Extraction Pipelines

Multi-unit operators and culinary managers routinely receive standardized recipe cards, vendor specification sheets, and corporate menu guidelines as unstructured PDFs. Translating these static documents into actionable food cost metrics requires a deterministic extraction pipeline that never silently mutates a quantity or drops a line. This guide sits inside the broader data ingestion and recipe parsing workflows architecture, where the PDF extraction layer is the foundational bridge between culinary documentation and the cost model that consumes it. The scope here is narrow and deliberate: the sequential extraction, schema validation, and structured serialization of PDF recipe data before it reaches downstream unit normalization and cost calculation. Everything downstream trusts this layer to be correct, so this layer must fail loudly rather than approximately.

The reader of this page is usually the food tech developer who owns the intake code, but the failure it prevents is felt by the operator: a single misparsed 1/2 read as 12, or an ingredient line swallowed by a page break, distorts theoretical food cost for every location that recipe touches until a monthly reconciliation exposes the gap. The pipeline below is engineered so those defects surface at ingestion, attached to a page and line number, instead of propagating into margin reports.

Data Contract: Inputs, Constraints, and Output

Before any code runs, the layer needs an explicit contract so that producers (the culinary teams exporting cards) and consumers (the cost engine) agree on what crosses the boundary.

Inputs. A born-digital PDF containing selectable text — a recipe card, a distributor specification sheet, or a corporate menu standard. Each ingredient line is expected to follow the culinary convention <quantity> <unit> <ingredient name> [– note], where quantity may be a whole number, a decimal, or a mixed fraction (1 1/2), and unit is drawn from a bounded vocabulary of measurement abbreviations. Scanned image-only PDFs are explicitly out of contract for the deterministic path and are diverted to the OCR fallback discussed in the PyPDF2 and regex parsing walkthrough.

Schema constraints. Every emitted record must carry a numeric quantity (fractions resolved to an exact rational, then to float), a unit present in the canonical vocabulary, a non-empty ingredient name, and source lineage (source_page, source_line). Records that violate any of these are not coerced — they are quarantined.

Output contract. A stream of validated, JSON-serializable records with stable field names, each tagged with a recipe_id and its source coordinates. Monetary fields use Python’s Decimal, never binary float, so that downstream BOM cost roll-up arithmetic stays exact. The units on the record are raw at this stage; canonicalization to base weight/volume happens downstream via the yield factor calculation frameworks, and this layer’s only obligation is to preserve the operator’s intent faithfully.

Architecture Decision: Direct Text Extraction over OCR

The central design choice is to parse the PDF’s embedded text stream directly rather than rasterizing pages and running optical character recognition. This is deliberate. OCR is a probabilistic process: it assigns confidence scores to character candidates and, under pressure from decorative recipe-card fonts or low-contrast scans, substitutes glyphs — a 5 becomes an 8, an l becomes a 1. Those substitutions are silent, and in a cost pipeline a silent numeric error is the worst possible outcome because it produces a plausible-but-wrong margin rather than a visible crash.

Direct text-stream extraction, by contrast, decodes the character data the PDF producer embedded. It is reproducible: given the same file and the same extraction library version, every run yields the same characters, which makes regex anchoring deterministic and lets a nightly batch be verified bit-for-bit. It also skips rasterization and model inference entirely, so it is typically far faster and cheaper than OCR, which matters when a corporate deck spans hundreds of recipes across a fleet.
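One way to make the bit-for-bit claim checkable is to fingerprint each page's extracted text and compare fingerprints between runs. The sketch below is a minimal illustration, assuming the page texts are already available as a list of strings; `fingerprint_pages` and `changed_pages` are hypothetical helper names, not part of the pipeline described here.

```python
import hashlib
from typing import List


def fingerprint_pages(page_texts: List[str]) -> List[str]:
    """Return one SHA-256 hex digest per page of extracted text."""
    return [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in page_texts]


def changed_pages(previous: List[str], current: List[str]) -> List[int]:
    """Return 1-based page numbers whose fingerprint differs between two runs."""
    changed = [
        page_num
        for page_num, (old, new) in enumerate(zip(previous, current), start=1)
        if old != new
    ]
    # Pages that exist in only one of the two runs also count as changes.
    shorter, longer = sorted((len(previous), len(current)))
    changed.extend(range(shorter + 1, longer + 1))
    return changed
```

An empty result from `changed_pages` means the two runs saw identical text on every page; any page number it returns is a candidate for a changed source document or a changed extraction library.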

OCR is not abandoned. It is demoted to a fallback that only engages when a page yields no extractable text (a scanned image). That branch is quarantined for heightened review rather than trusted inline, so the confidence penalty of OCR never contaminates the deterministic majority path. This mirrors the async-versus-inline reasoning used across the ingestion domain’s async batch processing workflows: the expensive, less certain path is isolated so it cannot block or corrupt the fast, certain one.
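The routing rule described above can be sketched as a small pure function. This is an illustration under stated assumptions: page texts arrive as a list in page order, with `None` or whitespace-only text standing in for an image-only page, and `route_pages` plus the two queue names are hypothetical.

```python
from typing import Dict, List, Optional


def route_pages(page_texts: List[Optional[str]]) -> Dict[str, List[int]]:
    """Split 1-based page numbers into a deterministic queue and an OCR review queue."""
    routes: Dict[str, List[int]] = {"deterministic": [], "ocr_review": []}
    for page_num, text in enumerate(page_texts, start=1):
        # None or whitespace-only text is treated as a scanned, image-only page.
        if text and text.strip():
            routes["deterministic"].append(page_num)
        else:
            routes["ocr_review"].append(page_num)
    return routes
```

Keeping the decision in one place makes it easy to count how many pages took the fallback path per batch, which is the number a reviewer needs to see.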

Phase 1: Deterministic Text Stream Extraction

Extraction relies on anchored regular expressions that match culinary measurement conventions: fractional quantities (1/2, 3/4), decimal weights, and standardized unit abbreviations (oz, lb, kg, ea, gal, qt). Capture groups are strictly partitioned to isolate the quantity, unit, name, and any trailing note, preventing procedural preparation text from contaminating ingredient-level data. The baseline methodology is worked end to end in parsing PDF menus with PyPDF2 and regex; the block below preserves layout coordinates so downstream errors can be mapped back to an exact page and line.

import re
from typing import List, Optional
from dataclasses import dataclass

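# Deterministic regex for culinary ingredient lines.
# Quantity alternatives are tried as mixed fraction, simple fraction, then decimal;
# the mixed form requires whitespace between the whole number and the fraction.
# Longer units come before their prefixes and \b stops "g" from matching "gal"
# or "garlic". The note separator needs leading whitespace so hyphenated names
# such as "extra-virgin" stay in the name group.
INGREDIENT_PATTERN = re.compile(
    r"^(?P<qty>\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)\s*"
    r"(?P<unit>gal|kg|oz|lb|g|ea|qt|cup|tbsp|tsp|pinch|dash|to\s+taste)\b\s*"
    r"(?P<name>[A-Za-z0-9\s\-\(\),]+?)(?:\s+[-–]\s*(?P<note>.+))?$",
    re.IGNORECASE,
)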

@dataclass
class RawIngredient:
    quantity: str
    unit: str
    name: str
    note: Optional[str] = None
    page: int = 0
    line_index: int = 0

def extract_text_stream(pdf_path: str) -> List[RawIngredient]:
    """Extract raw text from a PDF and parse ingredient lines deterministically.

    Uses pdfplumber's explicit text extraction so layout coordinates are
    preserved for downstream error mapping. Parses page-by-page to keep a
    constant memory footprint regardless of deck size.
    """
    import pdfplumber

    ingredients: List[RawIngredient] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            text = page.extract_text()
            if not text:
                # No extractable text: scanned image page -> OCR fallback branch.
                continue
            for line_idx, line in enumerate(text.splitlines()):
                match = INGREDIENT_PATTERN.match(line.strip())
                if match:
                    ingredients.append(
                        RawIngredient(
                            quantity=match.group("qty"),
                            unit=match.group("unit"),
                            name=match.group("name").strip(),
                            note=match.group("note"),
                            page=page_num,
                            line_index=line_idx,
                        )
                    )
    return ingredients

The RawIngredient dataclass is intentionally permissive — quantity and unit are still strings here. Nothing is trusted yet; the record is a faithful transcription of what the page said, coordinates included, ready to be judged by the validation layer.

Phase 2: Schema Enforcement and Quarantine Routing

Raw extracted strings must immediately pass through a strict validation layer. Culinary managers require that an ambiguous measurement or a missing unit triggers a quarantine queue rather than silent data corruption. The validation routine resolves fractional quantities to exact rationals, enforces the bounded field contract, and routes any record that fails a constraint to a structured error log carrying line-level context and PDF page coordinates. That lineage is what lets a reviewer open the source card, find the offending line in seconds, and fix it without halting the batch.

Monetary fields on the validated record use Decimal rather than float, because binary floating point cannot represent common currency values exactly and the drift compounds once these records feed cost aggregation. This is the same discipline enforced throughout the BOM cost roll-up layer downstream.
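The drift is easy to reproduce. The snippet below is a small demonstration with made-up prices, not pipeline code: it sums the same three values once as binary floats and once as Decimal.

```python
from decimal import Decimal

# Illustrative prices only: three line items at 0.10 each.
float_total = sum([0.10, 0.10, 0.10])
decimal_total = sum([Decimal("0.10")] * 3, Decimal("0"))

# Binary float accumulates representation error, so the comparison fails.
float_is_exact = float_total == 0.30

# Decimal keeps base-10 digits exactly, so the comparison holds.
decimal_is_exact = decimal_total == Decimal("0.30")
```

Note that the Decimal values are built from strings; constructing `Decimal(0.10)` from a float literal would import the float's representation error.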

from decimal import Decimal
from fractions import Fraction
from typing import Dict, List, Union
from dataclasses import dataclass
import logging

from pydantic import BaseModel, ValidationError, field_validator

logger = logging.getLogger(__name__)

# Re-declared here so the block is self-contained; in production this dataclass
# is imported from the shared module alongside the extractor above.
@dataclass
class RawIngredient:
    quantity: str
    unit: str
    name: str
    note: "str | None" = None
    page: int = 0
    line_index: int = 0

CANONICAL_UNITS = {"oz", "lb", "g", "kg", "ea", "gal", "qt", "cup", "tbsp", "tsp"}

class ValidatedIngredient(BaseModel):
    recipe_id: str
    ingredient_name: str
    quantity: float
    unit: str
    yield_count: int = 1
    theoretical_cost: Decimal = Decimal("0.00")  # exact money, never binary float
    source_page: int
    source_line: int

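    @field_validator("quantity", mode="before")
    @classmethod
    def normalize_quantity(cls, v: Union[str, int, float]) -> float:
        """Resolve fractional strings to a float via exact rationals."""
        if isinstance(v, (int, float)):
            return float(v)
        try:
            # Sum the whitespace-separated parts so "1 1/2" becomes 1 + 1/2 = 1.5.
            # Deleting the space instead would yield "11/2", which is 5.5.
            parts = [Fraction(part) for part in str(v).split()]
        except (ValueError, ZeroDivisionError) as exc:
            raise ValueError(f"Non-numeric quantity detected: {v!r}") from exc
        if not 1 <= len(parts) <= 2:
            raise ValueError(f"Non-numeric quantity detected: {v!r}")
        return float(sum(parts))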

    @field_validator("unit")
    @classmethod
    def unit_in_vocabulary(cls, v: str) -> str:
        u = v.strip().lower()
        if u not in CANONICAL_UNITS:
            raise ValueError(f"Unit outside canonical vocabulary: {v!r}")
        return u

def validate_and_quarantine(
    raw_ingredients: List[RawIngredient], recipe_id: str
) -> "tuple[List[ValidatedIngredient], List[Dict]]":
    """Validate raw records and return (validated, quarantined) so neither is lost."""
    validated: List[ValidatedIngredient] = []
    quarantine: List[Dict] = []

    for raw in raw_ingredients:
        try:
            validated.append(
                ValidatedIngredient(
                    recipe_id=recipe_id,
                    ingredient_name=raw.name,
                    quantity=raw.quantity,
                    unit=raw.unit,
                    source_page=raw.page,
                    source_line=raw.line_index,
                )
            )
        except ValidationError as exc:
            # Keep the source coordinates so a reviewer can find the offending line.
            quarantine.append(
                {
                    "recipe_id": recipe_id,
                    "error": exc.errors(include_url=False),
                    "page": raw.page,
                    "line": raw.line_index,
                    "raw_text": f"{raw.quantity} {raw.unit} {raw.name}",
                }
            )

    if quarantine:
        logger.warning(
            "quarantined_records",
            extra={"recipe_id": recipe_id, "count": len(quarantine)},
        )
        # Persist to the structured error log / quarantine bucket in production.
    return validated, quarantine

The quarantine list is not an afterthought — it is a first-class output of the layer. A batch that produces zero validated records and a full quarantine queue is a signal (a template changed, a font broke text extraction) that must page a human, not a result to be silently discarded.
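A batch-level health check along these lines could decide when to page a human. This is a sketch only: `batch_needs_attention` is a hypothetical helper, and the 20% default threshold is an assumption to be tuned against real batches, not a recommended value.

```python
def batch_needs_attention(
    validated_count: int,
    quarantined_count: int,
    max_quarantine_ratio: float = 0.2,  # assumed threshold, tune per deployment
) -> bool:
    """Return True when a batch looks broken enough to alert a human."""
    total = validated_count + quarantined_count
    if total == 0:
        # Nothing matched at all: likely a template change or a scanned deck.
        return True
    return quarantined_count / total > max_quarantine_ratio
```

An empty batch is treated as a failure rather than a success, since a PDF that yields no ingredient lines is more likely a broken extraction than an empty recipe.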

Phase 3: Async Serialization and Downstream Handoff

Once validated, records are serialized to JSON and published to an asynchronous message broker. This decouples extraction from the heavier transformations that follow, so a slow consumer never back-pressures the parser. Emitting extraction events asynchronously lets parallel consumers pick up the work: one may run unit canonicalization to convert raw units into base weight and volume, another may cross-reference extracted yields against actual sales velocity gathered through POS API polling strategies, and a third may reconcile the same ingredients arriving as spreadsheets through CSV bulk import automation. The async boundary is also what prevents memory bottlenecks during high-volume batch runs and enables horizontal scaling of the cost calculation workers.

import asyncio
import json
from contextlib import asynccontextmanager
from decimal import Decimal
from typing import Any, List
# Production: use aio_pika (RabbitMQ), aiokafka, or redis.asyncio for brokering.
# `ValidatedIngredient` is defined in the previous block.

def _json_default(value: Any) -> str:
    if isinstance(value, Decimal):
        return str(value)  # preserve exact money as a string, not a lossy float
    raise TypeError(f"Unserializable type: {type(value).__name__}")

@asynccontextmanager
async def async_queue_publisher(queue_url: str):
    """Mock async publisher for illustration. Replace with a real broker client."""

    async def publish(msg: str) -> None:
        # A coroutine function, so callers can await it like a real broker client.
        # A plain lambda around print() returns None, and awaiting None raises TypeError.
        print(f"Published: {msg}")

    print(f"Connecting to queue: {queue_url}")
    try:
        yield {"publish": publish}
    finally:
        # Runs even if publishing raises, mirroring a real connection close.
        print("Queue connection closed.")

async def publish_validated_records(
    records: "List[ValidatedIngredient]", queue_url: str
) -> None:
    async with async_queue_publisher(queue_url) as queue:
        for record in records:
            payload = json.dumps(record.model_dump(), default=_json_default)
            await queue["publish"](payload)
            await asyncio.sleep(0)  # let other tasks run between publishes

Serializing Decimal as a string rather than letting json reject it (or coercing to float) keeps the monetary contract intact all the way to the consumer, which parses it back into Decimal on receipt.
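On the consumer side, the round trip might look like the sketch below. It assumes the payload is a flat JSON object and that the consumer knows which fields carry money; `decode_record` and its `money_fields` parameter are hypothetical names introduced for illustration.

```python
import json
from decimal import Decimal
from typing import Any, Dict, Tuple


def decode_record(
    payload: str, money_fields: Tuple[str, ...] = ("theoretical_cost",)
) -> Dict[str, Any]:
    """Parse a published record and restore its money fields to Decimal."""
    record = json.loads(payload)
    for field in money_fields:
        if field in record:
            # The producer sent a string, so no float is ever involved.
            record[field] = Decimal(record[field])
    return record
```

Because the value travels as a string, trailing zeros survive too: `"12.30"` comes back as `Decimal("12.30")`, not `12.3`.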

Production Hardening

Deploying this pipeline across hundreds of locations requires strict resource governance and idempotency guarantees:

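Stream, don’t bulk-load. Parse PDFs page-by-page and yield validated records straight to the broker, so memory stays flat regardless of how large a recipe deck is. The extractor above already walks pdf.pages one page at a time, but it still collects every match into a list before returning; converting it into a generator that yields each record as it is parsed is the change that keeps memory flat.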
Idempotent retries. Wrap extraction and publishing in exponential backoff. A transient broker failure must not duplicate ingredient records, so derive a deduplication key of {recipe_id}_{ingredient_name}_{source_page}_{source_line} and enforce it at the consumer. Re-running the same PDF then converges to the same state instead of double-counting.
Unit normalization hooks. Attach the canonicalization consumers that translate regional packaging units (#10 can, flat, case) into base weight and volume before theoretical cost is computed. Keeping this out of the extractor preserves the layer’s single responsibility and lets the conversion matrix version independently.
Deterministic structured logging. Emit JSON logs with a correlation ID and the originating PDF hash on every extraction event. That trail is what enables audit reconciliation against vendor invoices and lets an operator trace a cost anomaly back to a specific document revision.
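The idempotency guidance above can be sketched as follows. The key format comes from the text; everything else is an assumption for illustration: `IdempotentSink` stands in for a consumer backed by durable storage, `retry_with_backoff` is a hypothetical helper, and `ConnectionError` stands in for whatever transient error the real broker client raises.

```python
import time
from typing import Callable, List, Set, TypeVar

T = TypeVar("T")


def dedup_key(recipe_id: str, ingredient_name: str, source_page: int, source_line: int) -> str:
    """Build the deduplication key in the format described above."""
    return f"{recipe_id}_{ingredient_name}_{source_page}_{source_line}"


class IdempotentSink:
    """In-memory stand-in for a consumer that ignores keys it has already seen."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.accepted: List[str] = []

    def accept(self, key: str, payload: str) -> bool:
        if key in self._seen:
            return False  # duplicate delivery from a retry: drop it
        self._seen.add(key)
        self.accepted.append(payload)
        return True


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, doubling the wait after each transient failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")
```

Retries make duplicate deliveries possible, and the key check at the consumer is what makes them harmless; the two pieces only give convergence when used together.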

Failure Modes and Troubleshooting

The defects that hurt most in this layer are the quiet ones — the pipeline reports success while emitting subtly wrong data. Watch for these patterns and build detection for each:

Fraction misread as an integer. A recipe line 1 1/2 lb whose space is lost in extraction can parse as 11/2 or 1. The exact-rational conversion in Phase 2 guards the arithmetic, but the upstream regex must anchor the mixed-fraction form explicitly. Detection: assert that the count of quarantined quantity errors is near zero, and spot-check a sample of mixed-fraction lines per batch.
Silent line loss at page boundaries. An ingredient split across a page break produces a line that matches nothing and is dropped without error. Detection: compare the extracted ingredient count per recipe against a stored expectation or a header-declared count, and alert on a shortfall rather than trusting a clean run.
Regional unit aliases outside the vocabulary. #10 can, flat, or each (spelled out) fail the canonical-unit validator and land in quarantine. This is the intended behavior, but a spike in unit-validation quarantines signals a new vendor template that needs an alias mapping added upstream — not a code bug.
Empty-text pages read as empty recipes. A scanned page yields no text and the extractor skips it, which can make a full scanned deck look like an empty-but-successful batch. Detection: flag any PDF whose page count greatly exceeds its extracted-line count and route it to the OCR fallback for review instead of accepting the empty result.
Encoding drift in ingredient names. Ligatures and smart punctuation from design software can break downstream joins against the ingredient dictionary. Detection: normalize names to a canonical Unicode form at validation time and quarantine names containing unexpected control characters.
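Two of the detections above lend themselves to short checks. The sketch below is illustrative: `normalize_name` and `count_shortfall` are hypothetical helpers, and NFKC is one reasonable choice of canonical form rather than the only one. NFKC folds ligatures such as "ﬁ" into plain letters, but it does not rewrite smart quotes, which would need an explicit mapping.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Fold compatibility characters (e.g. ligatures) and reject control characters."""
    normalized = unicodedata.normalize("NFKC", name).strip()
    # Category "Cc" covers ASCII and C1 control characters.
    if any(unicodedata.category(ch) == "Cc" for ch in normalized):
        raise ValueError(f"Control character in ingredient name: {name!r}")
    return normalized


def count_shortfall(extracted: int, expected: int) -> int:
    """Return how many expected ingredient lines are missing (0 if none)."""
    return max(expected - extracted, 0)
```

A non-zero `count_shortfall` for a recipe is the signal to alert on, and a `ValueError` from `normalize_name` is a reason to quarantine the record rather than pass the name downstream.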

For reference on the regex engine’s behavior and performance characteristics, consult the official Python re module documentation; when tightening the validation boundary, align with the Pydantic V2 documentation for strict type coercion and custom validators. By enforcing deterministic extraction, rigid schema validation, and asynchronous decoupling, multi-unit operators turn static culinary PDFs into reliable, cost-ready data streams — eliminating manual transcription overhead and establishing a trustworthy foundation for menu engineering analytics.

Up: Data Ingestion & Recipe Parsing Workflows — the parent domain this extraction layer feeds.
Parsing PDF Menus with PyPDF2 and Regex — the concrete regex-driven implementation of Phase 1.
CSV Bulk Import Automation — the parallel ingestion vector for spreadsheet-sourced recipe data.
POS API Polling Strategies — cross-references extracted yields against live sales velocity.
Async Batch Processing Workflows — absorbs high-volume extraction so parsing never blocks calculation.
Yield Factor Calculation Frameworks — the downstream unit canonicalization and yield mapping this layer hands off to.