PDF Recipe Extraction Pipelines
Multi-unit operators and culinary managers routinely receive standardized recipe cards, vendor specification sheets, and corporate menu guidelines as unstructured PDFs. Translating these static documents into actionable food cost metrics requires a deterministic extraction pipeline. Within the broader Data Ingestion & Recipe Parsing Workflows architecture, the PDF extraction layer serves as the foundational bridge between culinary documentation and dynamic cost modeling. This guide isolates a single discrete workflow: the sequential extraction, schema validation, and structured serialization of PDF recipe data prior to downstream unit normalization and cost calculation.
Phase 1: Deterministic Text Stream Extraction
Production-grade recipe extraction bypasses optical character recognition (OCR) in favor of direct text stream parsing wherever the PDF carries an embedded text layer. OCR introduces probabilistic character substitution errors, such as reading "l" as "1", that compound during cost aggregation. Direct text extraction prioritizes speed, byte-level reproducibility, and deterministic regex anchoring. Scanned, image-only pages yield no text stream and must be handled separately.
When processing corporate recipe templates, the baseline methodology aligns with Parsing PDF Menus with PyPDF2 and Regex. The extraction logic relies on anchored regular expressions that match culinary measurement conventions: fractional quantities (1/2, 3/4), decimal weights, and standardized unit abbreviations (oz, lb, kg, ea, gal, qt). Capture groups are strictly partitioned to isolate yield percentages, portion counts, and preparation steps, preventing procedural text from contaminating ingredient-level data.
import re
from dataclasses import dataclass
from typing import List, Optional

import pdfplumber

# Deterministic regex for culinary ingredient lines.
# The \b after the unit group stops a short unit from matching the start of a
# longer word (e.g. "g" inside "gal" or "garlic"). The note separator requires
# whitespace on both sides so hyphenated names are not split into name + note.
INGREDIENT_PATTERN = re.compile(
    r"^(?P<qty>(?:\d+\s*\d*\/\d+|\d+\.?\d*))\s*"
    r"(?P<unit>(?:oz|lb|g|kg|ea|gal|qt|cup|tbsp|tsp|pinch|dash|to\s*taste))\b\s*"
    r"(?P<name>[A-Za-z0-9\s\-\(\),]+?)(?:\s+[-–]\s+(?P<note>.+))?$",
    re.IGNORECASE | re.MULTILINE
)


@dataclass
class RawIngredient:
    quantity: str
    unit: str
    name: str
    note: Optional[str] = None
    page: int = 0
    line_index: int = 0


def extract_text_stream(pdf_path: str) -> List[RawIngredient]:
    """Extract raw text from PDF and parse ingredient lines deterministically."""
    # In production, use pdfplumber or PyPDF2 with explicit text extraction mode
    # to preserve layout coordinates for downstream error mapping.
    ingredients: List[RawIngredient] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            text = page.extract_text()
            if not text:
                # Image-only page: no text layer to parse
                continue
            for line_idx, line in enumerate(text.splitlines()):
                match = INGREDIENT_PATTERN.match(line.strip())
                if match:
                    ingredients.append(RawIngredient(
                        quantity=match.group("qty"),
                        unit=match.group("unit"),
                        name=match.group("name").strip(),
                        note=match.group("note"),
                        page=page_num,
                        line_index=line_idx
                    ))
    return ingredients
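To see how the anchored pattern behaves without opening a PDF, the sketch below runs a lightly adjusted copy of the ingredient pattern against a few hypothetical recipe-card lines. The name `DEMO_PATTERN`, the sample lines, and the two adjustments (a word boundary after the unit, and required whitespace around the note dash) are assumptions made for illustration.

```python
import re

# Illustrative copy of the ingredient pattern. Assumed adjustments: a word
# boundary (\b) after the unit group so "g" cannot match the start of "gal"
# or "garlic", and whitespace required on both sides of the note dash.
DEMO_PATTERN = re.compile(
    r"^(?P<qty>(?:\d+\s*\d*\/\d+|\d+\.?\d*))\s*"
    r"(?P<unit>(?:oz|lb|g|kg|ea|gal|qt|cup|tbsp|tsp|pinch|dash|to\s*taste))\b\s*"
    r"(?P<name>[A-Za-z0-9\s\-\(\),]+?)(?:\s+[-–]\s+(?P<note>.+))?$",
    re.IGNORECASE,
)

# Hypothetical lines as they might appear in an extracted text stream.
SAMPLE_LINES = [
    "1 1/2 cup flour",
    "8 oz chicken breast - diced",
    "2 gal whole milk",
    "3 garlic cloves",        # no recognized unit
    "Preheat oven to 350F",   # procedural text
]


def parse_line(line):
    """Return (qty, unit, name, note) for an ingredient line, else None."""
    match = DEMO_PATTERN.match(line.strip())
    if match is None:
        return None
    return (
        match.group("qty"),
        match.group("unit"),
        match.group("name").strip(),
        match.group("note"),
    )


parsed = [parse_line(line) for line in SAMPLE_LINES]
for line, result in zip(SAMPLE_LINES, parsed):
    print(f"{line!r} -> {result}")
```

Lines that return `None` are simply not captured as ingredients. A line such as "3 garlic cloves" has a quantity but no recognized unit, so a production pipeline would likely want to route it to review rather than drop it.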
Phase 2: Schema Enforcement & Quarantine Routing
Raw extracted strings must immediately pass through a strict validation layer. Culinary managers require that ambiguous measurements or missing units trigger a quarantine queue rather than silent data corruption. The validation routine strips formatting artifacts, normalizes whitespace, and applies a deterministic mapping table to convert regional terminology into a unified ingredient dictionary.
Records failing schema constraints are routed to a structured error log with line-level context and PDF page coordinates, enabling rapid manual review without halting the batch process.
import logging
from dataclasses import dataclass
from fractions import Fraction
from typing import Dict, List, Optional, Union

from pydantic import BaseModel, ValidationError, field_validator

logger = logging.getLogger(__name__)


# Re-declared here so the block is self-contained; in production these come
# from a shared module alongside the extractor in the previous block.
@dataclass
class RawIngredient:
    quantity: str
    unit: str
    name: str
    note: Optional[str] = None
    page: int = 0
    line_index: int = 0


class ValidatedIngredient(BaseModel):
    recipe_id: str
    ingredient_name: str
    quantity: float
    unit: str
    yield_count: int = 1
    theoretical_cost: float = 0.0
    source_page: int
    source_line: int

    @field_validator("quantity", mode="before")
    @classmethod
    def normalize_quantity(cls, v: Union[str, int, float]) -> float:
        """Convert fractional strings to deterministic floats."""
        if isinstance(v, (int, float)):
            return float(v)
        try:
            parts = v.split()
            if not parts:
                raise ValueError("empty quantity")
            # Sum whitespace-separated parts so a mixed number such as "1 1/2"
            # becomes 1.5; removing the space would parse it as 11/2.
            return float(sum(Fraction(part) for part in parts))
        except (ValueError, ZeroDivisionError, AttributeError):
            raise ValueError(f"Non-numeric quantity detected: {v}")


def validate_and_quarantine(
    raw_ingredients: List[RawIngredient], recipe_id: str
) -> List[ValidatedIngredient]:
    validated: List[ValidatedIngredient] = []
    quarantine: List[Dict] = []
    for raw in raw_ingredients:
        try:
            validated.append(ValidatedIngredient(
                recipe_id=recipe_id,
                ingredient_name=raw.name,
                quantity=raw.quantity,
                unit=raw.unit,
                source_page=raw.page,
                source_line=raw.line_index
            ))
        except ValidationError as e:
            # Keep page and line context so a reviewer can locate the source row
            quarantine.append({
                "error": str(e),
                "page": raw.page,
                "line": raw.line_index,
                "raw_text": f"{raw.quantity} {raw.unit} {raw.name}"
            })
    if quarantine:
        logger.warning("Quarantined %d records for manual review.", len(quarantine))
        # Persist to structured error log / S3 quarantine bucket in production
    return validated
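The validation code above covers type coercion and quarantine routing; the name cleanup and mapping-table step described at the start of this phase could look roughly like the following sketch. The `CANONICAL_NAMES` table, its entries, and the helper names are hypothetical; a real deployment would load the mapping from a maintained ingredient dictionary.

```python
import re
import unicodedata

# Hypothetical mapping table: regional or vendor terms -> canonical names.
CANONICAL_NAMES = {
    "aubergine": "eggplant",
    "courgette": "zucchini",
    "coriander leaves": "cilantro",
    "spring onion": "green onion",
    "scallions": "green onion",
}


def clean_name(raw: str) -> str:
    """Strip formatting artifacts and normalize whitespace and case."""
    # NFKC folds compatibility characters (non-breaking spaces, ligatures)
    # into their plain equivalents.
    text = unicodedata.normalize("NFKC", raw)
    # Collapse any run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


def canonical_name(raw: str) -> str:
    """Map a cleaned name through the table; unknown names pass through."""
    cleaned = clean_name(raw)
    return CANONICAL_NAMES.get(cleaned, cleaned)


print(canonical_name("  Aubergine\u00a0"))   # eggplant
print(canonical_name("Spring   Onion"))      # green onion
print(canonical_name("Kosher Salt"))         # kosher salt
```

Because the lookup is a plain dictionary, the same input always produces the same output, which keeps this step consistent with the deterministic design of the rest of the pipeline.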
Phase 3: Async Serialization & Downstream Decoupling
Once validated, structured records are serialized to JSON and published to an asynchronous message queue. This decouples extraction from heavy downstream transformations. Food tech developers should configure the pipeline to emit extraction events asynchronously, allowing parallel consumers to handle CSV Bulk Import Automation for legacy vendor files or trigger POS API Polling Strategies to cross-reference extracted theoretical yields against actual sales velocity.
The async architecture prevents memory bottlenecks during high-volume batch processing and enables horizontal scaling of cost calculation workers.
import asyncio
import json
from contextlib import asynccontextmanager
from typing import List

# Production: use an async broker client such as redis.asyncio (redis-py),
# aio_pika, or aiokafka.
# `ValidatedIngredient` is defined in the previous block.


@asynccontextmanager
async def async_queue_publisher(queue_url: str):
    """Mock async queue publisher for demonstration. Replace with Redis/RabbitMQ client."""
    async def publish(msg: str) -> None:
        # Defined as a coroutine so callers can await it like a real client
        print(f"Published: {msg}")

    print(f"Connecting to queue: {queue_url}")
    try:
        yield {"publish": publish}
    finally:
        # Runs even if publishing raises, mirroring a real connection close
        print("Queue connection closed.")


async def publish_validated_records(records: "List[ValidatedIngredient]", queue_url: str) -> None:
    async with async_queue_publisher(queue_url) as queue:
        for record in records:
            payload = json.dumps(record.model_dump())
            await queue["publish"](payload)
            # Yield control to the event loop so other tasks can run between publishes
            await asyncio.sleep(0)
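The decoupling idea can be tried locally before wiring in a broker. The sketch below uses an in-process `asyncio.Queue` as a stand-in for the message queue, with one producer and one consumer; the function names, the sample records, and the `None` sentinel convention are assumptions for illustration, not part of any broker API.

```python
import asyncio
import json

# Hypothetical validated records, already reduced to plain dicts.
SAMPLE_RECORDS = [
    {"recipe_id": "R-100", "ingredient_name": "flour", "quantity": 1.5, "unit": "cup"},
    {"recipe_id": "R-100", "ingredient_name": "whole milk", "quantity": 2.0, "unit": "gal"},
    {"recipe_id": "R-100", "ingredient_name": "kosher salt", "quantity": 0.5, "unit": "tsp"},
]


async def producer(queue: asyncio.Queue, records: list) -> None:
    """Serialize each record and hand it to the queue."""
    for record in records:
        # put() waits when the queue is full, which gives natural backpressure
        await queue.put(json.dumps(record))
    await queue.put(None)  # sentinel: no more records


async def consumer(queue: asyncio.Queue, sink: list) -> None:
    """Stand-in for a downstream worker such as a cost calculator."""
    while True:
        payload = await queue.get()
        if payload is None:
            break
        sink.append(json.loads(payload))


async def run_pipeline(records: list) -> list:
    # A small maxsize bounds memory no matter how many records are produced
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    sink: list = []
    await asyncio.gather(producer(queue, records), consumer(queue, sink))
    return sink


processed = asyncio.run(run_pipeline(SAMPLE_RECORDS))
print(f"Consumed {len(processed)} records")
```

The bounded queue is the design point worth noting: the producer can never run more than two records ahead of the consumer, so memory stays flat regardless of batch size.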
Production Hardening & Memory Optimization
Deploying this pipeline across hundreds of locations requires strict resource governance. Implement the following operational controls to maintain deterministic throughput:
- Stream Processing Over Bulk Loading: Parse PDFs page-by-page rather than loading entire documents into memory. Yield validated records immediately to the async queue to maintain a constant memory footprint regardless of recipe deck size.
- Idempotent Retry Logic: Wrap extraction and publishing routines in exponential backoff handlers. Transient network failures during queue publishing must not trigger duplicate ingredient records. Implement a deduplication key ({recipe_id}_{ingredient_name}_{source_page}_{source_line}) at the consumer level.
- Unit Normalization Hooks: Attach downstream Unit Normalization Scripts as queue consumers. These scripts convert regional units (e.g., #10 can, flat, case) into base weight/volume metrics before theoretical cost calculation.
- Deterministic Logging: Enforce structured JSON logging with correlation IDs. Every extraction event must trace back to the original PDF hash, enabling audit trails for compliance and vendor invoice reconciliation.
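The retry and deduplication controls in the list above can be sketched together. The example below is a minimal illustration, assuming records arrive as plain dicts; `publish_with_backoff`, `DedupConsumer`, and the simulated "lost acknowledgement" failure are hypothetical names and scenarios, not part of any broker library.

```python
import asyncio
import random


def dedup_key(record: dict) -> str:
    """Build the consumer-level deduplication key from the record fields."""
    return "{recipe_id}_{ingredient_name}_{source_page}_{source_line}".format(**record)


async def publish_with_backoff(publish, payload, attempts: int = 4, base_delay: float = 0.01):
    """Retry a publish coroutine with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return await publish(payload)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the failure
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
            await asyncio.sleep(delay)


class DedupConsumer:
    """Accept each record once, keyed by dedup_key."""

    def __init__(self) -> None:
        self.seen = set()
        self.accepted = []

    def handle(self, record: dict) -> bool:
        key = dedup_key(record)
        if key in self.seen:
            return False  # duplicate delivery caused by a retry
        self.seen.add(key)
        self.accepted.append(record)
        return True


async def demo():
    consumer = DedupConsumer()
    calls = {"n": 0}

    async def flaky_publish(record: dict) -> bool:
        calls["n"] += 1
        delivered = consumer.handle(record)  # the message reaches the consumer
        if calls["n"] == 1:
            # Simulated transient failure: the acknowledgement is lost
            raise ConnectionError("ack lost")
        return delivered

    record = {"recipe_id": "R-100", "ingredient_name": "kosher salt",
              "source_page": 2, "source_line": 14}
    await publish_with_backoff(flaky_publish, record)
    return calls["n"], consumer.accepted


publish_calls, accepted = asyncio.run(demo())
print(f"publish attempts: {publish_calls}, records accepted: {len(accepted)}")
```

In this scenario the first publish is delivered but its acknowledgement is lost, so the producer retries; the consumer recognizes the repeated key and keeps a single record. An in-memory set is only suitable for a single worker; shared consumers would need a shared key store.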
For reference on Python’s native regex engine behavior and performance characteristics, consult the official Python re module documentation. When implementing data validation boundaries, align with the Pydantic V2 documentation for strict type coercion and custom validator patterns.
By enforcing deterministic extraction, rigid schema validation, and asynchronous decoupling, multi-unit operators can transform static culinary PDFs into reliable, cost-ready data streams. This pipeline eliminates manual transcription overhead, prevents silent data corruption, and establishes the foundation for real-time menu engineering analytics.