Parsing PDF Menus with PyPDF2 and Regex

Multi-unit operators and culinary managers routinely ingest vendor price sheets, seasonal menu updates, and franchise recipe manuals as unstructured PDFs. Translating these binary documents into deterministic food cost analytics requires a strict ingestion layer that bridges raw text streams and structured costing databases. This guide isolates a discrete pipeline step: extracting menu items, pricing, and category metadata using PyPDF2 and targeted regular expressions. The implementation prioritizes layout-aware text normalization, stateful regex matching, and memory-efficient generators suitable for high-volume batch environments. This extraction module operates as a foundational component within broader Data Ingestion & Recipe Parsing Workflows and must produce clean, schema-compliant records before downstream unit normalization and margin calculation.

1. PyPDF2 Text Extraction & Layout Normalization

PyPDF2 reads PDF objects at the byte level and reconstructs text streams using embedded font dictionaries and content operators. Menu PDFs rarely follow logical reading order; instead, they rely on absolute positioning, multi-column grids, and floating text boxes. A plain page.extract_text() call concatenates strings in internal stream order, which frequently interleaves prices with unrelated descriptions, merges adjacent columns, or splits single menu items across arbitrary line breaks.

To mitigate layout fragmentation without switching to heavier coordinate-extraction libraries, implement a heuristic normalization routine that groups text by visual line breaks and collapses PDF rendering artifacts before regex evaluation:

import PyPDF2
import re
import unicodedata
import logging
from typing import Generator, Tuple

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

def normalize_text_block(text: str) -> str:
    """Strip control characters, normalize ligatures, and collapse whitespace."""
    # NFKC normalization expands typographic ligatures (e.g., U+FB01 becomes "fi") common in vendor PDFs
    text = unicodedata.normalize("NFKC", text)
    # Remove zero-width spaces, soft hyphens, and PDF control characters,
    # but keep \t (\x09) and \n (\x0A): line-by-line parsing depends on both
    text = re.sub(r"[\x00-\x08\x0B-\x1F\x7F-\x9F\u200B-\u200D\uFEFF\u00AD]", "", text)
    # Rejoin hyphenated line breaks caused by column wrapping
    text = re.sub(r"-\s*\n", "", text)
    # Collapse tabs and runs of 2+ spaces into a two-space column gap, which
    # RE_ITEM_LINE uses as its field separator; single spaces and newlines stay
    return re.sub(r"\t|[^\S\n]{2,}", "  ", text).strip()

def extract_pdf_pages_generator(pdf_path: str) -> Generator[Tuple[int, str], None, None]:
    """Yield (page_number, normalized_text) tuples to bound memory usage."""
    try:
        with open(pdf_path, "rb") as f:
            reader = PyPDF2.PdfReader(f, strict=False)
            for idx, page in enumerate(reader.pages):
                raw = page.extract_text()
                if raw:
                    yield idx, normalize_text_block(raw)
    except PyPDF2.errors.PdfReadError as e:
        logging.error(f"Failed to parse {pdf_path}: {e}")

Generator-based iteration keeps only one page of extracted text in memory at a time, which matters when processing 500+ page vendor catalogs. Each yielded string is pre-sanitized to remove soft hyphens, zero-width characters, control characters, and inconsistent whitespace that commonly break regex anchors. The strict=False flag in PdfReader tolerates malformed cross-reference tables common in third-party generated price sheets.
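
To make the effect of the sanitization step concrete, the following standalone sketch applies NFKC folding and invisible-character stripping to a hand-written string. The sample text and the clean helper are illustrative assumptions, not taken from a real vendor file, and the helper covers only a subset of what normalize_text_block does.

```python
import re
import unicodedata

# Hypothetical sample: "fl" ligature (U+FB02), soft hyphen (U+00AD),
# zero-width space (U+200B), and a non-breaking space (U+00A0) before the price.
sample = "Truf\ufb02e Fries\u00ad\u200b\u00a0$8.50"

def clean(text: str) -> str:
    """Fold compatibility characters, then drop invisible ones."""
    # NFKC turns the ligature into "fl" and the non-breaking space into " "
    text = unicodedata.normalize("NFKC", text)
    # NFKC leaves soft hyphens and zero-width characters in place, so strip them
    return re.sub(r"[\u200B-\u200D\uFEFF\u00AD]", "", text)

cleaned = clean(sample)
print(repr(cleaned))  # 'Truffle Fries $8.50'
```

Without this step, a pattern anchored on a literal space before the price would not match the original string, because the character in front of the dollar sign is U+00A0 rather than U+0020.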

2. Deterministic Regex Pattern Engineering

Menu PDFs often exhibit predictable typographic patterns: category headers in uppercase, item names followed by optional descriptions, and right-aligned pricing. A single monolithic regex invites catastrophic backtracking and brittle edge-case failures. Instead, compile small, single-purpose patterns and apply them sequentially within a state machine:

import re
from typing import Dict, Generator, Optional

# Pre-compile patterns for performance and readability
RE_CATEGORY = re.compile(r"^[A-Z][A-Z\s]{3,}$")
RE_PRICE = re.compile(r"(?:\$|USD\s?)?(\d+(?:\.\d{1,2})?)\s*(?:ea|lb|oz|each)?$", re.IGNORECASE)
RE_ITEM_LINE = re.compile(
    r"^(?P<item>[A-Z][\w\s\-&']+?)"
    r"(?:\s{2,}|\t)"
    r"(?P<desc>[^$]*?)"
    r"(?:\s{2,}|\t)"
    r"(?P<price>\$?\d+(?:\.\d{2})?)\s*$"
)

def parse_menu_lines(lines: list[str]) -> Generator[Dict[str, object], None, None]:
    """Stateful parser tracking category context and extracting structured records."""
    current_category = "UNCATEGORIZED"
    
    for line in lines:
        line = line.strip()
        if not line:
            continue
            
        # Detect category headers (all caps, >3 chars, no digits)
        if RE_CATEGORY.match(line):
            current_category = line.upper()
            continue
            
        # Attempt structured item match
        match = RE_ITEM_LINE.match(line)
        if match:
            price_raw = match.group("price")
            price_val = float(re.sub(r"[^\d.]", "", price_raw))
            yield {
                "category": current_category,
                "item_name": match.group("item").strip(),
                "description": match.group("desc").strip() or None,
                "unit_price": price_val,
                "raw_line": line
            }
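        else:
            # Fallback: try to extract a trailing price from unstructured lines
            price_match = RE_PRICE.search(line)
            if price_match:
                # Slice at the match offset; str.split() on the price text would
                # cut at an earlier occurrence of the same digits in the name
                item_name = line[:price_match.start()].strip()
                if not item_name:
                    continue  # bare number with no item text
                yield {
                    "category": current_category,
                    "item_name": item_name,
                    "description": None,
                    "unit_price": float(price_match.group(1)),
                    "raw_line": line
                }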

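Within a single call, the parser carries the active category from line to line and handles missing descriptions gracefully. Category state does not survive between calls, so a section that continues onto the next page falls back to UNCATEGORIZED unless the lines of all pages are passed in together. Separating category detection from item extraction keeps each pattern small and prevents matches from spanning unrelated blocks. The fallback branch is deliberately permissive: because the currency symbol is optional in RE_PRICE, a trailing SKU number or a portion weight such as "12 oz" is read as a price, so records produced by that branch warrant review before costing.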
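
Because parse_menu_lines initialises current_category on every call, a caller that parses page by page loses the active section at each page break. One possible way around this, sketched below, is to move the state into a function that iterates over the pages itself. The parse_pages name, the simplified PRICED_LINE pattern, and the two-page sample are assumptions made for illustration; they stand in for the article's patterns and are not a drop-in replacement.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Simplified stand-ins for the article's patterns (assumptions for this sketch)
CATEGORY = re.compile(r"^[A-Z][A-Z\s]{3,}$")
PRICED_LINE = re.compile(r"^(?P<item>.+?)\s+\$?(?P<price>\d+(?:\.\d{1,2})?)$")

def parse_pages(pages: Iterable[Tuple[int, str]]) -> Iterator[Dict[str, object]]:
    """Yield records while keeping the active category across page boundaries."""
    current_category = "UNCATEGORIZED"  # initialised once, not once per page
    for page_num, text in pages:
        for line in text.split("\n"):
            line = line.strip()
            if not line:
                continue
            if CATEGORY.match(line):
                current_category = line
                continue
            m = PRICED_LINE.match(line)
            if m:
                yield {
                    "category": current_category,
                    "item_name": m.group("item").strip(),
                    "unit_price": float(m.group("price")),
                    "page_source": page_num,
                }

# Hypothetical two-page menu: the STARTERS section continues onto page 1
pages = [
    (0, "STARTERS\nSoup of the Day $6.00"),
    (1, "Calamari $11.50\nMAINS\nRibeye $34"),
]
records = list(parse_pages(pages))
for r in records:
    print(r)
```

In this sample, Calamari sits on the second page but is still filed under STARTERS, and each record keeps the page index it came from.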

3. Schema Validation & Pandas Integration

Extracted dictionaries must be coerced into a rigid schema before ingestion into costing engines. Pandas provides deterministic type casting, missing-value handling, and vectorized validation:

import hashlib

import pandas as pd
from typing import Dict

def build_menu_dataframe(records: list[Dict]) -> pd.DataFrame:
    """Enforce schema compliance and prepare for downstream analytics."""
    df = pd.DataFrame(records)
    
    if df.empty:
        return df
        
    # Enforce strict types
    df["unit_price"] = pd.to_numeric(df["unit_price"], errors="coerce")
    df["category"] = df["category"].astype("string")
    df["item_name"] = df["item_name"].astype("string")
    
    # Standardize whitespace and casing for deduplication
    df["item_name"] = df["item_name"].str.strip().str.title()
    df["category"] = df["category"].str.strip().str.title()
    
    # Drop malformed rows (missing price, missing or empty item name)
    df = df.dropna(subset=["unit_price", "item_name"])
    df = df[df["item_name"].str.len() > 0].copy()
    
    # Add a hash for audit trails that is stable across runs.
    # Built-in hash() is salted per process for strings, so use a hashlib digest.
    price_text = df["unit_price"].map("{:.2f}".format).astype("string")
    key = df["category"] + "|" + df["item_name"] + "|" + price_text
    df["record_hash"] = key.map(
        lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest()[:16]
    )
    
    return df

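This pipeline ensures that downstream PDF Recipe Extraction Pipelines receive normalized, type-safe inputs. The record_hash column supports deduplication and idempotent upserts into relational costing databases, preventing duplicate margin calculations during batch re-runs. Both uses require a hash that is stable across interpreter runs, such as a hashlib digest; Python's built-in hash() is salted per process for strings unless PYTHONHASHSEED is fixed.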
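
An upsert key is only useful if the same record produces the same value on every run. The sketch below shows one way to derive such a key with hashlib, independent of pandas. The record_key name, the pipe-delimited payload, and the 16-character truncation are illustrative choices, not a fixed convention.

```python
import hashlib

def record_key(category: str, item_name: str, unit_price: float) -> str:
    """Return a hex digest that is identical across interpreter runs."""
    # Fixed two-decimal formatting so 11.5 and 11.50 produce the same key
    payload = f"{category}|{item_name}|{unit_price:.2f}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

key_a = record_key("Starters", "Calamari", 11.5)
key_b = record_key("Starters", "Calamari", 11.50)
key_c = record_key("Starters", "Calamari", 12.0)

print(key_a == key_b)  # True: same record, same key
print(key_a == key_c)  # False: a price change yields a new key
```

Including the price in the payload means a price update is treated as a new record. If price changes should overwrite the existing row instead, the key would be built from category and item name only.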

4. Operational Reliability in Batch Environments

Production menu parsing must tolerate malformed pages, encrypted vendor files, and inconsistent formatting across franchise regions. Wrap the extraction logic in a fault-tolerant batch runner with explicit retry boundaries and structured logging:

import logging
from pathlib import Path
from tenacity import retry, stop_after_attempt, wait_exponential

# `extract_pdf_pages_generator`, `parse_menu_lines`, and `build_menu_dataframe`
# are defined in the earlier blocks of this article.

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def process_vendor_catalog(pdf_path: str, output_dir: str) -> Path:
    """Execute full extraction pipeline with memory-safe streaming."""
    output_file = Path(output_dir) / f"{Path(pdf_path).stem}_parsed.csv"
    records = []
    
    for page_num, text in extract_pdf_pages_generator(pdf_path):
        lines = [l for l in text.split("\n") if l.strip()]
        for record in parse_menu_lines(lines):
            record["page_source"] = page_num
            records.append(record)
            
    df = build_menu_dataframe(records)
    df.to_csv(output_file, index=False, encoding="utf-8-sig")
    logging.info(f"Wrote {len(df)} validated records to {output_file}")
    return output_file

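The tenacity decorator retries with exponential backoff on any exception that propagates out of the function, which covers transient I/O failures such as an OSError from open(). A PdfReadError never triggers a retry, because extract_pdf_pages_generator logs and swallows it. Writing with utf-8-sig adds a byte-order mark so that Excel detects the file as UTF-8 instead of garbling non-ASCII item names. For multi-unit deployments, integrate this module with async batch schedulers to parallelize catalog ingestion across regional vendor feeds. Page text is held one page at a time, but the records list and the resulting DataFrame grow linearly with the number of extracted items, so peak memory is O(n) in record count rather than constant.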
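
If peak memory matters more than vectorized validation, records can be written as they are produced instead of being collected in a list first. The following sketch uses only the standard csv module; the stream_records_to_csv name, the FIELDS list, and the sample records are assumptions for illustration, and the approach gives up the pandas type coercion described earlier.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, TextIO

# Assumed output columns; keys outside this list are ignored when writing
FIELDS = ["category", "item_name", "unit_price", "page_source"]

def stream_records_to_csv(records: Iterable[Dict[str, object]], out: TextIO) -> int:
    """Write records one at a time and return the number of rows written."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for record in records:  # consumes the iterable lazily, no intermediate list
        writer.writerow(record)
        count += 1
    return count

def sample_records() -> Iterator[Dict[str, object]]:
    """Hypothetical stand-in for a parser generator."""
    yield {"category": "Starters", "item_name": "Calamari",
           "unit_price": 11.5, "page_source": 1, "raw_line": "Calamari $11.50"}
    yield {"category": "Mains", "item_name": "Ribeye",
           "unit_price": 34.0, "page_source": 1, "raw_line": "Ribeye $34"}

buffer = io.StringIO()  # stands in for an open file handle
rows_written = stream_records_to_csv(sample_records(), buffer)
print(rows_written)
print(buffer.getvalue())
```

With a real file, opening it as open(path, "w", newline="", encoding="utf-8-sig") keeps the same Excel-friendly encoding while holding only one record in memory at a time.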

By enforcing deterministic regex boundaries, stateful category tracking, and strict pandas schema validation, this extraction layer removes much of the ambiguity that typically derails automated food cost modeling. The resulting structured dataset provides a reliable foundation for unit conversion, yield factor application, and real-time margin tracking across enterprise-scale restaurant portfolios.