
Parsing PDF Menus with PyPDF2 and Regex

This page walks a food-tech developer through the concrete task of turning a born-digital menu or vendor price PDF into typed, cost-ready records using pypdf (the maintained successor to PyPDF2) and a small set of atomic regular expressions. It is the hands-on implementation of Phase 1 described in the broader PDF Recipe Extraction Pipelines guide — read that first for the architecture rationale (why direct text extraction beats OCR), then follow the numbered steps here to stand up a working parser. The extractor sits at the head of the wider Data Ingestion & Recipe Parsing Workflows architecture, so everything it emits must be schema-clean before downstream unit normalization and margin math trust it.

The failure this prevents is silent numeric corruption: a price interleaved from the wrong column, a $12.00 that was really 1 2.00 oz, or a menu item swallowed by a page break. The parser below fails loudly at ingestion — attaching a page number to every reject — rather than shipping a plausible-but-wrong cost into a report.

Prerequisites and Data Contract

Pin these versions and agree on the contract before working through the steps. The parser is deterministic only when both the input shape and the output schema are fixed.

Runtime: Python 3.11+, pypdf==5.* (PdfReader is a drop-in replacement for PyPDF2.PdfReader), pydantic==2.*, pandas==2.2.*, tenacity==9.*.
Input contract: a born-digital PDF with selectable text — a menu, a distributor price sheet, or a corporate item guideline. Category headers are uppercase; item lines follow ITEM NAME␠␠[description]␠␠$PRICE with column gaps of two-plus spaces or a tab. Scanned image-only PDFs are out of contract and must divert to the OCR fallback covered in the parent module.
Monetary rule: every price is carried as decimal.Decimal, never float. Binary floats cannot represent common cents such as 0.10, and that error compounds when downstream BOM cost roll-up arithmetic multiplies by yield across thousands of SKUs.
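
To make the monetary rule concrete, here is a minimal sketch comparing float and Decimal arithmetic on price-like values. The amounts are made-up illustration values, not figures from any real price sheet.

```python
from decimal import Decimal

# Two price tokens as they would appear in the PDF text (illustrative values).
float_sum = 0.10 + 0.20
decimal_sum = Decimal("0.10") + Decimal("0.20")

# The float sum is not exactly thirty cents; the Decimal sum is.
float_is_exact = float_sum == 0.30
decimal_is_exact = decimal_sum == Decimal("0.30")

# Build Decimal from the text token, never from a float:
# Decimal(0.10) inherits the binary approximation of the float.
from_text = Decimal("0.10")
from_float = Decimal(0.10)

# Scaling by a quantity keeps the cents intact only on the Decimal path.
float_scaled = 1.10 * 3
decimal_scaled = Decimal("1.10") * 3

print(float_is_exact, decimal_is_exact, from_text == from_float)
print(float_scaled, decimal_scaled)
```

The same reasoning is why the price is passed to Decimal as a string token throughout the steps that follow.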

The output contract is one record per menu item, shaped like this table (the units stay raw here — canonicalization to base weight/volume happens downstream in the yield factor calculation frameworks):

Field	Type	Constraint
`category`	text	non-empty, upper-cased header context
`item_name`	text	non-empty after title-casing
`description`	text or null	optional
`unit_price`	Decimal	`>= 0`, parsed from the raw price token
`source_page`	int	zero-based page index for lineage
`raw_line`	text	verbatim source line for audit

Step-by-Step Implementation

Step 1 — Pin the library and normalize text blocks

PyPDF2 was merged back into pypdf; new pipelines install pypdf and the patterns below apply unchanged to both. Menu PDFs carry typographic ligatures, zero-width spaces, and hyphenated column wraps that break regex anchors, so normalize every block before matching.

from __future__ import annotations

import logging
import re
import unicodedata

logger = logging.getLogger(__name__)

# Control and invisible characters, excluding \t, \n and \r, which carry layout.
_CONTROL = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\u200B-\u200D\uFEFF\u00AD]")
_HYPHEN_WRAP = re.compile(r"-\s*\n")
_INLINE_WS = re.compile(r"\t|[^\S\n]{2,}")  # a tab or a run of 2+ spaces is a column gap


def normalize_block(text: str) -> str:
    """Strip control chars, fold ligatures, rejoin wraps, canonicalize column gaps."""
    text = unicodedata.normalize("NFKC", text)  # ﬁ/ﬂ ligatures -> fi/fl
    text = _CONTROL.sub("", text)
    text = _HYPHEN_WRAP.sub("", text)           # rejoin "mozza-\nrella"
    # Reduce every column gap to exactly two spaces so RE_ITEM can still find it;
    # single spaces inside names and descriptions are left alone.
    return _INLINE_WS.sub("  ", text).strip()
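
NFKC folds compatibility characters such as ligatures and non-breaking spaces, but curly quotes and en or em dashes have no compatibility mapping and pass through unchanged. The sketch below shows one way to fold that punctuation explicitly. The names fold_punctuation and _PUNCT_MAP are hypothetical, and the mapping covers only a handful of common characters.

```python
import unicodedata

# Hypothetical mapping of typographic punctuation to ASCII equivalents.
_PUNCT_MAP = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201C": '"',   # left double quote
    "\u201D": '"',   # right double quote
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def fold_punctuation(text: str) -> str:
    """Apply NFKC, then map the punctuation that NFKC leaves untouched."""
    return unicodedata.normalize("NFKC", text).translate(_PUNCT_MAP)


# Curly apostrophe, fi ligature, en dash, and non-breaking space in one name.
sample = "Chef\u2019s \ufb01let \u2013 8\u00a0oz"
print(fold_punctuation(sample))
```

Whether to fold dashes to a hyphen depends on the template: if a dash separates an item from its description, it may be better to treat it as a delimiter instead.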

Step 2 — Stream pages with a memory-bounded generator

Yield one normalized page at a time so a 500-page vendor catalog never materializes in RAM. strict=False tolerates the malformed cross-reference tables common in third-party price sheets.

from collections.abc import Iterator

from pypdf import PdfReader
from pypdf.errors import PdfReadError


def stream_pages(pdf_path: str) -> Iterator[tuple[int, str]]:
    """Yield (page_index, normalized_text); skip pages with no extractable text."""
    try:
        reader = PdfReader(pdf_path, strict=False)
    except PdfReadError as exc:
        logger.error("cannot open %s: %s", pdf_path, exc)
        return
    for index, page in enumerate(reader.pages):
        raw = page.extract_text() or ""
        if raw.strip():
            yield index, normalize_block(raw)
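
Because stream_pages skips pages without text, a caller cannot tell a scanned PDF from an empty one by looking at the yielded pages alone. The sketch below is one possible coverage check, assuming the caller knows the total page count (for example from len(reader.pages)). The names text_coverage and needs_ocr and the 0.5 threshold are hypothetical.

```python
from collections.abc import Iterable


def text_coverage(total_pages: int, yielded_pages: Iterable[tuple[int, str]]) -> float:
    """Fraction of pages that produced extractable text (0.0 when there are no pages)."""
    if total_pages <= 0:
        return 0.0
    pages_with_text = sum(1 for _index, text in yielded_pages if text.strip())
    return pages_with_text / total_pages


def needs_ocr(
    total_pages: int,
    yielded_pages: Iterable[tuple[int, str]],
    threshold: float = 0.5,
) -> bool:
    """Flag the document for the OCR fallback when too few pages carry text.

    The 0.5 threshold is an assumed starting point, not a measured value.
    """
    return text_coverage(total_pages, yielded_pages) < threshold


# Illustrative input: a 4-page PDF where only page 0 yielded text.
pages = [(0, "STARTERS\nSoup  $6.00")]
print(needs_ocr(4, pages))
```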

Step 3 — Compile atomic regex patterns

Monolithic patterns invite catastrophic backtracking. Compile small, single-purpose patterns once and apply them in sequence. Anchoring the price to the end of the line keeps leading SKU numbers and mid-line portion weights from being read as money; a bare number at the very end of a line can still match, so ambiguous lines belong in quarantine.

RE_CATEGORY = re.compile(r"^[A-Z][A-Z\s&]{3,}$")          # ALL-CAPS header, no digits
RE_PRICE = re.compile(
    r"(?:\$|USD\s?)?(\d+(?:\.\d{1,2})?)\s*(?:ea|lb|oz|each)?$", re.IGNORECASE
)
RE_ITEM = re.compile(
    r"^(?P<item>[A-Z][\w\s\-&']+?)"            # item name, lazy up to the first gap
    r"(?:\s{2,}|\t)"                           # column gap
    r"(?P<desc>(?:[^$]*?(?:\s{2,}|\t))?)"      # optional description plus its trailing gap
    r"(?P<price>\$?\d+(?:\.\d{2})?)\s*$"       # right-aligned price
)
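
A quick way to see what the header and fallback price patterns accept is to run them over a few hand-written lines. The sample lines below are invented, and the two patterns are copied here only so the snippet runs on its own.

```python
import re

# Copies of the header and fallback price patterns from this step.
RE_CATEGORY = re.compile(r"^[A-Z][A-Z\s&]{3,}$")
RE_PRICE = re.compile(
    r"(?:\$|USD\s?)?(\d+(?:\.\d{1,2})?)\s*(?:ea|lb|oz|each)?$", re.IGNORECASE
)

samples = [
    "SOUPS & SALADS",             # section header
    "BBQ",                        # too short to count as a header
    "House Lager 4.5",            # bare trailing number
    "Ribeye 6 oz",                # portion weight, not a price
    "Banquet Package $1,299.00",  # thousands separator
]

results = {}
for line in samples:
    is_header = bool(RE_CATEGORY.match(line))
    price = RE_PRICE.search(line)
    results[line] = (is_header, price.group(1) if price else None)
    print(f"{line!r}: header={is_header} price={results[line][1]}")
```

Run as written, the last two lines report 6 and 299.00 as prices. Neither is what the source line means, which is why fallback matches deserve a second check before they are trusted.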

Step 4 — Parse lines with a stateful category machine

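Track the current category across the lines of a page, and emit raw candidate dicts without coercing types here. As written, parse_page resets the category to UNCATEGORIZED at the start of every page, so a section that continues across a page break needs the caller to carry the last category forward. Separating detection from validation keeps the regex engine from greedy-matching across unrelated blocks and lets the next step own every rejection.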

def parse_page(page_no: int, text: str) -> Iterator[dict[str, object]]:
    """Emit raw candidate records; the current category is carried as state."""
    category = "UNCATEGORIZED"
    for line in (raw.strip() for raw in text.split("\n")):
        if not line:
            continue
        if RE_CATEGORY.match(line):
            category = line.upper()
            continue
        item = RE_ITEM.match(line)
        if item:
            yield {
                "category": category,
                "item_name": item.group("item"),
                "description": item.group("desc").strip() or None,
                "unit_price": item.group("price"),
                "source_page": page_no,
                "raw_line": line,
            }
            continue
        price = RE_PRICE.search(line)  # fallback for unstructured lines
        if price:
            yield {
                "category": category,
                "item_name": line[: price.start()].strip(),
                "description": None,
                "unit_price": price.group(1),
                "source_page": page_no,
                "raw_line": line,
            }
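
parse_page starts every page at UNCATEGORIZED, so a section that continues past a page break loses its header. One way to carry the header across pages is to hold the category outside the per-page loop. The sketch below uses a hypothetical helper, lines_with_category, that only tags each line with its category; the item and price matching from parse_page would then run on the tagged lines.

```python
import re
from collections.abc import Iterable, Iterator

RE_CATEGORY = re.compile(r"^[A-Z][A-Z\s&]{3,}$")  # same header rule as Step 3


def lines_with_category(
    pages: Iterable[tuple[int, str]],
    default: str = "UNCATEGORIZED",
) -> Iterator[tuple[int, str, str]]:
    """Yield (page_no, category, line) with the category carried across pages."""
    category = default  # set once per document, not once per page
    for page_no, text in pages:
        for line in (raw.strip() for raw in text.split("\n")):
            if not line:
                continue
            if RE_CATEGORY.match(line):
                category = line
                continue
            yield page_no, category, line


# Illustrative two-page input: page 1 opens with an item that belongs to STARTERS.
pages = [
    (0, "STARTERS\nSoup of the Day  $6.00"),
    (1, "Garlic Bread  $4.50\nMAINS\nRibeye  $32.00"),
]
tagged = list(lines_with_category(pages))
for row in tagged:
    print(row)
```

Because the category lives in the generator's local state, a second catalog gets a fresh default simply by calling the function again.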

Step 5 — Validate into a typed schema and quarantine rejects

A Pydantic v2 model is the gate. It coerces the price token to Decimal, rejects empty item names, and — crucially — never silently drops a bad line. Rejects go to a quarantine list carrying their page and the validation error, exactly as the parent module’s quarantine routing prescribes.

from collections.abc import Iterable
from decimal import Decimal, InvalidOperation

from pydantic import BaseModel, ConfigDict, ValidationError, field_validator


class MenuRecord(BaseModel):
    model_config = ConfigDict(str_strip_whitespace=True, frozen=True)

    category: str
    item_name: str
    description: str | None
    unit_price: Decimal
    source_page: int
    raw_line: str

    @field_validator("unit_price", mode="before")
    @classmethod
    def _coerce_price(cls, value: object) -> Decimal:
        try:
            return Decimal(re.sub(r"[^\d.]", "", str(value)))
        except InvalidOperation as exc:
            raise ValueError(f"unparseable price: {value!r}") from exc

    @field_validator("item_name", "category")
    @classmethod
    def _non_empty(cls, value: str) -> str:
        if not value.strip():
            raise ValueError("empty required field")
        return value


def validate_records(
    candidates: Iterable[dict[str, object]],
) -> tuple[list[MenuRecord], list[dict[str, object]]]:
    valid: list[MenuRecord] = []
    quarantine: list[dict[str, object]] = []
    for candidate in candidates:
        try:
            valid.append(MenuRecord.model_validate(candidate))
        except ValidationError as exc:
            candidate["_errors"] = exc.errors(include_url=False)
            quarantine.append(candidate)
    return valid, quarantine
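
The validator's character filter keeps only digits and dots, which also discards a minus sign or a misplaced comma without a trace. Where the contract's >= 0 constraint needs to be enforced rather than assumed, a stricter coercion could look like the sketch below. The name strict_price and the accepted token shape are assumptions for illustration, not part of the model above.

```python
import re
from decimal import Decimal

# Hypothetical token shape: optional minus, optional $, digits with optional
# well-formed grouping commas, and at most two decimal places.
_TOKEN = re.compile(r"^(-)?\$?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?$")


def strict_price(token: str) -> Decimal:
    """Parse a price token, rejecting negatives and malformed tokens outright."""
    match = _TOKEN.match(token.strip())
    if match is None:
        raise ValueError(f"unparseable price: {token!r}")
    sign, whole, frac = match.groups()
    if sign:
        raise ValueError(f"negative price: {token!r}")
    # Grouping commas are removed only after the shape has been verified.
    return Decimal(whole.replace(",", "") + (frac or ""))


print(strict_price("$1,299.00"), strict_price("12.5"))
```

Raising ValueError keeps the function usable from inside a Pydantic validator, where it would surface as a validation error and route the candidate to quarantine.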

Step 6 — Assemble a deduplicated frame with exact Decimal money

Build the frame from validated records only. Keep unit_price as text so the exact Decimal survives the CSV round-trip, and derive the dedup key with a vectorized hash — no apply, no row-by-row iteration.

import pandas as pd


def build_menu_frame(records: list[MenuRecord]) -> pd.DataFrame:
    if not records:
        return pd.DataFrame()
    df = pd.DataFrame([r.model_dump() for r in records])
    df["unit_price"] = df["unit_price"].astype("string")  # keep Decimal exact as text
    df["item_name"] = df["item_name"].str.strip().str.title()
    df["category"] = df["category"].str.strip().str.title()
    # order-independent, vectorized dedup key for idempotent upserts
    key = df["category"].str.cat([df["item_name"], df["unit_price"]], sep="|")
    df["record_hash"] = pd.util.hash_pandas_object(key, index=False).astype("uint64")
    return df.drop_duplicates(subset="record_hash").reset_index(drop=True)

Step 7 — Wrap the run in a fault-tolerant batch runner

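Compose the streaming generator, parser, validator, and frame builder behind exponential backoff. The utf-8-sig encoding writes a UTF-8 byte-order mark at the start of the file so that Excel and BI consumers detect the encoding instead of garbling non-ASCII item names. Page text is never held for more than one page at a time; memory grows only with the validated and quarantined record lists. Note that stream_pages logs and swallows PdfReadError, so the retry fires only on exceptions that escape process_catalog, such as an OSError while opening the PDF or writing the CSV.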

from pathlib import Path

from tenacity import retry, stop_after_attempt, wait_exponential


@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def process_catalog(pdf_path: str, out_dir: str) -> Path:
    out_file = Path(out_dir) / f"{Path(pdf_path).stem}_parsed.csv"
    candidates = (
        record
        for page_no, text in stream_pages(pdf_path)
        for record in parse_page(page_no, text)
    )
    valid, quarantined = validate_records(candidates)
    if quarantined:
        logger.warning("quarantined %d line(s) from %s", len(quarantined), pdf_path)
    frame = build_menu_frame(valid)
    frame.to_csv(out_file, index=False, encoding="utf-8-sig")
    logger.info("wrote %d validated record(s) to %s", len(frame), out_file)
    return out_file
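
process_catalog logs only the number of quarantined lines, so the rejects themselves are gone once the function returns. If they need to survive for review, one option is to write them beside the CSV as JSON Lines. The helper write_quarantine and the file name are hypothetical; the record shape is the Step 4 candidate dict plus the _errors key added in Step 5.

```python
import json
import tempfile
from pathlib import Path


def write_quarantine(quarantine: list[dict[str, object]], out_file: Path) -> int:
    """Write one JSON object per rejected candidate; return the number written."""
    with out_file.open("w", encoding="utf-8") as handle:
        for reject in quarantine:
            # default=str covers values json cannot encode natively, such as
            # Decimal prices or exception objects nested in the error details.
            handle.write(json.dumps(reject, default=str, ensure_ascii=False) + "\n")
    return len(quarantine)


# Illustrative reject: a price-only line that produced an empty item name.
rejects = [{
    "category": "MAINS",
    "item_name": "",
    "description": None,
    "unit_price": "12.00",
    "source_page": 3,
    "raw_line": "$12.00",
    "_errors": [{"type": "value_error", "loc": ["item_name"], "msg": "empty required field"}],
}]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "menu_quarantine.jsonl"
    written = write_quarantine(rejects, target)
    first = json.loads(target.read_text(encoding="utf-8").splitlines()[0])

print(written, first["source_page"])
```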

Verification and Validation

Confirm each guarantee before trusting the output downstream.

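Rejects are counted, not lost. Feed validate_records a candidate whose price token cannot be parsed (for example a dict with unit_price set to "market") and assert that it lands in quarantined, not in the frame: valid, quarantined = validate_records([...]); assert len(quarantined) == 1. A raw line with no price at all, such as "BADITEM no price here", never becomes a candidate in parse_page, so it cannot be used for this check.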
Money stays exact. After build_menu_frame, the price column must still be text so the Decimal round-trips: assert frame["unit_price"].map(type).eq(str).all(). The first float anywhere on this column is a defect.
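Category state is scoped to a page. parse_page resets category to UNCATEGORIZED on every call, so in a two-page fixture where page 2 has no header, the items on page 2 come back as UNCATEGORIZED. If the template continues sections across page breaks, carry the last category forward in the caller and assert that the items on page 2 inherit the last category of page 1.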
Dedup actually dedups. assert frame["record_hash"].is_unique — a duplicate hash means two identical (category, item, price) rows survived and would double-count in a re-run.
A clean run logs one line. Watch for wrote N validated record(s) with a non-zero N and a quarantine warning of zero (or an explained count). Diff N against a header-declared item count where the vendor provides one.

Gotchas and Edge Cases

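IEEE-754 drift on the price path. The moment a price is read into a pandas float column, even for a quick sanity sum, sub-cent error can accumulate. Parse straight to Decimal in Step 5, keep the value as text through the CSV, and convert back to Decimal before any arithmetic; exactness cannot be recovered once a float has been involved.
Portion weights read as prices. A line ending ... 6 oz trips the fallback: RE_PRICE is end-anchored, and its optional ea|lb|oz|each group consumes the unit token, so the search matches and group 1 captures 6 as if it were $6.00. The anchor protects only against numbers earlier in the line. Require a currency marker ($ or USD) before accepting a fallback match that carries a unit suffix, and quarantine anything ambiguous rather than guessing.
Thousands separators. A price like 1,299.00 is not rejected. RE_ITEM misses the line because of the comma, and the fallback RE_PRICE.search then matches only the trailing 299.00, which passes validation as a plausible but wrong price. For high-ticket catering sheets, strip grouping commas between digits in normalize_block before matching, or widen both patterns to accept them. Decimal itself rejects a string containing a comma, so remove it before conversion.
Short all-caps false headers. RE_CATEGORY requires at least four characters and no digits, so three-letter tokens such as BBQ or IPA are already rejected, but a longer all-caps item name on its own line (WINGS, HOUSE SALAD) still matches as a category header. Add an allow-list of real section names if a specific template keeps mis-firing.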
Empty-text (scanned) pages. An image-only page yields no text and stream_pages skips it, which can make a full scanned deck look like an empty-but-successful run. Flag any PDF whose page count greatly exceeds its extracted-line count and route it to the OCR fallback in PDF Recipe Extraction Pipelines instead of accepting the empty result.
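Ligature and encoding drift in names. Design-software typography breaks joins against the ingredient dictionary and against POS taxonomy mappings. NFKC in Step 1 folds ligatures and non-breaking spaces, but it leaves curly quotes and en or em dashes unchanged, and a curly apostrophe is outside the character class RE_ITEM allows in an item name. Map that punctuation to ASCII explicitly, and quarantine any name still carrying control characters.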
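
The allow-list suggested for false headers can be sketched as a second check layered on the header pattern. The section names in KNOWN_SECTIONS are placeholders for one imagined template, and is_header is a hypothetical helper, not a function defined elsewhere on this page.

```python
import re

RE_CATEGORY = re.compile(r"^[A-Z][A-Z\s&]{3,}$")  # header rule from Step 3

# Hypothetical allow-list for one menu template; real section names will differ.
KNOWN_SECTIONS = frozenset({"STARTERS", "MAINS", "SIDES", "DESSERTS", "WINE & BEER"})


def is_header(line: str, allow_list: frozenset[str] = KNOWN_SECTIONS) -> bool:
    """Treat a line as a header only if it looks like one and is a known section."""
    candidate = " ".join(line.split())  # collapse internal whitespace before lookup
    return bool(RE_CATEGORY.match(candidate)) and candidate in allow_list


for line in ("MAINS", "WINE  &  BEER", "HOUSE SALAD", "BBQ"):
    print(f"{line!r}: {is_header(line)}")
```

An all-caps line that fails the allow-list is then free to fall through to the item and price patterns instead of silently changing the category.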

Frequently Asked Questions

Should I install PyPDF2 or pypdf?

Install pypdf. PyPDF2 is no longer developed — the project was merged back into pypdf, and pypdf.PdfReader is a drop-in replacement with a near-identical API. Every pattern on this page runs unchanged against both, so there is no migration cost beyond swapping the import.

Why regex over a coordinate-aware library like pdfplumber?

For predictable, text-based menu and price PDFs, regex on a normalized text stream is faster, dependency-light, and easy to audit line by line. Reach for pdfplumber or PyMuPDF when layout is genuinely two-dimensional — dense multi-column grids or tables where column boundaries carry meaning that a linear text stream loses. The parent module weighs that trade-off in full; this page is the direct-text path.

How do I stop prices from becoming floats?

Parse the price token to decimal.Decimal inside the Pydantic validator and store the column as pandas string so the exact value survives serialization. The only place a numeric type is applied is the downstream database, using NUMERIC rather than a binary float, so no intermediate step can introduce rounding error.
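
The text round trip can be checked without pandas. The sketch below writes two illustrative prices as text with the standard csv module and rebuilds Decimal values from the strings it reads back. With pandas, the equivalent precaution is to pass dtype={"unit_price": "string"} to read_csv, since a numeric-looking column is otherwise inferred as float64.

```python
import csv
import io
from decimal import Decimal

# Write prices as text, the way the parsed CSV stores them (illustrative rows).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["item_name", "unit_price"])
writer.writeheader()
writer.writerow({"item_name": "Soup Of The Day", "unit_price": str(Decimal("6.10"))})
writer.writerow({"item_name": "Ribeye", "unit_price": str(Decimal("32.00"))})

# Read back: csv yields strings, so Decimal sees the original digits.
buffer.seek(0)
prices = [Decimal(row["unit_price"]) for row in csv.DictReader(buffer)]
total = sum(prices, Decimal("0"))
print(total)
```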

What happens to a line that matches no pattern?

Lines that fail RE_ITEM fall through to the price-only fallback. If that also misses, the line is not emitted at all: it never reaches validate_records and leaves no quarantine entry, so count or log unmatched lines in parse_page if that loss matters. Lines that emit a candidate but fail validation (empty name, unparseable price) are added to the quarantine list with their page and error, so a human can inspect them. No candidate is silently coerced.

Will this handle a 500-page vendor catalog without exhausting memory?

Yes. stream_pages is a generator that yields one normalized page at a time, and the candidate pipeline is a generator expression, so only a single page’s text plus the accumulating record lists occupy RAM. For truly large fleets, dispatch each catalog through the async batch processing workflow so parsing never blocks calculation.

Up: PDF Recipe Extraction Pipelines — the parent module whose Phase 1 this page implements, including the OCR fallback path.
Data Ingestion & Recipe Parsing Workflows — the full ingestion architecture this extractor feeds.
CSV Bulk Import Automation — the parallel ingestion vector for spreadsheet-sourced menus.
Async Batch Processing Workflows — fan-out execution for high-volume, multi-catalog runs.
Yield Factor Calculation Frameworks — the downstream unit canonicalization these raw records hand off to.
Variance Mapping Methodologies — where parsed costs are diffed against a baseline.

For deeper reference, consult the official pypdf documentation, the Python re module documentation, and the Pydantic V2 documentation for strict coercion and custom validators.