
ENH: read_csv raises ParserError or hangs on certain malformed inputs #63599

@rmhowe425

Description


Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

When reading certain malformed CSV bytes, pandas.read_csv either raises a ParserError or takes a long time to process. While this does not crash Python or cause a security issue, it may indicate a performance edge case or area where the parser could handle extreme inputs more gracefully.

Reproduction steps
from io import BytesIO
from pandas import read_csv

data = b"\x09\x00\x31\x2d\x2e\x23\x51\x00\x61\x2c\x00\x22\x0d\x31\x0d\x0d\x20\x0d\x3a\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0a\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x2c\x21\x0d\x0d\x0d\x0d\x0d\x0d\x69\x6e\x66\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x2c\x0d\x0d\x20\x20\x2c\x22\x22\x22\x2c\x0a\x2c\x2c"

read_csv(BytesIO(data), encoding='latin1')

Observed behavior
ParserError: Buffer overflow caught - possible malformed input file is raised.
In fuzz testing, the parser may also run for a long time on certain related inputs (hitting the fuzzer's timeout).

Expected behavior / suggestion
It’s acceptable to raise ParserError on malformed inputs.

Consider whether extremely long sequences of control characters could be handled more efficiently or rejected faster to improve robustness.
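Until the parser itself rejects such inputs faster, callers can contain the failure. A minimal caller-side sketch (the `safe_read_csv` wrapper name is hypothetical, not a pandas API) that treats malformed input as "no data" instead of letting ParserError propagate:

```python
from io import BytesIO

import pandas as pd
from pandas.errors import ParserError


def safe_read_csv(buf, **kwargs):
    # Hypothetical wrapper: swallow ParserError from malformed bytes and
    # signal failure with None rather than an exception.
    try:
        return pd.read_csv(buf, **kwargs)
    except ParserError:
        return None


# Well-formed input still parses normally:
df = safe_read_csv(BytesIO(b"a,b\n1,2\n"))
```

Note this only contains the ParserError case; it does nothing about the slow-parse/timeout case, which is what the thresholds below are aimed at.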

Impact
No crash, no memory corruption.
Not security-relevant (not CVE-worthy).
Mostly affects fuzzing / extreme edge-case inputs.


Feature Description

  1. Define thresholds for pathological sequences:

    • Max consecutive control characters (e.g., \0, \r, \n, \t) in a row.
    • Max line length (already exists in many parsers, but can be tuned).
    • Max number of embedded quotes without separator.
  2. Scan input before tokenization (optional fast path):

    • Walk the input buffer, count consecutive control characters or other unusual patterns.
    • If thresholds exceeded → raise ParserError immediately.
  3. Modify _tokenize_helper / tokenizer logic:

    • Track how long tokenization has taken per unit (for fuzzing / long input detection).
    • If time exceeds a configurable limit → raise a timeout ParserError.
    • Optionally: track recursion depth for nested quotes/escapes.
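Step 2 (the optional pre-scan fast path) could look roughly like the sketch below. The `prescan` helper and the `max_run` threshold are placeholders for illustration, not existing pandas internals; a real implementation would live in the C tokenizer and use a tuned, configurable limit:

```python
def prescan(data: bytes, max_run: int = 1024) -> None:
    """Reject inputs with pathologically long runs of control characters.

    Hypothetical helper for the proposed fast path: walk the buffer once,
    count consecutive control bytes (\0, \t, \n, \r), and fail fast when
    the run exceeds a threshold, before tokenization starts.
    """
    control = {0x00, 0x09, 0x0A, 0x0D}
    run = 0
    for byte in data:
        if byte in control:
            run += 1
            if run > max_run:
                raise ValueError(
                    f"malformed input: more than {max_run} "
                    "consecutive control characters"
                )
        else:
            run = 0


# Normal CSV bytes pass the scan; a long run of \r is rejected up front.
prescan(b"a,b\n1,2\n")
```

The single linear pass keeps the fast path O(n) with no allocation, so well-formed files pay almost nothing for the check.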

Alternative Solutions

N/A

Additional Context

No response

Labels

Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)
