
ENH: read_csv raises ParserError or hangs on certain malformed inputs #63599

@rmhowe425

Description


Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

When reading certain malformed CSV bytes, pandas.read_csv either raises a ParserError or takes a long time to process. While this does not crash Python or cause a security issue, it may indicate a performance edge case or area where the parser could handle extreme inputs more gracefully.

Reproduction steps
from io import BytesIO
from pandas import read_csv

data = b"\x09\x00\x31\x2d\x2e\x23\x51\x00\x61\x2c\x00\x22\x0d\x31\x0d\x0d\x20\x0d\x3a\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0a\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x2c\x21\x0d\x0d\x0d\x0d\x0d\x0d\x69\x6e\x66\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x2c\x0d\x0d\x20\x20\x2c\x22\x22\x22\x2c\x0a\x2c\x2c"

read_csv(BytesIO(data), encoding='latin1')

Observed behavior
ParserError: Buffer overflow caught - possible malformed input file is raised.
In fuzz testing, the parser may also run for a long time on certain related inputs (hitting the fuzzer's timeout).

Expected behavior / suggestion
It’s acceptable to raise ParserError on malformed inputs.

Consider whether extremely long sequences of control characters could be handled more efficiently or rejected faster to improve robustness.
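Until the parser itself rejects such inputs faster, callers can contain the failure. A minimal caller-side sketch (the `safe_read_csv` wrapper name is hypothetical, not a pandas API) that treats malformed input as "no data" instead of letting ParserError propagate:

```python
from io import BytesIO

import pandas as pd
from pandas.errors import ParserError


def safe_read_csv(buf, **kwargs):
    # Hypothetical wrapper: swallow ParserError from malformed bytes and
    # signal failure with None rather than an exception.
    try:
        return pd.read_csv(buf, **kwargs)
    except ParserError:
        return None


# Well-formed input still parses normally:
df = safe_read_csv(BytesIO(b"a,b\n1,2\n"))
```

Note this only contains the ParserError case; it does nothing about the slow-parse/timeout case, which is what the thresholds below are aimed at.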

Impact
No crash, no memory corruption.
Not security-relevant (not CVE-worthy).
Mostly affects fuzzing / extreme edge-case inputs.


Feature Description

  1. Define thresholds for pathological sequences:

    • Max consecutive control characters (e.g., \0, \r, \n, \t) in a row.
    • Max line length (already exists in many parsers, but can be tuned).
    • Max number of embedded quotes without separator.
  2. Scan input before tokenization (optional fast path):

    • Walk the input buffer, count consecutive control characters or other unusual patterns.
    • If thresholds exceeded → raise ParserError immediately.
  3. Modify _tokenize_helper / tokenizer logic:

    • Track how long tokenization has taken per unit (for fuzzing / long input detection).
    • If time exceeds a configurable limit → raise a timeout ParserError.
    • Optionally: track recursion depth for nested quotes/escapes.
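Step 2 (the optional pre-scan fast path) could look roughly like the sketch below. The `prescan` helper and the `max_run` threshold are placeholders for illustration, not existing pandas internals; a real implementation would live in the C tokenizer and use a tuned, configurable limit:

```python
def prescan(data: bytes, max_run: int = 1024) -> None:
    """Reject inputs with pathologically long runs of control characters.

    Hypothetical helper for the proposed fast path: walk the buffer once,
    count consecutive control bytes (\0, \t, \n, \r), and fail fast when
    the run exceeds a threshold, before tokenization starts.
    """
    control = {0x00, 0x09, 0x0A, 0x0D}
    run = 0
    for byte in data:
        if byte in control:
            run += 1
            if run > max_run:
                raise ValueError(
                    f"malformed input: more than {max_run} "
                    "consecutive control characters"
                )
        else:
            run = 0


# Normal CSV bytes pass the scan; a long run of \r is rejected up front.
prescan(b"a,b\n1,2\n")
```

The single linear pass keeps the fast path O(n) with no allocation, so well-formed files pay almost nothing for the check.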

Alternative Solutions

N/A

Additional Context

No response

Labels

Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)
