Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
When reading certain malformed CSV bytes, `pandas.read_csv` either raises a `ParserError` or takes a long time to process the input. While this does not crash Python or cause a security issue, it may indicate a performance edge case, or an area where the parser could handle extreme inputs more gracefully.
Reproduction steps
```python
from io import BytesIO

from pandas import read_csv

# Malformed CSV bytes (long runs of control characters, stray quotes, "inf")
# that trigger the buffer-overflow ParserError.
data = b"\x09\x00\x31\x2d\x2e\x23\x51\x00\x61\x2c\x00\x22\x0d\x31\x0d\x0d\x20\x0d\x3a\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0a\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x2c\x21\x0d\x0d\x0d\x0d\x0d\x0d\x69\x6e\x66\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x0d\x2c\x0d\x0d\x20\x20\x2c\x22\x22\x22\x2c\x0a\x2c\x2c"
read_csv(BytesIO(data), encoding='latin1')
```
Observed behavior
- `ParserError: Buffer overflow caught - possible malformed input file` is raised.
- In fuzz testing, the parser may also take a long time on certain inputs (hitting the fuzzer's timeout).
Expected behavior / suggestion
It’s acceptable to raise ParserError on malformed inputs.
Consider whether extremely long sequences of control characters could be handled more efficiently or rejected faster to improve robustness.
Impact
- No crash, no memory corruption.
- Not security-relevant (not CVE-worthy).
- Mostly affects fuzzing / extreme edge-case inputs.
Feature Description
- Define thresholds for pathological sequences:
  - Maximum run of consecutive control characters (e.g., `\0`, `\r`, `\n`, `\t`).
  - Maximum line length (this already exists in many parsers, but could be tuned).
  - Maximum number of embedded quotes without a separator.
- Scan the input before tokenization (optional fast path; see the first sketch below):
  - Walk the input buffer, counting consecutive control characters and other unusual patterns.
  - If a threshold is exceeded, raise `ParserError` immediately.
- Modify `_tokenize_helper` / the tokenizer logic (see the second sketch below):
  - Track how long tokenization has taken per unit (for fuzzing / long-input detection).
  - If the time exceeds a configurable limit, raise a timeout `ParserError`.
  - Optionally, track recursion depth for nested quotes/escapes.
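A minimal sketch of what the pre-scan fast path could look like, as a standalone helper rather than actual pandas parser code; `prescan`, `MAX_CONTROL_RUN`, and the threshold value are hypothetical and would need tuning:

```python
# Hypothetical sketch: a pre-tokenization scan that rejects pathologically
# long runs of control characters early. Not pandas' actual parser code.
from pandas.errors import ParserError

MAX_CONTROL_RUN = 4096  # hypothetical threshold
CONTROL_BYTES = frozenset(b"\x00\t\r\n")

def prescan(buf: bytes, max_run: int = MAX_CONTROL_RUN) -> None:
    """Raise ParserError if `buf` contains more than `max_run`
    consecutive control characters."""
    run = 0
    for byte in buf:
        if byte in CONTROL_BYTES:
            run += 1
            if run > max_run:
                raise ParserError(
                    f"rejected input: over {max_run} consecutive "
                    "control characters"
                )
        else:
            run = 0
```

Calling `prescan(data)` on the reproduction bytes above would pass (the `\r` runs there are short), but an input consisting of megabytes of control characters would be rejected in a single linear pass before ever reaching the tokenizer.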
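And a sketch of the time-budget idea for the tokenizer loop, again with hypothetical names (`tokenize_chunk` is a trivial stand-in; the real pandas tokenizer is C code and is not structured like this):

```python
# Hypothetical sketch: enforce a wall-clock budget across a tokenization call.
import time

from pandas.errors import ParserError

TIME_BUDGET_SECONDS = 1.0  # hypothetical configurable limit

def tokenize_chunk(chunk: bytes) -> list[bytes]:
    # Stand-in for real tokenization; just splits on commas.
    return chunk.split(b",")

def tokenize_with_budget(chunks, budget: float = TIME_BUDGET_SECONDS):
    start = time.monotonic()
    tokens = []
    for chunk in chunks:
        tokens.extend(tokenize_chunk(chunk))
        if time.monotonic() - start > budget:
            raise ParserError(
                f"tokenization exceeded {budget:.1f}s time budget; "
                "possible pathological input"
            )
    return tokens
```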
Alternative Solutions
N/A
Additional Context
No response