-
-
Notifications
You must be signed in to change notification settings - Fork 19.5k
Description
Pandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# pandas version 1.4.1
import pandas as pd
from django.db import connection
from datetime import date
month = date.today().month
year = date.today().year
query = f"""
SELECT
id AS ID,
CASE
WHEN MAX(CASE WHEN site = 'A' THEN rating END) IS NULL OR MAX(CASE WHEN site = 'A' THEN rating END) <= 0
THEN ''
ELSE MAX(CASE WHEN site = 'A' THEN rating END)
END AS A_RATING,
CASE
WHEN MAX(CASE WHEN site = 'A' THEN reviews END) IS NULL OR MAX(CASE WHEN site = 'A' THEN reviews END) <= 0
THEN ''
ELSE MAX(CASE WHEN site = 'A' THEN FORMAT(IFNULL(reviews, ''), 0) END)
END AS A_REVIEWS,
MAX(CASE WHEN site = 'A' THEN slug END) AS A_URL,
CASE
WHEN MAX(CASE WHEN site = 'B' THEN rating END) IS NULL OR MAX(CASE WHEN site = 'B' THEN rating END) <= 0
THEN ''
ELSE MAX(CASE WHEN site = 'B' THEN rating END)
END AS AR_RATING,
CASE
WHEN MAX(CASE WHEN site = 'B' THEN reviews END) IS NULL OR MAX(CASE WHEN site = 'B' THEN reviews END) <= 0
THEN ''
ELSE MAX(CASE WHEN site = 'B' THEN FORMAT(IFNULL(reviews, ''), 0) END)
END AS AR_REVIEWS,
MAX(CASE WHEN site = 'B' THEN slug END) AS AR_URL
FROM PropertyReviewInsights WHERE YEAR(added_on) = {year} AND MONTH(added_on) = {month}
GROUP BY id ORDER BY id;
"""
df = pd.read_sql(query, connection)
file_name = f"ABC.csv"
file_path = f'/srv/{file_name}'
df.to_csv(file_path, index=False)
connection.close()Issue Description
I am observing a discrepancy between the number of rows returned by pd.read_sql() and a manual DB-API cursor execution using the same SQL query and the same database connection.
Using a manual MySQL cursor execution, the query returns 117,464 rows
Using pd.read_sql(), the same query returns 116,383 rows
In addition to the reduced row count, I also observed that for approximately 5,000 rows, some columns are returned as blank / NULL values when using pd.read_sql(), whereas those same rows and columns contain valid data when fetched via the manual cursor approach.
The manual cursor approach consistently returns the full dataset with all columns populated correctly, while pd.read_sql() returns fewer rows and partially blank data without raising any error or warning.
This behavior is unexpected and may indicate an issue with how pandas iterates over or buffers database cursors, or how it handles row fetching and type conversion when interfacing with MySQL drivers.
Expected Behavior
pd.read_sql() should return all rows produced by the SQL query when using the same database connection and query as a manual DB-API cursor execution.
Specifically:
The row count returned by pd.read_sql() should exactly match the row count returned by a manual cursor.execute() followed by fetchall()
No rows should be dropped or skipped
No column values should be returned as blank or NULL if the underlying database row contains valid data
Any issues during data fetching, cursor iteration, or type conversion should raise a clear error or warning, rather than silently returning incomplete data
In this case, both approaches are expected to return 117,464 rows with fully populated columns, matching the database results exactly.
Installed Versions
Details
INSTALLED VERSIONS
commit : 06d2301
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-216-generic
Version : #236-Ubuntu SMP Fri Apr 11 19:53:21 UTC 2025
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.1
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 20.0.2
setuptools : 44.0.0
Cython : None
pytest : 7.0.1
hypothesis : None
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : 3.0.2
lxml.etree : 4.7.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.6.0
pandas_datareader: None
bs4 : 4.13.4
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None