Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
Add a small convenience API to provide a quick, per-dtype view of missing values in a DataFrame. The utility should list columns grouped by dtype with null counts and optional null percentages, and return both a one-row-per-dtype summary and a per-dtype detail table (columns + null counts).
This is a diagnostic convenience (similar in spirit to df.info(show_counts=True) but grouped by dtype and returning programmatic output).
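To make the contrast with df.info() concrete, here is a small sketch. The sample frame and variable names are illustrative only. It shows that info() writes a text report and returns None, whereas the per-column null counts are already available as data via isna().sum().

```python
import io

import pandas as pd

# Illustrative frame; column names are arbitrary.
df = pd.DataFrame({"a": [1.0, None, 3.0], "c": ["x", "y", None]})

# info() writes a human-readable report to a buffer (stdout by default)
# and returns None, so its counts are not directly usable in code.
buf = io.StringIO()
result = df.info(show_counts=True, buf=buf)
report = buf.getvalue()

# The same per-column information as a Series that code can consume.
null_counts = df.isna().sum()
```

The proposed accessor would build on the second form and group it by dtype.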
Feature Description
Add a DataFrame accessor that provides a compact, programmatic summary of missing values grouped by column dtype.
@pd.api.extensions.register_dataframe_accessor("dtype_nulls")
class DtypeNullsAccessor:
    def __init__(self, df):
        self._df = df

    def summary(self, include_pct: bool = True, sort_desc: bool = True):
        """
        Return (summary_df, detail_dict).

        Parameters
        ----------
        include_pct : bool, default True
            Include null_pct columns (percentage of nulls relative to len(df)).
        sort_desc : bool, default True
            Sort per-dtype detail tables by null_count descending when True,
            ascending when False.

        Returns
        -------
        summary_df : pd.DataFrame
            One row per dtype with columns:
            - dtype : str (dtype string, e.g., 'float64', 'object')
            - n_columns : int
            - cols_with_nulls : int
            - total_nulls : int
            - avg_null_pct : float (if include_pct)
        detail_dict : dict[str, pd.DataFrame]
            Mapping dtype string -> DataFrame listing columns of that dtype with
            columns ['column', 'null_count'], plus 'null_pct' if include_pct.
        """
Implementation sketch / pseudocode (body of summary):
df = self._df
nrows = len(df)
per_col = pd.DataFrame({
    "column": df.columns,
    "dtype": df.dtypes.astype(str).to_numpy(),
    "null_count": df.isna().sum().to_numpy(),
})
if include_pct:
    # Guard against division by zero for an empty frame.
    per_col["null_pct"] = per_col["null_count"] / (nrows if nrows else 1) * 100
# Drop the redundant 'dtype' column so each detail table matches the docstring.
detail = {
    dtype: g.drop(columns="dtype")
            .sort_values("null_count", ascending=not sort_desc)
            .reset_index(drop=True)
    for dtype, g in per_col.groupby("dtype")
}
agg = per_col.groupby("dtype").agg(
    n_columns=("column", "count"),
    cols_with_nulls=("null_count", lambda s: (s > 0).sum()),
    total_nulls=("null_count", "sum"),
).reset_index()
if include_pct:
    agg["avg_null_pct"] = per_col.groupby("dtype")["null_pct"].mean().to_numpy()
return agg, detail
Expected behaviour / examples:
df = pd.DataFrame({
    "a": [1, None, 3],
    "b": [None, None, 2.0],
    "c": ["x", "y", None],
    "d": [True, False, True],
})
summary, detail = df.dtype_nulls.summary()
# summary: rows for 'float64', 'object', 'bool' with counts and percentages
# detail['float64'] lists columns 'b' and 'a' with null_count and null_pct
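Putting the accessor skeleton, the sketch, and the example together, a minimal runnable version might look like the following. It is a sketch under the assumptions stated in this issue, not a proposed final implementation. Building the aggregation spec as a dict and dropping the `dtype` column from the detail tables are choices made here for illustration. Note that the label reported for the string column `c` depends on the pandas version (`'object'` on pandas 2.x), so the comments only state values for the `float64` and `bool` groups.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("dtype_nulls")
class DtypeNullsAccessor:
    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    def summary(self, include_pct: bool = True, sort_desc: bool = True):
        df = self._df
        nrows = len(df)

        # One row per column: its name, dtype string, and null count.
        per_col = pd.DataFrame({
            "column": list(df.columns),
            "dtype": df.dtypes.astype(str).to_numpy(),
            "null_count": df.isna().sum().to_numpy(),
        })

        # Named-aggregation spec; the percentage column is optional.
        spec = {
            "n_columns": ("column", "count"),
            "cols_with_nulls": ("null_count", lambda s: int((s > 0).sum())),
            "total_nulls": ("null_count", "sum"),
        }
        if include_pct:
            # Avoid dividing by zero when the frame has no rows.
            per_col["null_pct"] = per_col["null_count"] / (nrows or 1) * 100
            spec["avg_null_pct"] = ("null_pct", "mean")

        # Per-dtype detail tables, sorted by null count.
        detail = {
            dtype: g.drop(columns="dtype")
                    .sort_values("null_count", ascending=not sort_desc)
                    .reset_index(drop=True)
            for dtype, g in per_col.groupby("dtype")
        }
        summary = per_col.groupby("dtype").agg(**spec).reset_index()
        return summary, detail


df = pd.DataFrame({
    "a": [1, None, 3],        # float64, 1 null
    "b": [None, None, 2.0],   # float64, 2 nulls
    "c": ["x", "y", None],    # string column, 1 null
    "d": [True, False, True], # bool, 0 nulls
})
summary, detail = df.dtype_nulls.summary()

# float64 group: n_columns=2, cols_with_nulls=2, total_nulls=3,
# avg_null_pct is the mean of 33.3% and 66.7%, i.e. about 50.
float_row = summary.set_index("dtype").loc["float64"]
float_cols = list(detail["float64"]["column"])  # 'b' first, then 'a'
```

Sorting descending by default puts the worst column of each dtype at the top of its detail table, which suits the diagnostic use case described above.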
Alternative Solutions
One-liner / ad-hoc: Users can already compute this with a short snippet:
(pd.DataFrame({'dtype': df.dtypes.astype(str), 'nulls': df.isna().sum()})
    .reset_index()
    .groupby('dtype')
    .agg(columns=('index', list), total_nulls=('nulls', 'sum')))
Additional Context
Related design rationale:
This feature is a convenience diagnostic that complements df.info() and profiling packages; it returns programmatic data structures (DataFrames and dict) so downstream tooling and tests can consume results.
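As an illustration of how tests could consume this kind of output, here is a sketch of a data-quality check built only on existing pandas calls. The helper name `null_pct_by_dtype`, the threshold, and the sample data are hypothetical and not part of the proposal.

```python
import pandas as pd


def null_pct_by_dtype(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: mean null percentage per dtype string."""
    pct = df.isna().mean() * 100                      # per-column null %
    return pct.groupby(df.dtypes.astype(str)).mean()  # averaged per dtype


# Illustrative data: one float column with 1 null out of 4 rows, one bool column.
df = pd.DataFrame({
    "price": [1.0, None, 3.0, 4.0],
    "in_stock": [True, False, True, True],
})

avg = null_pct_by_dtype(df)

# A test can assert on the numbers directly instead of parsing printed text.
MAX_AVG_NULL_PCT = 30.0  # assumed budget for this example
offenders = avg[avg > MAX_AVG_NULL_PCT]
assert offenders.empty, f"dtypes over null budget: {list(offenders.index)}"
```

With the accessor in place, the same check could read the `avg_null_pct` column of the returned summary instead of recomputing it.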