A package to streamline common code chunks executed by students in the UBC MDS program circa 2025.
- compare_model_scores() - a function that takes multiple models and returns a table of mean CV scores for each model, for easy comparison.
- perform_eda() - a function to perform exploratory data analysis on a dataset.
- dataset_summary() - a function that generates a comprehensive summary of a dataset, including missing value statistics, feature counts, duplicate rows, and descriptive statistics.
- htv() - (Hypothesis Test Visualization) a function that produces plots of a user's hypothesis test results, making it easier to understand what happened in the test than the raw numbers alone.
- While this package extends cross-validation from scikit-learn, there are no known packages that provide CV score comparison similar to compare_model_scores(). The most similar is the summary_cv() function in the CrossPy package, which summarizes CV scores for a single model.
- While the ProfileReport class from the ydata-profiling package provides automated exploratory data analysis and reporting, there are no known packages that offer the same level of flexible, on-demand visualizations and insights as the perform_eda() function. The most similar functionality is available in pandas-profiling, which generates detailed HTML reports but lacks the modular, interactive approach that perform_eda() provides for tailoring EDA to specific datasets and workflows.
- The dataset_summary() function combines essential dataset insights (missing values, feature types, duplicates, and basic statistics) into one comprehensive and easy-to-use tool. While similar functionality exists in libraries like pandas-profiling and missingno, these tools focus on specific aspects or full-scale exploratory analysis. No single function consolidates all these features in one place, making dataset_summary() a uniquely efficient solution for preprocessing workflows.
- No known function provides plots for hypothesis test output; data scientists typically build such visualizations by hand, which is not friendly for learners.
## Installation

```bash
$ pip install mds_2025_helper_functions
```
# Function Documentation and Usage
## 1. `compare_model_scores`
### Description:
This function compares the mean cross-validation scores of multiple ML models and produces a summary table.
### Parameters:
- `*args` (BaseEstimator): Models to evaluate (e.g., `LogisticRegression`, `RandomForestClassifier`, etc.).
- `X` (array-like): Training dataset of features with shape `(n_samples, n_features)`.
- `y` (array-like, optional): Target values for supervised learning tasks.
- `scoring` (string or callable, optional): Evaluation metrics (e.g., `"accuracy"`). Refer to the [Scikit-learn scoring documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).
- `return_train_scores` (bool): Whether to include training scores in addition to test scores. Default is `False`.
- `**kwargs`: Additional arguments for `sklearn.model_selection.cross_validate`.
### Returns:
A `pandas.DataFrame` comparing the performance of the models.
### Example Usage:
```python
from mds_2025_helper_functions.scores import compare_model_scores
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
# Load sample dataset
data = load_iris()
X, y = data["data"], data["target"]
# Compare models
results = compare_model_scores(
    LogisticRegression(),
    DecisionTreeClassifier(),
    X=X,
    y=y,
    scoring="accuracy"
)
print(results)
```
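The same call works for regression estimators. The following is a supplementary sketch, not one of the package's documented examples: it assumes `cv` is forwarded to `sklearn.model_selection.cross_validate` through `**kwargs`, as described in the parameters above, and uses scikit-learn's built-in `neg_mean_squared_error` scorer.

```python
from mds_2025_helper_functions.scores import compare_model_scores
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.tree import DecisionTreeRegressor

# Load a regression dataset
X, y = load_diabetes(return_X_y=True)

# Compare a baseline regressor against a decision tree,
# reporting train scores and using 10-fold CV
results = compare_model_scores(
    DummyRegressor(),
    DecisionTreeRegressor(random_state=123),
    X=X,
    y=y,
    scoring="neg_mean_squared_error",
    return_train_scores=True,
    cv=10,  # forwarded to sklearn.model_selection.cross_validate
)
print(results)
```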
## 2. `perform_eda`
### Description:
A one-stop Exploratory Data Analysis (EDA) function to generate data summaries, spot missing values, visualize feature distributions, and detect outliers.
### Parameters:
- `dataframe` (pd.DataFrame): The input dataset for analysis.
- `rows` (int): Number of rows in the grid layout for visualizations. Default is 5.
- `cols` (int): Number of columns in the grid layout for visualizations. Default is 2.
### Returns:
- Prints dataset statistics, missing values report, and an outlier summary.
- Generates plots and visualizations using Matplotlib and Seaborn.
### Example Usage:
```python
from mds_2025_helper_functions.eda import perform_eda
import pandas as pd

data = {
    'Age': [25, 32, 47, 51, 62],
    'Salary': [50000, 60000, 120000, 90000, 85000],
    'Department': ['HR', 'Finance', 'IT', 'Finance', 'HR'],
}
df = pd.DataFrame(data)

# Run EDA with a 2x2 grid of plots
perform_eda(df, rows=2, cols=2)
```
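Because `perform_eda()` also reports missing values and outliers, a dataset that actually contains them shows those parts of the output. This is an illustrative sketch; the `NaN` entries and the extreme `Salary` value are invented for demonstration:

```python
import numpy as np
import pandas as pd
from mds_2025_helper_functions.eda import perform_eda

# Dataset with missing values and one extreme outlier in 'Salary'
data = {
    'Age': [25, 32, np.nan, 51, 62, 38, 29, 44],
    'Salary': [50000, 60000, 120000, np.nan, 85000, 900000, 72000, 66000],
    'Department': ['HR', 'Finance', 'IT', None, 'HR', 'IT', 'Finance', 'HR'],
}
df = pd.DataFrame(data)

perform_eda(df, rows=2, cols=2)
```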
## 3. `dataset_summary`
### Description:
Generates a summary of a dataset including missing values, feature types, duplicate rows, and other descriptive statistics.
### Parameters:
- `data` (pd.DataFrame): The dataset to summarize.
### Returns:
A dictionary containing:
- `'missing_values'`: DataFrame of missing value counts and percentages.
- `'feature_types'`: Counts of numerical and categorical features.
- `'duplicates'`: Number of duplicate rows.
- `'numerical_summary'`: Descriptive statistics for numerical columns.
- `'categorical_summary'`: Unique value counts for categorical features.
### Example Usage:
```python
from mds_2025_helper_functions.dataset_summary import dataset_summary
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', None],
    'Age': [25, 32, 47, None, 29],
    'Salary': [50000, 60000, 120000, None, 80000],
    'Department': ['HR', 'Finance', 'IT', 'HR', 'Finance']
}
df = pd.DataFrame(data)

summary = dataset_summary(df)
print(summary['missing_values'])
print(summary['numerical_summary'])
print(summary['categorical_summary'])
```
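The returned dictionary also contains the `'feature_types'` and `'duplicates'` entries listed above, which the example does not print; they are accessed the same way:

```python
# Counts of numerical and categorical features
print(summary['feature_types'])

# Number of duplicate rows in the dataset
print(summary['duplicates'])
```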
## 4. `htv`
### Description:
Visualizes Type I (α) and Type II (β) errors in hypothesis tests.
### Parameters:
- `test_output` (dict): Dictionary containing hypothesis test parameters:
  - `'mu0'` (float): Mean under the null hypothesis (H₀).
  - `'mu1'` (float): Mean under the alternative hypothesis (H₁).
  - `'sigma'` (float): Standard deviation.
  - `'sample_size'` (int): Sample size.
  - `'df'` (int, optional): Degrees of freedom, required for `'t'` or `'chi2'` tests.
  - `'df1'`, `'df2'` (int, optional): For F-tests (`'anova'`).
- `test_type` (str): Type of test (`'z'`, `'t'`, `'chi2'`, or `'anova'`).
- `alpha` (float): Significance level for Type I error. Default is `0.05`.
- `tail` (str): `'one-tailed'` or `'two-tailed'`. Default is `'two-tailed'`.
### Returns:
- A tuple of `(fig, ax)` for plotting the visualization.
### Example Usage:
```python
from mds_2025_helper_functions.htv import htv
import matplotlib.pyplot as plt

# Two-tailed z-test visualization
test_params = {
    'mu0': 100,
    'mu1': 105,
    'sigma': 15,
    'sample_size': 30
}
fig, ax = htv(test_params, test_type="z", alpha=0.05, tail="two-tailed")
plt.show()

# One-tailed t-test visualization
test_params_t = {
    'mu0': 0,
    'mu1': 1.5,
    'sigma': 1,
    'sample_size': 25,
    'df': 24  # degrees of freedom, required for 't' tests per the parameters above
}
fig, ax = htv(test_params_t, test_type="t", alpha=0.01, tail="one-tailed")
plt.show()
```
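The `'anova'` branch uses the extra degrees-of-freedom keys described in the parameters above. The sketch below is illustrative only: the specific values are invented, and it assumes the function reads `'df1'` and `'df2'` from `test_output` as documented and accepts a one-tailed setting for the F-test.

```python
from mds_2025_helper_functions.htv import htv
import matplotlib.pyplot as plt

# F-test (ANOVA) visualization: df1/df2 are the numerator and
# denominator degrees of freedom from the parameter list above
test_params_f = {
    'mu0': 1.0,        # value under H0 (illustrative)
    'mu1': 2.5,        # value under H1 (illustrative)
    'sigma': 1,
    'sample_size': 40,
    'df1': 3,
    'df2': 36
}
fig, ax = htv(test_params_f, test_type="anova", alpha=0.05, tail="one-tailed")
plt.show()
```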
Required imports for the examples above:

```python
from mds_2025_helper_functions.scores import compare_model_scores
from mds_2025_helper_functions.eda import perform_eda
from mds_2025_helper_functions.dataset_summary import dataset_summary
from mds_2025_helper_functions.htv import htv
from sklearn.datasets import load_iris, load_diabetes
from sklearn.dummy import DummyRegressor, DummyClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
import matplotlib.pyplot as plt
import pandas as pd
import warnings

warnings.filterwarnings('ignore')
```
## Running Tests

Run the full test suite:

```bash
pytest
```

Run an individual test module:

```bash
pytest test_dataset_summary.py
pytest test_eda.py
pytest test_htv.py
pytest test_scores.py
```

Run a single test function:

```bash
pytest test_dataset_summary.py::test_function_name
```

Run tests with verbose output:

```bash
pytest -v
```

Measure test coverage (requires pytest-cov):

```bash
pytest --cov=.
```

Generate an HTML coverage report:

```bash
pytest --cov=. --cov-report=html
```

Run tests with 4 parallel workers (requires pytest-xdist):

```bash
pytest -n 4
```

Clear the pytest cache before a run:

```bash
pytest --cache-clear
```
## Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.
## Contributors

Karlygash Zhakupbayeva, Samuel Adetsi, Xi Cu, Michael Hewlett
## License

`mds_2025_helper_functions` was created by Karlygash Zhakupbayeva, Samuel Adetsi, Xi Cu, Michael Hewlett. It is licensed under the terms of the MIT license.
## Credits

`mds_2025_helper_functions` was created with `cookiecutter` and the `py-pkgs-cookiecutter` template.