
mds_2025_helper_functions


A package to streamline common code chunks executed by students in the UBC MDS program circa 2025.

Functions

  • compare_model_scores() - a function that takes multiple models and returns a table of mean cross-validation (CV) scores for each, making them easy to compare.
  • perform_eda() - a function that performs exploratory data analysis (EDA) on a dataset.
  • dataset_summary() - a function that generates a comprehensive summary of a dataset, including missing value statistics, feature counts, duplicate rows, and descriptive statistics.
  • htv() - (Hypothesis Test Visualization) plots the result of a hypothesis test so that users can see what happened in the test rather than relying on numbers alone.

Similar packages

  • While this package extends cross-validation from scikit-learn, there are no known packages that provide CV score comparison similar to compare_model_scores(). The most similar is the summary_cv() function in the CrossPy package, which summarizes CV scores for a single model.

  • While the ProfileReport class from the ydata-profiling package provides automated exploratory data analysis and reporting, there are no known packages that offer the same level of flexible, on-demand visualizations and insights as the perform_eda() function. The most similar functionality is available in pandas-profiling, which generates detailed HTML reports but lacks the modular, interactive approach that perform_eda() provides for tailoring EDA to specific datasets and workflows.

  • The dataset_summary() function combines essential dataset insights—missing values, feature types, duplicates, and basic statistics—into one comprehensive and easy-to-use tool. While similar functionality exists in libraries like pandas-profiling and missingno, these tools focus on specific aspects or full-scale exploratory analysis. No single function consolidates all these features in one place, making dataset_summary() a uniquely efficient solution for preprocessing workflows.

  • No known package provides plots for hypothesis test output. Data scientists typically build these visualizations by hand, which is not friendly for learners.

Installation

$ pip install mds_2025_helper_functions

Usage

Function Documentation and Usage

1. compare_model_scores

Description:

This function compares the mean cross-validation scores of multiple ML models and produces a summary table.

Parameters:

  • *args (BaseEstimator): Models to evaluate (e.g., LogisticRegression, RandomForestClassifier, etc.).
  • X (array-like): Training dataset of features with shape (n_samples, n_features).
  • y (array-like, optional): Target values for supervised learning tasks.
  • scoring (string or callable, optional): Evaluation metrics (e.g., "accuracy"). Refer to the Scikit-learn scoring documentation.
  • return_train_scores (bool): Whether to include training scores in addition to test scores. Default is False.
  • **kwargs: Additional arguments for sklearn.model_selection.cross_validate.

Returns:

A pandas.DataFrame comparing the performance of the models.

Example Usage:

from mds_2025_helper_functions.scores import compare_model_scores
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Load sample dataset
data = load_iris()
X, y = data["data"], data["target"]

# Compare models
compare_model_scores(
    LogisticRegression(),
    DecisionTreeClassifier(),
    X=X,
    y=y,
    scoring="accuracy"
)
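
The same pattern works for regression models and for the optional keyword arguments. The sketch below is illustrative: it assumes, as described in the parameters above, that extra keyword arguments such as cv are forwarded to sklearn.model_selection.cross_validate.

from mds_2025_helper_functions.scores import compare_model_scores
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.tree import DecisionTreeRegressor

# Load a regression dataset
X, y = load_diabetes(return_X_y=True)

# Compare a baseline regressor against a decision tree,
# reporting train scores alongside the cross-validation scores
compare_model_scores(
    DummyRegressor(),
    DecisionTreeRegressor(),
    X=X,
    y=y,
    scoring="r2",
    return_train_scores=True,
    cv=5  # forwarded to sklearn.model_selection.cross_validate
)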

2. perform_eda

Description:

A one-stop Exploratory Data Analysis (EDA) function to generate data summaries, spot missing values, visualize feature distributions, and detect outliers.

Parameters:

  • dataframe (pd.DataFrame): The input dataset for analysis.
  • rows (int): Number of rows in the grid layout for visualizations. Default is 5.
  • cols (int): Number of columns in the grid layout for visualizations. Default is 2.

Returns:

  • Prints dataset statistics, missing values report, and an outlier summary.
  • Generates plots and visualizations using Matplotlib and Seaborn.

Example Usage:

from mds_2025_helper_functions.eda import perform_eda
import pandas as pd

data = {
    'Age': [25, 32, 47, 51, 62],
    'Salary': [50000, 60000, 120000, 90000, 85000],
    'Department': ['HR', 'Finance', 'IT', 'Finance', 'HR'],
}
df = pd.DataFrame(data)

perform_eda(df, rows=2, cols=2)
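
For larger datasets you can enlarge the visualization grid via rows and cols. The following sketch is illustrative only; it builds a synthetic DataFrame with NumPy (not part of the original example) and injects a few missing values so the missing-values report and outlier summary have something to show.

from mds_2025_helper_functions.eda import perform_eda
import numpy as np
import pandas as pd

rng = np.random.default_rng(2025)
df_wide = pd.DataFrame({
    'Age': rng.integers(22, 65, size=200),
    'Salary': rng.normal(80000, 15000, size=200).round(2),
    'Bonus': rng.exponential(5000, size=200).round(2),
    'Department': rng.choice(['HR', 'Finance', 'IT'], size=200),
})
df_wide.loc[::20, 'Salary'] = np.nan  # a few missing values

perform_eda(df_wide, rows=3, cols=2)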

3. dataset_summary

Description:

Generates a summary of a dataset including missing values, feature types, duplicate rows, and other descriptive statistics.

Parameters:

  • data (pd.DataFrame): The dataset to summarize.

Returns:

A dictionary containing:

  • 'missing_values': DataFrame of missing value counts and percentages.
  • 'feature_types': Counts of numerical and categorical features.
  • 'duplicates': Number of duplicate rows.
  • 'numerical_summary': Descriptive statistics for numerical columns.
  • 'categorical_summary': Unique value counts for categorical features.

Example Usage:

from mds_2025_helper_functions.dataset_summary import dataset_summary
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', None],
    'Age': [25, 32, 47, None, 29],
    'Salary': [50000, 60000, 120000, None, 80000],
    'Department': ['HR', 'Finance', 'IT', 'HR', 'Finance']
}
df = pd.DataFrame(data)

summary = dataset_summary(df)
print(summary['missing_values'])
print(summary['numerical_summary'])
print(summary['categorical_summary'])
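
Because the result is a plain dictionary, its entries can drive simple preprocessing decisions directly. A minimal sketch, using only the keys documented above:

# Drop duplicate rows if the summary reports any
if summary['duplicates'] > 0:
    df = df.drop_duplicates()

# Counts of numerical vs. categorical features
print(summary['feature_types'])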

4. htv

Description:

Visualizes Type I (α) and Type II (β) errors in hypothesis tests.

Parameters:

  • test_output (dict): Dictionary containing hypothesis test parameters:
    • 'mu0' (float): Mean under the null hypothesis (H₀).
    • 'mu1' (float): Mean under the alternative hypothesis (H₁).
    • 'sigma' (float): Standard deviation.
    • 'sample_size' (int): Sample size.
    • 'df' (int, optional): Degrees of freedom, required for 't' or 'chi2' tests.
    • 'df1', 'df2' (int, optional): For F-tests (anova).
  • test_type (str): Type of test ('z', 't', 'chi2', or 'anova').
  • alpha (float): Significance level for Type I error. Default is 0.05.
  • tail (str): 'one-tailed' or 'two-tailed'. Default is 'two-tailed'.

Returns:

  • A tuple of (fig, ax) for plotting the visualization.

Example Usage:

from mds_2025_helper_functions.htv import htv
import matplotlib.pyplot as plt

test_params = {
    'mu0': 100,
    'mu1': 105,
    'sigma': 15,
    'sample_size': 30
}

# Two-tailed z-test at the 5% significance level
fig, ax = htv(test_params, test_type="z", alpha=0.05, tail="two-tailed")
plt.show()

# One-tailed t-test; 'df' is required for t-tests (see Parameters above)
test_params_t = {
    'mu0': 0,
    'mu1': 1.5,
    'sigma': 1,
    'sample_size': 25,
    'df': 24
}

fig, ax = htv(test_params_t, test_type="t", alpha=0.01, tail="one-tailed")
plt.show()
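
Since htv() returns the Matplotlib figure and axes, the plot can be saved or customized like any other Matplotlib figure:

# Save the visualization instead of (or in addition to) showing it
fig, ax = htv(test_params, test_type="z", alpha=0.05, tail="two-tailed")
fig.savefig("htv_z_test.png", dpi=150, bbox_inches="tight")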

Notes:

  • Required imports:
    from mds_2025_helper_functions.scores import compare_model_scores
    from mds_2025_helper_functions.eda import perform_eda
    from mds_2025_helper_functions.dataset_summary import dataset_summary
    from mds_2025_helper_functions.htv import htv
    from sklearn.datasets import load_iris, load_diabetes
    from sklearn.dummy import DummyRegressor, DummyClassifier
    from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
    import matplotlib.pyplot as plt
    import pandas as pd
    import warnings
    warnings.filterwarnings('ignore')

Testing Commands

1. Run All Tests

pytest

2. Run a Specific Test File

pytest test_dataset_summary.py
pytest test_eda.py
pytest test_htv.py
pytest test_scores.py

3. Run a Specific Test Function

pytest test_dataset_summary.py::test_function_name

4. Run Tests with Verbose Output

pytest -v

5. Run Tests with Coverage

pytest --cov=.

Generate an HTML coverage report:

pytest --cov=. --cov-report=html
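
To restrict the report to the package itself rather than everything under the current directory, pass the package name to --cov (pytest-cov accepts either a path or an importable package name):

pytest --cov=mds_2025_helper_functions --cov-report=html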

6. Run Tests in Parallel (Optional)

Run tests with 4 parallel workers (requires the pytest-xdist plugin):

pytest -n 4

7. Clear Pytest Cache

pytest --cache-clear

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

Contributors

Karlygash Zhakupbayeva, Samuel Adetsi, Xi Cu, Michael Hewlett

License

mds_2025_helper_functions was created by Karlygash Zhakupbayeva, Samuel Adetsi, Xi Cu, Michael Hewlett. It is licensed under the terms of the MIT license.

Credits

mds_2025_helper_functions was created with cookiecutter and the py-pkgs-cookiecutter template.