
A package to streamline common code chunks executed by students in the UBC MDS program circa 2025.


  • compare_model_scores() - a function that takes multiple models and returns a table of mean CV scores for each for easy comparison.
  • perform_eda() - a function to perform exploratory data analysis on a dataset
  • dataset_summary() - a function that generates a comprehensive summary of a dataset, including missing value statistics, feature counts, duplicate rows, and descriptive statistics.
  • htv() - (Hypothesis Test Visualization) a function that plots the result of a hypothesis test, making it easier to see what happened in the test than from the numbers alone.

Similar packages

  • While this package extends cross-validation from scikit-learn, there are no known packages that provide CV score comparison similar to compare_model_scores(). The most similar is the summary_cv() function in the CrossPy package, which summarizes CV scores for a single model.

  • While the ProfileReport class from the ydata-profiling package (formerly pandas-profiling) provides automated exploratory data analysis and reporting, there are no known packages that offer the same level of flexible, on-demand visualizations and insights as the perform_eda() function. ProfileReport generates detailed HTML reports but lacks the modular, interactive approach that perform_eda() provides for tailoring EDA to specific datasets and workflows.

  • The dataset_summary() function combines essential dataset insights—missing values, feature types, duplicates, and basic statistics—into one comprehensive and easy-to-use tool. While similar functionality exists in libraries like pandas-profiling and missingno, these tools focus on specific aspects or full-scale exploratory analysis. No single function consolidates all these features in one place, making dataset_summary() a uniquely efficient solution for preprocessing workflows.

  • There are no known functions that plot the output of a hypothesis test. Data scientists typically build these plots manually, which is not friendly for learners.


$ pip install mds_2025_helper_functions


Function Documentation and Usage

1. compare_model_scores


This function compares the mean cross-validation scores of multiple ML models and produces a summary table.


  • *args (BaseEstimator): Models to evaluate (e.g., LogisticRegression, RandomForestClassifier, etc.).
  • X (array-like): Training dataset of features with shape (n_samples, n_features).
  • y (array-like, optional): Target values for supervised learning tasks.
  • scoring (string or callable, optional): Evaluation metrics (e.g., "accuracy"). Refer to the Scikit-learn scoring documentation.
  • return_train_scores (bool): Whether to include training scores in addition to test scores. Default is False.
  • **kwargs: Additional arguments for sklearn.model_selection.cross_validate.


A pandas.DataFrame comparing the performance of the models.

Example Usage:

from mds_2025_helper_functions.scores import compare_model_scores
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Load sample dataset
data = load_iris()
X, y = data["data"], data["target"]

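# Compare models: pass the estimators first, then X and y as keyword arguments
results = compare_model_scores(
    DecisionTreeClassifier(),
    LogisticRegression(max_iter=1000),
    X=X,
    y=y,
    scoring="accuracy",
)
print(results)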
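
The comparison described in this section could be implemented roughly as follows. This is a minimal sketch, not the package's actual code: `mean_cv_scores` is a hypothetical helper, and the only pieces taken from the text above are `sklearn.model_selection.cross_validate` and the documented parameters. Note that scikit-learn itself spells the flag `return_train_score` (singular).

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier


def mean_cv_scores(models, X, y, scoring=None, return_train_score=False, **kwargs):
    """Hypothetical helper: one row of mean cross-validation results per model."""
    rows = {}
    for model in models:
        cv_results = cross_validate(
            model, X, y,
            scoring=scoring,
            return_train_score=return_train_score,
            **kwargs,
        )
        # cross_validate returns one array per key (fit_time, score_time,
        # test_score, ...); average each array over the folds.
        rows[type(model).__name__] = {key: values.mean() for key, values in cv_results.items()}
    return pd.DataFrame.from_dict(rows, orient="index")


X, y = load_iris(return_X_y=True)
table = mean_cv_scores(
    [DummyClassifier(strategy="most_frequent"), DecisionTreeClassifier(random_state=0)],
    X, y, scoring="accuracy", cv=5,
)
print(table)
```

Including a dummy baseline in the comparison makes it easy to check that the real models are learning something beyond the majority class.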

2. perform_eda


A one-stop Exploratory Data Analysis (EDA) function to generate data summaries, spot missing values, visualize feature distributions, and detect outliers.


  • dataframe (pd.DataFrame): The input dataset for analysis.
  • rows (int): Number of rows in the grid layout for visualizations. Default is 5.
  • cols (int): Number of columns in the grid layout for visualizations. Default is 2.


  • Prints dataset statistics, missing values report, and an outlier summary.
  • Generates plots and visualizations using Matplotlib and Seaborn.

Example Usage:

from mds_2025_helper_functions.eda import perform_eda
import pandas as pd

data = {
    'Age': [25, 32, 47, 51, 62],
    'Salary': [50000, 60000, 120000, 90000, 85000],
    'Department': ['HR', 'Finance', 'IT', 'Finance', 'HR'],
}
df = pd.DataFrame(data)

perform_eda(df, rows=2, cols=2)
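
The text above does not say how perform_eda() detects outliers. As an illustration only, the sketch below assumes the common 1.5 × IQR rule and reproduces a missing-values report plus a per-column outlier count with plain pandas. `iqr_outlier_counts` is a hypothetical helper, not part of the package.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: count values outside [Q1 - k*IQR, Q3 + k*IQR] per numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Compare each column against its own lower and upper fence.
    below = numeric.lt(q1 - k * iqr, axis="columns")
    above = numeric.gt(q3 + k * iqr, axis="columns")
    return (below | above).sum()


df = pd.DataFrame({
    'Age': [25, 32, 47, 51, 62],
    'Salary': [50000, 60000, 120000, 90000, 85000],
    'Department': ['HR', 'Finance', 'IT', 'Finance', 'HR'],
})

print(df.isna().sum())         # missing values per column
counts = iqr_outlier_counts(df)
print(counts)                  # outlier count per numeric column
```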

3. dataset_summary


Generates a summary of a dataset including missing values, feature types, duplicate rows, and other descriptive statistics.


  • data (pd.DataFrame): The dataset to summarize.


A dictionary containing:

  • 'missing_values': DataFrame of missing value counts and percentages.
  • 'feature_types': Counts of numerical and categorical features.
  • 'duplicates': Number of duplicate rows.
  • 'numerical_summary': Descriptive statistics for numerical columns.
  • 'categorical_summary': Unique value counts for categorical features.

Example Usage:

from mds_2025_helper_functions.dataset_summary import dataset_summary
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', None],
    'Age': [25, 32, 47, None, 29],
    'Salary': [50000, 60000, 120000, None, 80000],
    'Department': ['HR', 'Finance', 'IT', 'HR', 'Finance']
}
df = pd.DataFrame(data)

summary = dataset_summary(df)
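
To make the shape of the returned dictionary concrete, the sketch below builds a dictionary with the same five keys using plain pandas on the example data. It is an approximation under assumptions: the column names `missing_count` and `missing_pct` and the layout of each value are illustrative, and the package's real output may differ.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', None],
    'Age': [25, 32, 47, None, 29],
    'Salary': [50000, 60000, 120000, None, 80000],
    'Department': ['HR', 'Finance', 'IT', 'HR', 'Finance'],
})

missing_count = df.isna().sum()
numeric_cols = df.select_dtypes(include='number').columns
categorical_cols = df.columns.difference(numeric_cols)

# Same keys as the documented return value; the value layouts are assumptions.
sketch = {
    'missing_values': pd.DataFrame({
        'missing_count': missing_count,
        'missing_pct': missing_count / len(df) * 100,
    }),
    'feature_types': {'numerical': len(numeric_cols), 'categorical': len(categorical_cols)},
    'duplicates': int(df.duplicated().sum()),
    'numerical_summary': df[numeric_cols].describe(),
    'categorical_summary': df[categorical_cols].nunique(),
}

for key, value in sketch.items():
    print(f"--- {key} ---")
    print(value)
```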

4. htv


Visualizes Type I (α) and Type II (β) errors in hypothesis tests.


  • test_output (dict): Dictionary containing hypothesis test parameters:
    • 'mu0' (float): Mean under the null hypothesis (H₀).
    • 'mu1' (float): Mean under the alternative hypothesis (H₁).
    • 'sigma' (float): Standard deviation.
    • 'sample_size' (int): Sample size.
    • 'df' (int, optional): Degrees of freedom, required for 't' or 'chi2' tests.
    • 'df1', 'df2' (int, optional): For F-tests (anova).
  • test_type (str): Type of test ('z', 't', 'chi2', or 'anova').
  • alpha (float): Significance level for Type I error. Default is 0.05.
  • tail (str): 'one-tailed' or 'two-tailed'. Default is 'two-tailed'.


  • A tuple of (fig, ax) for plotting the visualization.

Example Usage:

from mds_2025_helper_functions.htv import htv
import matplotlib.pyplot as plt

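# Example 1: two-tailed z-test
test_params = {
    'mu0': 100,
    'mu1': 105,
    'sigma': 15,
    'sample_size': 30
}

fig, ax = htv(test_params, test_type="z", alpha=0.05, tail="two-tailed")
plt.show()

# Example 2: one-tailed t-test
test_params_t = {
    'mu0': 0,
    'mu1': 1.5,
    'sigma': 1,
    'sample_size': 25,
    'df': 24  # required for 't' tests (see Parameters); assumed here to be sample_size - 1
}

fig, ax = htv(test_params_t, test_type="t", alpha=0.01, tail="one-tailed")
plt.show()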
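
The quantities that such a plot shades can also be computed directly. The sketch below works through the two-tailed z-test example using only the Python standard library, assuming the sampling distribution of the mean is normal with standard error sigma / sqrt(sample_size). `z_test_errors` is a hypothetical helper, not part of the package.

```python
from math import sqrt
from statistics import NormalDist


def z_test_errors(mu0, mu1, sigma, sample_size, alpha=0.05):
    """Hypothetical helper: rejection bounds and Type II error for a two-tailed z-test."""
    se = sigma / sqrt(sample_size)      # standard error of the sample mean
    null_dist = NormalDist(mu0, se)     # sampling distribution under H0
    alt_dist = NormalDist(mu1, se)      # sampling distribution under H1
    lower = null_dist.inv_cdf(alpha / 2)
    upper = null_dist.inv_cdf(1 - alpha / 2)
    # beta: probability that the sample mean lands in the non-rejection
    # region [lower, upper] when H1 is actually true.
    beta = alt_dist.cdf(upper) - alt_dist.cdf(lower)
    return (lower, upper), beta


(lower, upper), beta = z_test_errors(mu0=100, mu1=105, sigma=15, sample_size=30, alpha=0.05)
print(f"Reject H0 if the sample mean is below {lower:.2f} or above {upper:.2f}")
print(f"beta = {beta:.3f}, power = {1 - beta:.3f}")
```

With these inputs the Type II error is a little over one half, which is why the two distributions overlap heavily in the z-test example.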


  • Required imports:
    from mds_2025_helper_functions.scores import compare_model_scores
    from mds_2025_helper_functions.eda import perform_eda
    from mds_2025_helper_functions.dataset_summary import dataset_summary
    from mds_2025_helper_functions.htv import htv
    from sklearn.datasets import load_iris, load_diabetes
    from sklearn.dummy import DummyRegressor, DummyClassifier
    from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
    import matplotlib.pyplot as plt
    import pandas as pd
    import warnings

Testing Commands

1. Run All Tests

pytest

2. Run a Specific Test File


3. Run a Specific Test Function


4. Run Tests with Verbose Output

pytest -v

5. Run Tests with Coverage (requires the pytest-cov plugin)

pytest --cov=.

Generate an HTML coverage report:

pytest --cov=. --cov-report=html

6. Run Tests in Parallel (Optional)

Run tests with 4 parallel workers (requires the pytest-xdist plugin):

pytest -n 4

7. Clear Pytest Cache

pytest --cache-clear
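
For readers new to pytest, the sketch below shows what a test file targeted by these commands might look like. Everything in it is hypothetical: the file name `tests/test_example.py`, the `mean` helper, and the test names are placeholders rather than files from this repository. If saved under that name, `pytest tests/test_example.py` would run the whole file and `pytest tests/test_example.py::test_mean_of_known_values` would run a single function, using pytest's `path::function` node ID syntax.

```python
# tests/test_example.py (hypothetical file name)


def mean(values):
    """Hypothetical function under test: arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def test_mean_of_known_values():
    # pytest collects any function whose name starts with "test_".
    assert mean([1, 2, 3]) == 2


def test_mean_handles_floats():
    assert abs(mean([0.5, 1.5]) - 1.0) < 1e-12
```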


Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.


Karlygash Zhakupbayeva, Samuel Adetsi, Xi Cu, Michael Hewlett


mds_2025_helper_functions was created by Karlygash Zhakupbayeva, Samuel Adetsi, Xi Cu, Michael Hewlett. It is licensed under the terms of the MIT license.


mds_2025_helper_functions was created with cookiecutter and the py-pkgs-cookiecutter template.