Merge pull request #60 from UBC-MDS/fix-add-docs
Fix add docs
Iskanu93 authored Feb 1, 2025
2 parents 3ba6456 + f384a35 commit 86f1f0c
Showing 4 changed files with 385 additions and 23 deletions.
270 changes: 269 additions & 1 deletion README.md
@@ -29,7 +29,275 @@ $ pip install mds_2025_helper_functions

## Usage

- TODO
# Function Documentation and Usage

## 1. `compare_model_scores`

### Description:
This function compares the mean cross-validation scores of multiple ML models and produces a summary table.

### Parameters:
- `*args` (BaseEstimator): Models to evaluate (e.g., `LogisticRegression`, `RandomForestClassifier`, etc.).
- `X` (array-like): Training dataset of features with shape `(n_samples, n_features)`.
- `y` (array-like, optional): Target values for supervised learning tasks.
- `scoring` (string or callable, optional): Evaluation metrics (e.g., `"accuracy"`). Refer to the [Scikit-learn scoring documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).
- `return_train_scores` (bool): Whether to include training scores in addition to test scores. Default is `False`.
- `**kwargs`: Additional arguments for `sklearn.model_selection.cross_validate`.

### Returns:
A `pandas.DataFrame` comparing the performance of the models.

### Example Usage:
```python
from mds_2025_helper_functions.scores import compare_model_scores
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Load sample dataset
data = load_iris()
X, y = data["data"], data["target"]

# Compare models
compare_model_scores(
    LogisticRegression(),
    DecisionTreeClassifier(),
    X=X,
    y=y,
    scoring="accuracy"
)
```
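
For intuition about what the summary table contains, the comparison can be approximated directly with scikit-learn. This is a minimal sketch, not the package's implementation: the helper name `sketch_compare_scores` and the layout of the resulting table are assumptions made for illustration, and only the use of `sklearn.model_selection.cross_validate` is taken from the parameter description above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier


def sketch_compare_scores(*models, X, y=None, scoring=None,
                          return_train_scores=False, **kwargs):
    """Hypothetical stand-in: mean cross-validation results, one row per model."""
    rows = {}
    for model in models:
        cv_results = cross_validate(
            model, X, y,
            scoring=scoring,
            return_train_score=return_train_scores,
            **kwargs,
        )
        # cross_validate returns a dict of arrays (fit_time, score_time,
        # test_score, ...); average each array across the folds.
        rows[type(model).__name__] = {
            key: values.mean() for key, values in cv_results.items()
        }
    return pd.DataFrame.from_dict(rows, orient="index")


X, y = load_iris(return_X_y=True)
table = sketch_compare_scores(
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
    X=X,
    y=y,
    scoring="accuracy",
)
print(table)
```

Rows are keyed by class name in this sketch, so passing two models of the same class would overwrite one row; a real implementation would need a different labelling scheme.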

---

## 2. `perform_eda`

### Description:
A one-stop Exploratory Data Analysis (EDA) function to generate data summaries, spot missing values, visualize feature distributions, and detect outliers.

### Parameters:
- `dataframe` (pd.DataFrame): The input dataset for analysis.
- `rows` (int): Number of rows in the grid layout for visualizations. Default is 5.
- `cols` (int): Number of columns in the grid layout for visualizations. Default is 2.

### Returns:
- Prints dataset statistics, missing values report, and an outlier summary.
- Generates plots and visualizations using Matplotlib and Seaborn.

### Example Usage:
```python
from mds_2025_helper_functions.eda import perform_eda
import pandas as pd

data = {
    'Age': [25, 32, 47, 51, 62],
    'Salary': [50000, 60000, 120000, 90000, 85000],
    'Department': ['HR', 'Finance', 'IT', 'Finance', 'HR'],
}
df = pd.DataFrame(data)

perform_eda(df, rows=2, cols=2)
```
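
The description above does not say which rule `perform_eda` uses to flag outliers. One common choice is the 1.5 × IQR rule, sketched below with pandas alone. The function name `iqr_outlier_counts` is hypothetical, and the rule itself is an assumption rather than a description of the package's behaviour.

```python
import pandas as pd


def iqr_outlier_counts(dataframe: pd.DataFrame) -> pd.Series:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in each numeric column."""
    numeric = dataframe.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return outside.sum()


df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 62],
    "Salary": [50000, 60000, 120000, 90000, 85000],
    "Department": ["HR", "Finance", "IT", "Finance", "HR"],
})
print(iqr_outlier_counts(df))
```

Non-numeric columns such as `Department` are skipped, since quartiles are not defined for them.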

---

## 3. `dataset_summary`

### Description:
Generates a summary of a dataset including missing values, feature types, duplicate rows, and other descriptive statistics.

### Parameters:
- `data` (pd.DataFrame): The dataset to summarize.

### Returns:
A dictionary containing:
- `'missing_values'`: DataFrame of missing value counts and percentages.
- `'feature_types'`: Counts of numerical and categorical features.
- `'duplicates'`: Number of duplicate rows.
- `'numerical_summary'`: Descriptive statistics for numerical columns.
- `'categorical_summary'`: Unique value counts for categorical features.

### Example Usage:
```python
from mds_2025_helper_functions.dataset_summary import dataset_summary
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', None],
    'Age': [25, 32, 47, None, 29],
    'Salary': [50000, 60000, 120000, None, 80000],
    'Department': ['HR', 'Finance', 'IT', 'HR', 'Finance']
}
df = pd.DataFrame(data)

summary = dataset_summary(df)
print(summary['missing_values'])
print(summary['numerical_summary'])
print(summary['categorical_summary'])
```
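
To make the five returned keys concrete, the following sketch builds a dictionary of the same shape with plain pandas calls. It is an illustration under stated assumptions, not the package's code: the function name `sketch_dataset_summary`, the column names of the missing-value table, and the choice to treat every non-numeric column as categorical are all hypothetical.

```python
import pandas as pd


def sketch_dataset_summary(data: pd.DataFrame) -> dict:
    """Hypothetical stand-in that builds the five documented keys."""
    missing_count = data.isna().sum()
    missing = pd.DataFrame({
        "column": data.columns,
        "missing_count": missing_count.to_numpy(),
        "missing_percentage": (missing_count / len(data) * 100).to_numpy(),
    })
    # Assumption: anything that is not numeric counts as categorical.
    numerical = data.select_dtypes(include="number")
    categorical = data.select_dtypes(exclude="number")
    return {
        "missing_values": missing,
        "feature_types": {
            "numerical_features": numerical.shape[1],
            "categorical_features": categorical.shape[1],
        },
        "duplicates": int(data.duplicated().sum()),
        "numerical_summary": numerical.describe(),
        "categorical_summary": categorical.nunique().to_frame("unique_values"),
    }


df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", None],
    "Age": [25, 32, 47, None, 29],
    "Salary": [50000, 60000, 120000, None, 80000],
    "Department": ["HR", "Finance", "IT", "HR", "Finance"],
})
summary = sketch_dataset_summary(df)
print(summary["missing_values"])
print(summary["feature_types"], summary["duplicates"])
```

`DataFrame.duplicated` only flags rows that match an earlier row in every column, so the two `Alice` rows here are not counted as duplicates.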

---

## 4. `htv`

### Description:
Visualizes Type I (α) and Type II (β) errors in hypothesis tests.

### Parameters:
- `test_output` (dict): Dictionary containing hypothesis test parameters:
  - `'mu0'` (float): Mean under the null hypothesis (H₀).
  - `'mu1'` (float): Mean under the alternative hypothesis (H₁).
  - `'sigma'` (float): Standard deviation.
  - `'sample_size'` (int): Sample size.
  - `'df'` (int, optional): Degrees of freedom, required for `'t'` or `'chi2'` tests.
  - `'df1'`, `'df2'` (int, optional): Degrees of freedom, required for `'anova'` (F-test).
- `test_type` (str): Type of test (`'z'`, `'t'`, `'chi2'`, or `'anova'`).
- `alpha` (float): Significance level for Type I error. Default is `0.05`.
- `tail` (str): `'one-tailed'` or `'two-tailed'`. Default is `'two-tailed'`.

### Returns:
- A tuple `(fig, ax)` holding the figure and axes of the visualization.

### Example Usage:
```python
from mds_2025_helper_functions.htv import htv
import matplotlib.pyplot as plt

test_params = {
    'mu0': 100,
    'mu1': 105,
    'sigma': 15,
    'sample_size': 30
}

fig, ax = htv(test_params, test_type="z", alpha=0.05, tail="two-tailed")
plt.show()
```

```python
test_params_t = {
    'mu0': 0,
    'mu1': 1.5,
    'sigma': 1,
    'sample_size': 25,
    'df': 24  # degrees of freedom (sample_size - 1), required for 't' tests
}

fig, ax = htv(test_params_t, test_type="t", alpha=0.01, tail="one-tailed")
plt.show()
```
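
The quantities that such a plot shades can also be computed by hand. The sketch below does this for the two-tailed z-test in the first example, using only the Python standard library. The function name `z_test_error_rates` is hypothetical and the code is independent of `htv`; it assumes the test statistic is the sample mean with a known standard deviation.

```python
from math import sqrt
from statistics import NormalDist


def z_test_error_rates(mu0, mu1, sigma, sample_size, alpha=0.05):
    """Return (lower, upper, beta, power) for a two-tailed z-test of the mean."""
    standard_error = sigma / sqrt(sample_size)
    null_dist = NormalDist(mu0, standard_error)
    alt_dist = NormalDist(mu1, standard_error)
    # Reject H0 when the sample mean falls outside [lower, upper];
    # the two rejection tails together have probability alpha under H0.
    lower = null_dist.inv_cdf(alpha / 2)
    upper = null_dist.inv_cdf(1 - alpha / 2)
    # Type II error: the sample mean lands inside [lower, upper] under H1.
    beta = alt_dist.cdf(upper) - alt_dist.cdf(lower)
    return lower, upper, beta, 1 - beta


lower, upper, beta, power = z_test_error_rates(100, 105, 15, 30)
print(f"reject H0 outside [{lower:.2f}, {upper:.2f}]")
print(f"beta = {beta:.3f}, power = {power:.3f}")
```

Increasing `sample_size` shrinks the standard error, which narrows both curves and lowers β for the same α.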

---

### Notes:
- Required imports:
```python
from mds_2025_helper_functions.scores import compare_model_scores
from mds_2025_helper_functions.eda import perform_eda
from mds_2025_helper_functions.dataset_summary import dataset_summary
from mds_2025_helper_functions.htv import htv
from sklearn.datasets import load_iris, load_diabetes
from sklearn.dummy import DummyRegressor, DummyClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
```


## Testing Commands

### 1. Run All Tests
```bash
pytest
```

---

### 2. Run a Specific Test File
```bash
pytest test_dataset_summary.py
pytest test_eda.py
pytest test_htv.py
pytest test_scores.py
```

---

### 3. Run a Specific Test Function
```bash
pytest test_dataset_summary.py::test_function_name
```

---

### 4. Run Tests with Verbose Output
```bash
pytest -v
```

---

### 5. Run Tests with Coverage
Coverage reporting requires the `pytest-cov` plugin:
```bash
pytest --cov=.
```

Generate an HTML coverage report:
```bash
pytest --cov=. --cov-report=html
```

---

### 6. Run Tests in Parallel (Optional)
Run tests with 4 parallel workers (requires the `pytest-xdist` plugin):
```bash
pytest -n 4
```

---

### 7. Clear Pytest Cache
```bash
pytest --cache-clear
```

## Contributing

52 changes: 45 additions & 7 deletions src/mds_2025_helper_functions/dataset_summary.py
@@ -1,5 +1,6 @@
import pandas as pd


def dataset_summary(data):
    """
    Generates a comprehensive summary of a dataset.
@@ -20,17 +21,17 @@ def dataset_summary(data):
    -------
    dict
        A dictionary containing the following keys:
        - 'missing_values' (pd.DataFrame):
            Summary of missing values, including counts and percentages for each column.
        - 'feature_types' (dict):
            Counts of numerical and categorical features in the dataset.
            Format: {'numerical_features': int, 'categorical_features': int}.
        - 'duplicates' (int):
            The number of duplicate rows in the dataset.
        - 'numerical_summary' (pd.DataFrame):
            Descriptive statistics for numerical columns.
        - 'categorical_summary' (pd.DataFrame):
            Unique value counts for categorical columns.

    Raises
@@ -39,8 +40,45 @@ def dataset_summary(data):
        If the input is not a pandas DataFrame.
    ValueError
        If the DataFrame is empty or contains unsupported data types.

    Example
    -------
    >>> import pandas as pd
    >>> from mds_2025_helper_functions.dataset_summary import dataset_summary
    >>>
    >>> # Example dataset
    >>> data = {
    ...     'Name': ['Alice', 'Bob', 'Charlie', 'Alice', None],
    ...     'Age': [25, 32, 47, None, 29],
    ...     'Salary': [50000, 60000, 120000, None, 80000],
    ...     'Department': ['HR', 'Finance', 'IT', 'HR', 'Finance']
    ... }
    >>> df = pd.DataFrame(data)
    >>> # Generate summary
    >>> summary = dataset_summary(df)
    >>> # Access individual components of the summary
    >>> print(summary['missing_values'])       # Missing values per column
    >>> print(summary['feature_types'])        # Count of numerical and categorical features
    >>> print(summary['duplicates'])           # Number of duplicate rows
    >>> print(summary['numerical_summary'])    # Descriptive statistics for numerical columns
    >>> print(summary['categorical_summary'])  # Unique values for categorical columns
    >>> # Interpretation for this example:
    >>> # 'missing_values' contains:
    >>> #        column  missing_count  missing_percentage
    >>> # 0        Name              1                20.0
    >>> # 1         Age              1                20.0
    >>> # 2      Salary              1                20.0
    >>> # 3  Department              0                 0.0
    >>> # 'feature_types' looks like:
    >>> # {'numerical_features': 2, 'categorical_features': 2}
    >>> # 'duplicates':
    >>> # 0 (no two rows in this example are fully identical)
    """

    # Check input type
    if not isinstance(data, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame")