Merge pull request #60 from UBC-MDS/fix-add-docs

Fix add docs
UBC-MDS · Feb 1, 2025 · 86f1f0c
2 parents 3ba6456 + f384a35

Showing 4 changed files with 385 additions and 23 deletions.
diff --git a/README.md b/README.md
@@ -29,7 +29,275 @@ $ pip install mds_2025_helper_functions
 
 ## Usage
 
-- TODO
+# Function Documentation and Usage
+
+## 1. `compare_model_scores`
+
+### Description:
+This function compares the mean cross-validation scores of multiple ML models and produces a summary table.
+
+### Parameters:
+- `*args` (BaseEstimator): Models to evaluate (e.g., `LogisticRegression`, `RandomForestClassifier`, etc.).
+- `X` (array-like): Training dataset of features with shape `(n_samples, n_features)`.
+- `y` (array-like, optional): Target values for supervised learning tasks.
+- `scoring` (string or callable, optional): Evaluation metrics (e.g., `"accuracy"`). Refer to the [Scikit-learn scoring documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).
+- `return_train_scores` (bool): Whether to include training scores in addition to test scores. Default is `False`.
+- `**kwargs`: Additional arguments for `sklearn.model_selection.cross_validate`.
+
+### Returns:
+A `pandas.DataFrame` comparing the performance of the models.
+
+### Example Usage:
+```python
+from mds_2025_helper_functions.scores import compare_model_scores
+from sklearn.datasets import load_iris
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.linear_model import LogisticRegression
+
+# Load sample dataset
+data = load_iris()
+X, y = data["data"], data["target"]
+
+# Compare models
+compare_model_scores(
+    LogisticRegression(),
+    DecisionTreeClassifier(),
+    X=X,
+    y=y,
+    scoring="accuracy"
+)
+```
+
+---
+
+## 2. `perform_eda`
+
+### Description:
+A one-stop Exploratory Data Analysis (EDA) function to generate data summaries, spot missing values, visualize feature distributions, and detect outliers.
+
+### Parameters:
+- `dataframe` (pd.DataFrame): The input dataset for analysis.
+- `rows` (int): Number of rows in the grid layout for visualizations. Default is 5.
+- `cols` (int): Number of columns in the grid layout for visualizations. Default is 2.
+
+### Returns:
+- Prints dataset statistics, missing values report, and an outlier summary.
+- Generates plots and visualizations using Matplotlib and Seaborn.
+
+### Example Usage:
+```python
+from mds_2025_helper_functions.eda import perform_eda
+import pandas as pd
+
+data = {
+    'Age': [25, 32, 47, 51, 62],
+    'Salary': [50000, 60000, 120000, 90000, 85000],
+    'Department': ['HR', 'Finance', 'IT', 'Finance', 'HR'],
+}
+df = pd.DataFrame(data)
+
+perform_eda(df, rows=2, cols=2)
+```
+
+---
+
+## 3. `dataset_summary`
+
+### Description:
+Generates a summary of a dataset including missing values, feature types, duplicate rows, and other descriptive statistics.
+
+### Parameters:
+- `data` (pd.DataFrame): The dataset to summarize.
+
+### Returns:
+A dictionary containing:
+- `'missing_values'`: DataFrame of missing value counts and percentages.
+- `'feature_types'`: Counts of numerical and categorical features.
+- `'duplicates'`: Number of duplicate rows.
+- `'numerical_summary'`: Descriptive statistics for numerical columns.
+- `'categorical_summary'`: Unique value counts for categorical features.
+
+### Example Usage:
+```python
+from mds_2025_helper_functions.dataset_summary import dataset_summary
+import pandas as pd
+
+data = {
+    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', None],
+    'Age': [25, 32, 47, None, 29],
+    'Salary': [50000, 60000, 120000, None, 80000],
+    'Department': ['HR', 'Finance', 'IT', 'HR', 'Finance']
+}
+df = pd.DataFrame(data)
+
+summary = dataset_summary(df)
+print(summary['missing_values'])
+print(summary['numerical_summary'])
+print(summary['categorical_summary'])
+```
+
+---
+
+## 4. `htv`
+
+### Description:
+Visualizes Type I (α) and Type II (β) errors in hypothesis tests.
+
+### Parameters:
+- `test_output` (dict): Dictionary containing hypothesis test parameters:
+  - `'mu0'` (float): Mean under the null hypothesis (H₀).
+  - `'mu1'` (float): Mean under the alternative hypothesis (H₁).
+  - `'sigma'` (float): Standard deviation.
+  - `'sample_size'` (int): Sample size.
+  - `'df'` (int, optional): Degrees of freedom, required for `'t'` or `'chi2'` tests.
+  - `'df1'`, `'df2'` (int, optional): Numerator and denominator degrees of freedom, required for F-tests (`'anova'`).
+- `test_type` (str): Type of test (`'z'`, `'t'`, `'chi2'`, or `'anova'`).
+- `alpha` (float): Significance level for Type I error. Default is `0.05`.
+- `tail` (str): `'one-tailed'` or `'two-tailed'`. Default is `'two-tailed'`.
+
+### Returns:
+- A tuple `(fig, ax)`: the Matplotlib figure and axes that hold the visualization.
+
+### Example Usage:
+```python
+from mds_2025_helper_functions.htv import htv
+import matplotlib.pyplot as plt
+
+test_params = {
+    'mu0': 100,
+    'mu1': 105,
+    'sigma': 15,
+    'sample_size': 30
+}
+
+fig, ax = htv(test_params, test_type="z", alpha=0.05, tail="two-tailed")
+plt.show()
+```
+
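+```python
+# Reuses the `htv` and `plt` imports from the example above
+test_params_t = {
+    'mu0': 0,
+    'mu1': 1.5,
+    'sigma': 1,
+    'sample_size': 25,
+    'df': 24  # degrees of freedom (sample_size - 1), listed above as required for 't' tests
+}
+
+fig, ax = htv(test_params_t, test_type="t", alpha=0.01, tail="one-tailed")
+plt.show()
+```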
+
+---
+
+### Notes:
+- Required imports:
+  ```python
+  from mds_2025_helper_functions.scores import compare_model_scores
+  from mds_2025_helper_functions.eda import perform_eda
+  from mds_2025_helper_functions.dataset_summary import dataset_summary
+  from mds_2025_helper_functions.htv import htv
+  from sklearn.datasets import load_iris, load_diabetes
+  from sklearn.dummy import DummyRegressor, DummyClassifier
+  from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
+  import matplotlib.pyplot as plt
+  import pandas as pd
+  import warnings
+  warnings.filterwarnings('ignore')
+  ```
+
+
+## Testing Commands
+
+### 1. Run All Tests
+```bash
+pytest
+```
+
+---
+
+### 2. Run a Specific Test File
+```bash
+pytest test_dataset_summary.py
+pytest test_eda.py
+pytest test_htv.py
+pytest test_scores.py
+```
+
+---
+
+### 3. Run a Specific Test Function
+```bash
+pytest test_dataset_summary.py::test_function_name
+```
+
+---
+
+### 4. Run Tests with Verbose Output
+```bash
+pytest -v
+```
+
+---
+
+### 5. Run Tests with Coverage
+```bash
+pytest --cov=.
+```
+
+Generate an HTML coverage report (the `--cov` options require the `pytest-cov` plugin):
+```bash
+pytest --cov=. --cov-report=html
+```
+
+---
+
+### 6. Run Tests in Parallel (Optional)
+Run tests with 4 parallel workers (requires the `pytest-xdist` plugin):
+```bash
+pytest -n 4
+```
+
+---
+
+### 7. Clear Pytest Cache
+```bash
+pytest --cache-clear
+```
 
 ## Contributing
 

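The Type I (α) and Type II (β) errors described in the `htv` section of the README can also be computed numerically for the z-test example (`mu0=100`, `mu1=105`, `sigma=15`, `sample_size=30`). The following is a minimal sketch that uses only the standard library. It is not the package's implementation: `z_test_errors` is a hypothetical helper, and it assumes a two-tailed test on the sample mean with standard error `sigma / sqrt(sample_size)`.

```python
from math import sqrt
from statistics import NormalDist


def z_test_errors(mu0, mu1, sigma, sample_size, alpha=0.05):
    """Return (lower_crit, upper_crit, beta) for a two-tailed z-test.

    Hypothetical helper for illustration; not part of mds_2025_helper_functions.
    """
    se = sigma / sqrt(sample_size)       # standard error of the sample mean
    null_dist = NormalDist(mu0, se)      # sampling distribution under H0
    alt_dist = NormalDist(mu1, se)       # sampling distribution under H1

    # Critical values: alpha is split evenly between the two tails of H0.
    lower_crit = null_dist.inv_cdf(alpha / 2)
    upper_crit = null_dist.inv_cdf(1 - alpha / 2)

    # Type II error: probability, under H1, that the sample mean still
    # lands between the critical values (so H0 is not rejected).
    beta = alt_dist.cdf(upper_crit) - alt_dist.cdf(lower_crit)
    return lower_crit, upper_crit, beta


lower, upper, beta = z_test_errors(100, 105, 15, 30, alpha=0.05)
print(f"reject H0 outside [{lower:.2f}, {upper:.2f}]")
print(f"beta = {beta:.3f}, power = {1 - beta:.3f}")
```

With these parameters β comes out a little above one half, so at this sample size the test would miss a 5-point shift in the mean more often than it detects it.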
diff --git a/src/mds_2025_helper_functions/dataset_summary.py b/src/mds_2025_helper_functions/dataset_summary.py
@@ -1,5 +1,6 @@
 import pandas as pd
 
+
 def dataset_summary(data):
     """
     Generates a comprehensive summary of a dataset.
@@ -20,17 +21,17 @@ def dataset_summary(data):
     -------
     dict
         A dictionary containing the following keys:
-        
-        - 'missing_values' (pd.DataFrame): 
+
+        - 'missing_values' (pd.DataFrame):
             Summary of missing values, including counts and percentages for each column.
-        - 'feature_types' (dict): 
+        - 'feature_types' (dict):
             Counts of numerical and categorical features in the dataset.
             Format: {'numerical_features': int, 'categorical_features': int}.
-        - 'duplicates' (int): 
+        - 'duplicates' (int):
             The number of duplicate rows in the dataset.
-        - 'numerical_summary' (pd.DataFrame): 
+        - 'numerical_summary' (pd.DataFrame):
             Descriptive statistics for numerical columns.
-        - 'categorical_summary' (pd.DataFrame): 
+        - 'categorical_summary' (pd.DataFrame):
             Unique value counts for categorical columns.
 
     Raises
@@ -39,8 +40,45 @@ def dataset_summary(data):
         If the input is not a pandas DataFrame.
     ValueError
         If the DataFrame is empty or contains unsupported data types.
+
+    Examples
+    --------
+    >>> import pandas as pd
+    >>> from mds_2025_helper_functions.dataset_summary import dataset_summary
+    >>>
+    >>> # Example dataset
+    >>> data = {
+    ...     'Name': ['Alice', 'Bob', 'Charlie', 'Alice', None],
+    ...     'Age': [25, 32, 47, None, 29],
+    ...     'Salary': [50000, 60000, 120000, None, 80000],
+    ...     'Department': ['HR', 'Finance', 'IT', 'HR', 'Finance']
+    ... }
+    >>> df = pd.DataFrame(data)
+
+    >>> # Generate summary
+    >>> summary = dataset_summary(df)
+
+    >>> # Access individual components of the summary
+    >>> print(summary['missing_values'])  # Missing values per column
+    >>> print(summary['feature_types'])   # Count of numerical and categorical features
+    >>> print(summary['duplicates'])      # Number of duplicate rows
+    >>> print(summary['numerical_summary'])  # Descriptive statistics for numerical columns
+    >>> print(summary['categorical_summary'])  # Unique values for categorical columns
+
+    >>> # For this data, the summary contains:
+    >>> # 'missing_values':
+    >>> #        column  missing_count  missing_percentage
+    >>> # 0        Name              1                20.0
+    >>> # 1         Age              1                20.0
+    >>> # 2      Salary              1                20.0
+    >>> # 3  Department              0                 0.0
+    >>> #
+    >>> # 'feature_types':
+    >>> # {'numerical_features': 2, 'categorical_features': 2}
+    >>> #
+    >>> # 'duplicates':
+    >>> # 0 (no row repeats another in every column; the two 'Alice' rows differ in Age and Salary)
     """
-
     # Check input type
     if not isinstance(data, pd.DataFrame):
         raise TypeError("Input must be a pandas DataFrame")
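
The missing-value and duplicate figures for the docstring's example data can be checked by hand, without installing the package. The sketch below is plain Python rather than the package's code: `missing_report` and `count_duplicate_rows` are hypothetical helpers, and it assumes that a "duplicate" means a row identical to an earlier row in every column (the default behavior of `DataFrame.duplicated`).

```python
from collections import Counter

# Same records as the docstring example, one tuple per row:
# (Name, Age, Salary, Department)
columns = ["Name", "Age", "Salary", "Department"]
rows = [
    ("Alice", 25, 50000, "HR"),
    ("Bob", 32, 60000, "Finance"),
    ("Charlie", 47, 120000, "IT"),
    ("Alice", None, None, "HR"),
    (None, 29, 80000, "Finance"),
]


def missing_report(rows, columns):
    """Return {column: (missing_count, missing_percentage)}."""
    report = {}
    for idx, name in enumerate(columns):
        missing = sum(1 for row in rows if row[idx] is None)
        report[name] = (missing, 100.0 * missing / len(rows))
    return report


def count_duplicate_rows(rows):
    """Count rows that repeat an earlier row in every column."""
    counts = Counter(rows)
    return sum(n - 1 for n in counts.values())


report = missing_report(rows, columns)
n_duplicates = count_duplicate_rows(rows)
print(report)
print(n_duplicates)
```

Under that definition this data has no duplicate rows: the two `'Alice'` rows share a name and department but differ in `Age` and `Salary`.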