Commit 96dd28a

Merge pull request #27 from satra/fix-group
enh: add group variable, regression test, update README
2 parents f283288 + 5819e3e

8 files changed: +199 −63 lines changed

README.md

Lines changed: 55 additions & 21 deletions
@@ -29,22 +29,22 @@ pip install pydra-ml
 
 This repo installs `pydraml` a CLI to allow usage without any programming.
 
-To test the CLI for a classification example, copy the `pydra_ml/tests/data/breast_cancer.csv` and
+To test the CLI for a classification example, copy the `pydra_ml/tests/data/breast_cancer.csv` and
 `short-spec.json.sample` to a folder and run.
 
 ```
 $ pydraml -s short-spec.json.sample
 ```
-To check a regression example, copy `pydra_ml/tests/data/diabetes_table.csv` and `diabetes_spec.json`
-to a folder and run.
+To check a regression example, copy `pydra_ml/tests/data/diabetes_table.csv` and
+`diabetes_spec.json` to a folder and run.
 
 ```
 $ pydraml -s diabetes_spec.json
 ```
 
-For each case pydra-ml will generate a result folder with the spec file name that includes
-`test-{metric}-{timestamp}.png` file for each metric together with a pickled results file
-containing all the scores from the model evaluations.
+For each case pydra-ml will generate a result folder with the spec file name that
+includes `test-{metric}-{timestamp}.png` file for each metric together with a
+pickled results file containing all the scores from the model evaluations.
 
 ```
 $ pydraml --help
@@ -82,14 +82,17 @@ will want to generate `x_indices` programmatically.
   group.
 - *x_indices*: Numeric (0-based) or string list of columns to use as input features
 - *target_vars*: String list of target variable (at present only one is supported)
+- *group_var*: String to indicate column to use for grouping
 - *n_splits*: Number of shuffle split iterations to use
 - *test_size*: Fraction of data to use for test set in each iteration
 - *clf_info*: List of scikit-learn classifiers to use.
 - *permute*: List of booleans to indicate whether to generate a null model or not
 - *gen_shap*: Boolean indicating whether shap values are generated
 - *nsamples*: Number of samples to use for shap estimation
 - *l1_reg*: Type of regularizer to use for shap estimation
-- *plot_top_n_shap*: Number or proportion of top SHAP values to plot (e.g., 16 or 0.1 for top 10%). Set to 1.0 (float) to plot all features or 1 (int) to plot top first feature.
+- *plot_top_n_shap*: Number or proportion of top SHAP values to plot (e.g., 16
+  or 0.1 for top 10%). Set to 1.0 (float) to plot all features or 1 (int) to plot
+  top first feature.
 - *metrics*: scikit-learn metric to use
 
 ## `clf_info` specification
@@ -113,6 +116,7 @@ then an empty dictionary **MUST** be provided as parameter 3.
 "x_indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
               18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 "target_vars": ["target"],
+"group_var": null,
 "n_splits": 100,
 "test_size": 0.2,
 "clf_info": [
@@ -140,25 +144,51 @@ then an empty dictionary **MUST** be provided as parameter 3.
 
 ## Output:
 The workflow will output:
-- `results-{timestamp}.pkl` containing 1 list per model used. For example, if assigned to variable `results`, it is accessed through `results[0]` to `results[N]`
-(if `permute: [false,true]` then it will output the model trained on the labels first `results[0]` and the model trained on permuted labels second `results[1]`.
+- `results-{timestamp}.pkl` containing 1 list per model used. For example, if
+  assigned to variable `results`, it is accessed through `results[0]` to `results[N]`
+  (if `permute: [false,true]` then it will output the model trained on the labels
+  first `results[0]` and the model trained on permuted labels second `results[1]`).
 Each model contains:
-- `dict` accesed through `results[0][0]` with model information: `{'ml_wf.clf_info': ['sklearn.neural_network', 'MLPClassifier', {'alpha': 1, 'max_iter': 1000}], 'ml_wf.permute': False}`
-- `pydra Result obj` accesed through `results[0][1]` with attribute `output` which itself has attributes:
+- `dict` accessed through `results[0][0]` with model information:
+  `{'ml_wf.clf_info': ['sklearn.neural_network', 'MLPClassifier',
+  {'alpha': 1, 'max_iter': 1000}], 'ml_wf.permute': False}`
+- `pydra Result obj` accessed through `results[0][1]` with attribute `output`
+  which itself has attributes:
   - `feature_names`: from the columns of the data csv.
   And the following attributes organized in N lists for N bootstrapping samples:
   - `output`: N lists, each one with two lists for true and predicted labels.
   - `score`: N lists each one containing M different metric scores.
-- `shaps`: N lists each one with a list of shape (P,F) where P is the amount of predictions and F the different SHAP values for each feature. `shaps` is empty if `gen_shap` is set to `false` or if `permute` is set to true.
-- One figure per metric with performance distribution across splits (with or without null distribution trained on permuted labels)
+  - `shaps`: N lists each one with a list of shape (P,F) where P is the
+    number of predictions and F the different SHAP values for each feature.
+    `shaps` is empty if `gen_shap` is set to `false` or if `permute` is set
+    to true.
+- One figure per metric with performance distribution across splits (with or
+  without null distribution trained on permuted labels)
+- One figure per any metric with the word `score` in it reporting the results of
+  a Wilcoxon signed rank test. The figure reports one-sided stats values as the
+  color of each cell and the corresponding `-log10(pvalue)` as the annotation.
+  Higher numbers indicate stronger effect (color) and lower p-values (annotation).
+  The actual numeric values are stored in a correspondingly named pkl file.
 - `shap-{timestamp}` dir
   - SHAP values are computed for each prediction in each split's test set
-(e.g., 30 bootstrapping splits with 100 prediction will create (30,100) array). The mean is taken across predictions for each split (e.g., resulting in a (64,30) array for 64 features and 30 bootstrapping samples).
-- For binary classification, a more accurate display of feature importance obtained by splitting predictions into TP, TN, FP, and FN,
-which in turn can allow for error auditing (i.e., what a model pays attention to when making incorrect/false predictions)
-- `quadrant_indexes.pkl`: The TP, TN, FP, FN indexes are saved in as a `dict` with one `key` per model (permuted models without SHAP values will be skipped automatically), and each key `values` being a bootstrapping split.
-- `summary_values_shap_{model_name}_{prediction_type}.csv` contains all SHAP values and summary statistics ranked by the mean SHAP value across bootstrapping splits. A sample_n column can be empty or NaN if this split did not have the type of prediction in the filename (e.g., you may not have FNs or FPs in a given split with high performance).
-- `summary_shap_{model_name}_{plot_top_n_shap}.png` contains SHAP value summary statistics for all features (set to 1.0) or only the top N most important features for better visualization.
+    (e.g., 30 bootstrapping splits with 100 predictions will create a (30,100)
+    array). The mean is taken across predictions for each split (e.g., resulting
+    in a (64,30) array for 64 features and 30 bootstrapping samples).
+  - For binary classification, a more accurate display of feature importance
+    obtained by splitting predictions into TP, TN, FP, and FN, which in turn can
+    allow for error auditing (i.e., what a model pays attention to when making
+    incorrect/false predictions)
+  - `quadrant_indexes.pkl`: The TP, TN, FP, FN indexes are saved as a
+    `dict` with one `key` per model (permuted models without SHAP values will
+    be skipped automatically), and each key's `values` being a bootstrapping split.
+  - `summary_values_shap_{model_name}_{prediction_type}.csv` contains all
+    SHAP values and summary statistics ranked by the mean SHAP value across
+    bootstrapping splits. A sample_n column can be empty or NaN if this split
+    did not have the type of prediction in the filename (e.g., you may not
+    have FNs or FPs in a given split with high performance).
+  - `summary_shap_{model_name}_{plot_top_n_shap}.png` contains SHAP value
+    summary statistics for all features (set to 1.0) or only the top N most
+    important features for better visualization.
 
 
 ## Developer installation
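As a quick illustration of the structure described above, a sketch of inspecting a results pickle (the timestamped filename here is hypothetical):

```python
import pickle

with open("results-20200101T000000.pkl", "rb") as f:
    results = pickle.load(f)

model_info, result = results[0]  # first model from the spec's clf_info list
print(model_info)  # e.g. {'ml_wf.clf_info': [...], 'ml_wf.permute': False}

out = result.output
print(out.feature_names)        # columns of the data csv
print(len(out.score))           # N bootstrapping splits, each with M metric scores
y_true, y_pred = out.output[0]  # true and predicted labels of the first split
```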
@@ -171,10 +201,14 @@ cd pydra-ml
 pip install -e .[dev]
 ```
 
-It is also useful to install pre-commit:
+It is also useful to install pre-commit, which takes care of styling when
+committing code. When pre-commit is used you may have to run git commit twice,
+since pre-commit may make additional changes to your code for styling and will
+not commit these changes by default:
+
 ```
 pip install pre-commit
-pre-commit
+pre-commit install
 ```
 
 ### Project structure

diabetes_spec.json

Lines changed: 3 additions & 2 deletions
@@ -1,6 +1,7 @@
-{"filename": "./diabetes_table.csv",
+{"filename": "diabetes_table.csv",
 "x_indices": [0,1,2,3,4,5,6,7,8,9],
 "target_vars": ["target"],
+"group_var": null,
 "n_splits": 4,
 "test_size": 0.2,
 "clf_info": [
@@ -14,4 +15,4 @@
 "l1_reg": "aic",
 "plot_top_n_shap": 10,
 "metrics":["explained_variance_score","mean_squared_error","mean_absolute_error"]
-}
+}

long-spec.json.sample

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
 "x_indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
               18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 "target_vars": ["target"],
+"group_var": null,
 "n_splits": 100,
 "test_size": 0.2,
 "clf_info": [

pydra_ml/classifier.py

Lines changed: 1 addition & 0 deletions
@@ -55,6 +55,7 @@ def gen_workflow(inputs, cache_dir=None, cache_locations=None):
             filename=wf.lzin.filename,
             x_indices=wf.lzin.x_indices,
             target_vars=wf.lzin.target_vars,
+            group=wf.lzin.group_var,
         )
     )
     wf.add(
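This wires the spec's `group_var` through to `read_file` (see `pydra_ml/tasks.py` below), which returns per-row group labels. The point of such labels, sketched here with scikit-learn's `GroupShuffleSplit` (assuming a grouped splitter of this kind sits downstream), is that rows sharing a group id never straddle a train/test boundary:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array(["s1", "s1", "s2", "s2", "s3", "s3"])  # e.g. subject ids

gss = GroupShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for train_idx, test_idx in gss.split(X, y, groups=groups):
    # no subject appears on both sides of a split
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```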

pydra_ml/report.py

Lines changed: 84 additions & 19 deletions
@@ -4,7 +4,8 @@
 import pickle
 import pandas as pd
 import numpy as np
-from sklearn.metrics import accuracy_score, explained_variance_score
+from sklearn.metrics import explained_variance_score
+from scipy.stats import wilcoxon
 import seaborn as sns
 import matplotlib.pyplot as plt
 
@@ -42,7 +43,6 @@ def plot_summary(summary, output_dir=None, filename="shap_plot", plot_top_n_shap
     plt.tight_layout()
     plt.show(block=False)
     plt.savefig(output_dir + f"summary_{filename}.png", dpi=100)
-    return
 
 
 def shaps_to_summary(
@@ -73,7 +73,6 @@ def shaps_to_summary(
         filename=filename,
         plot_top_n_shap=plot_top_n_shap,
     )
-    return
 
 
 def gen_report_shap_class(results, output_dir="./", plot_top_n_shap=16):
@@ -112,7 +111,7 @@ def gen_report_shap_class(results, output_dir="./", plot_top_n_shap=16):
         shaps_i = shaps[split_i]  # all shap values for this bootstrapping split
         y_true = y_true_and_preds[split_i][0]
         y_pred = y_true_and_preds[split_i][1]
-        #split_performance = accuracy_score(y_true, y_pred)
+        # split_performance = accuracy_score(y_true, y_pred)
         split_performance = explained_variance_score(y_true, y_pred)
 
         # split prediction indexes into TP, TN, FP, FN, good for error auditing
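The error-auditing split mentioned in the comment above amounts to partitioning test indices by confusion quadrant; a self-contained sketch of the idea (binary labels assumed):

```python
import numpy as np

y_true = np.array([1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1])

quadrants = {
    "TP": np.where((y_true == 1) & (y_pred == 1))[0],
    "TN": np.where((y_true == 0) & (y_pred == 0))[0],
    "FP": np.where((y_true == 0) & (y_pred == 1))[0],
    "FN": np.where((y_true == 1) & (y_pred == 0))[0],
}
print(quadrants)  # indices per quadrant, e.g. TP -> [0, 4]
```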
@@ -166,7 +165,7 @@ def gen_report_shap_class(results, output_dir="./", plot_top_n_shap=16):
            plot_top_n_shap=plot_top_n_shap,
        )
    save_obj(indexes_all, shap_dir + "indexes_quadrant.pkl")
-    return
+
 
 def gen_report_shap_regres(results, output_dir="./", plot_top_n_shap=16):
     # Create shap_dir
@@ -197,7 +196,7 @@
             "lp": [],
             "lm": [],
             "um": [],
-            "up": []
+            "up": [],
         }  # this is key with shape (F, N) where F is feature_names, N is mean shap values across splits
         # Obtain values for each bootstrapping split, then append summary statistics to shaps_n_splits
         for split_i in range(n_splits):
@@ -208,8 +207,8 @@
 
         # split prediction indexes into upper, median, lower, good for error auditing
         indexes = {"lp": [], "lm": [], "um": [], "up": []}
-        q=np.array([25,50,75])
-        prc=np.percentile(y_true,q)
+        q = np.array([25, 50, 75])
+        prc = np.percentile(y_true, q)
         for i in range(len(y_true)):
             if prc[0] >= y_pred[i]:
                 indexes["lp"].append(i)
@@ -259,7 +258,45 @@
             plot_top_n_shap=plot_top_n_shap,
         )
     save_obj(indexes_all, shap_dir + "indexes_quadrant.pkl")
-    return
+
+
+def compute_pairwise_stats(df):
+    """Run Wilcoxon signed rank tests across pairs of classifiers.
+
+    When comparing a classifier to itself, compare to its null distribution.
+    A one sided test is used.
+
+    Assumes that the dataframe has three keys: Classifier, type, and score
+    with type referring to either the data distribution or the null distribution
+
+    """
+    N = len(df.Classifier.unique())
+    effects = np.zeros((N, N)) * np.nan
+    pvalues = np.zeros((N, N)) * np.nan
+    for idx1, group1 in enumerate(df.groupby("Classifier")):
+        filter = group1[1].apply(lambda x: x.type == "data", axis=1).values
+        group1df = group1[1].iloc[filter, :]
+        filter = group1[1].apply(lambda x: x.type == "null", axis=1).values
+        group1nulldf = group1[1].iloc[filter, :]
+        for idx2, group2 in enumerate(df.groupby("Classifier")):
+            filter = group2[1].apply(lambda x: x.type == "data", axis=1).values
+            group2df = group2[1].iloc[filter, :]
+            if group1[0] != group2[0]:
+                stat, pval = wilcoxon(
+                    group1df["score"].values,
+                    group2df["score"].values,
+                    alternative="greater",
+                )
+            else:
+                stat, pval = wilcoxon(
+                    group1df["score"].values,
+                    group1nulldf["score"].values,
+                    alternative="greater",
+                )
+            effects[idx1, idx2] = stat
+            pvalues[idx1, idx2] = pval
+    return effects, pvalues
 
 
 def gen_report(
     results, prefix, metrics, gen_shap=True, output_dir="./", plot_top_n_shap=16
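A hedged usage sketch for `compute_pairwise_stats`: it expects a long-format frame with `Classifier`, `type` ("data" or "null"), and `score` columns, with scores aligned across classifiers so the signed-rank pairing is meaningful (the toy data below is fabricated). For reading the resulting heatmap, note that -log10(0.05) ≈ 1.3, so annotated cells of 2 or more correspond to p < 0.01:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
rows = []
for clf in ["MLPClassifier", "SVC"]:  # hypothetical classifier names
    for kind, loc in [("data", 0.8), ("null", 0.5)]:
        for score in rng.normal(loc, 0.05, size=30):  # 30 aligned splits
            rows.append({"Classifier": clf, "type": kind, "score": score})
df = pd.DataFrame(rows)

effects, pvalues = compute_pairwise_stats(df)
print(np.fix(-np.log10(pvalues)))  # the annotations drawn on the stats heatmap
```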
@@ -284,6 +321,7 @@
             },
             ignore_index=True,
         )
+    order = [group[0] for group in df.groupby("Classifier")]
     for name, subdf in df.groupby("metric"):
         sns.set(style="whitegrid", palette="pastel", color_codes=True)
         sns.set_context("talk")
@@ -296,7 +334,7 @@
             split=True,
             inner="quartile",
             hue_order=["data", "null"],
-            order=[group[0] for group in df.groupby("Classifier")],
+            order=order,
         )
         ax.set_ylabel(name)
         sns.despine(left=True)
@@ -306,16 +344,43 @@
         timestamp = timestamp.replace(":", "").replace("-", "")
         plt.savefig(f"test-{name}-{timestamp}.png")
 
+        # Create comparison stats table if the metric is a score
+        if "score" in name:
+            effects, pvalues, = compute_pairwise_stats(subdf)
+            plt.figure(figsize=(8, 8))
+            ax = sns.heatmap(
+                effects,
+                annot=np.fix(-np.log10(pvalues)),
+                yticklabels=order,
+                xticklabels=order,
+                cbar=True,
+                square=True,
+            )
+            ax.xaxis.set_ticks_position("top")
+            plt.savefig(f"stats-{name}-{timestamp}.png")
+            save_obj(
+                dict(effects=effects, pvalues=pvalues, order=order),
+                f"stats-{name}-{timestamp}.pkl",
+            )
+
     # create SHAP summary csv and figures
     if gen_shap:
-        reg_metrics=["explained_variance_score","max_error",
-                     "mean_absolute_error","mean_squared_error",
-                     "mean_squared_log_error","median_absolute_error",
-                     "r2_score","mean_poisson_deviance",
-                     "mean_gamma_deviance"
-                     ]
+        reg_metrics = [
+            "explained_variance_score",
+            "max_error",
+            "mean_absolute_error",
+            "mean_squared_error",
+            "mean_squared_log_error",
+            "median_absolute_error",
+            "r2_score",
+            "mean_poisson_deviance",
+            "mean_gamma_deviance",
+        ]
         if any([True for x in metrics if x in reg_metrics]):
-            gen_report_shap_regres(results, output_dir=output_dir, plot_top_n_shap=plot_top_n_shap)
+            gen_report_shap_regres(
+                results, output_dir=output_dir, plot_top_n_shap=plot_top_n_shap
+            )
         else:
-            gen_report_shap_class(results, output_dir=output_dir, plot_top_n_shap=plot_top_n_shap)
-
+            gen_report_shap_class(
+                results, output_dir=output_dir, plot_top_n_shap=plot_top_n_shap
+            )

pydra_ml/tasks.py

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ def read_file(filename, x_indices=None, target_vars=None, group=None):
     if group is None:
         groups = list(range(X.shape[0]))
     else:
-        groups = data[:, [group]]
+        groups = data[group].values
     feature_names = list(X.columns)
     return X.values, Y.values, groups, feature_names
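The fix reflects that `read_file` loads the csv with pandas: the old NumPy-style slice `data[:, [group]]` fails on a DataFrame, while selection by column name returns the per-row group labels. A small sketch (the `site` column is hypothetical):

```python
import pandas as pd

data = pd.DataFrame(
    {"age": [63, 45, 71], "site": ["A", "B", "A"], "target": [1, 0, 1]}
)

# data[:, ["site"]]         # raises a TypeError on a DataFrame
groups = data["site"].values  # array(['A', 'B', 'A'], dtype=object)
```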
