
Audit workflow arguments #162

Closed
tsalo opened this issue Nov 29, 2018 · 15 comments
Labels
discussion issues that still need to be discussed

Comments

@tsalo
Member

tsalo commented Nov 29, 2018

In addition to the arguments we're considering modifying or adding in #161, I think we should audit several of the other arguments in the workflow.

FastICA parameters

Now that we're using sklearn for the ICA, we get a warning in cases of non-convergence instead of an error. Overall, I think this is a good thing, but it also obscures potential problems. The two relevant arguments to sklearn.decomposition.FastICA are max_iter (maximum number of iterations) and tol (convergence tolerance). The default for tol is 0.0001, while our default, set by the conv argument, is 0.000025. My questions here are: is conv something anyone ever plays with (i.e., do we actually need the argument), and is there a reason to use our much stricter threshold over sklearn's default? If not, then I propose that we drop the conv argument from tedana and let sklearn use its default in tedica.

On a related note, the default maximum number of iterations is 200, while mdp's was 5000, IIRC. The problem is that FastICA is failing to converge for both of our test datasets (Cornell 3-echo rest and NIH 5-echo task). I think we should increase max_iter to at least 1000 when we call FastICA, which will probably increase the amount of time tedica takes, but should still be faster than mdp.
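For concreteness, a minimal sketch of what the call might look like with sklearn's default tol and a raised max_iter (data and n_components are placeholders, and 1000 is only the value proposed above, not something we've settled on):

```python
import warnings
from sklearn.decomposition import FastICA

# sklearn's default tol is 0.0001; our current conv default is 0.000025.
# max_iter=1000 is the proposed increase from sklearn's default of 200.
ica = FastICA(n_components=n_components, max_iter=1000, tol=0.0001)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    mmix = ica.fit_transform(data)  # data: (n_samples, n_features) array

# sklearn only warns on non-convergence, so we would have to check for the
# warning ourselves if we want to surface the problem to users.
if any('did not converge' in str(w.message) for w in caught):
    print('FastICA did not converge; consider increasing max_iter.')
```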

Other workflow parameters

I think that --filecsdata and --strict are currently unused. Unless we plan to re-implement them, I think we can drop them.

Does anyone use --sourceTEs, --denoiseTEs, or --stabilize?

  • I don't recall anyone ever talking about sourceTEs or denoiseTEs, but neither seems very useful, unless I'm misunderstanding them.
  • sourceTEs: While I maybe understand switching between using catd vs optcom for fitting models (though I've never tried the former), if we wanted to fit to just some of the echoes, why not just feed those echoes into the workflow?
  • denoiseTEs: This just determines whether additional files are written out. Unless they're regularly used, I propose that we incorporate this switch into a general verbose argument as we've been discussing in Reduce intermediate outputs #17.
  • Per Should PCA component selection leverage multi-echo information? #101, it looks like we will be making MLEPCA the default (with no PCA decision tree), which means that stabilize won't have an impact on the default pipeline. I also don't know if it has any meaningful impact on the results in most cases when the decision tree is used. I propose that we simply treat it as a variant of the decision tree. So, when we do split the decision tree from MLEPCA, we will have some argument like --tedpca that can take values of mle, kundu, and kundu-stabilize, with mle being the default (see the sketch after this list). Does that sound good?
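For the record, a rough argparse sketch of the proposed argument (names and help text are only illustrative, not final):

```python
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument('--tedpca',
                    choices=['mle', 'kundu', 'kundu-stabilize'],
                    default='mle',
                    help='Method for TEDPCA dimensionality reduction: MLE '
                         'dimensionality estimation (default), the Kundu '
                         'decision tree, or the decision tree with the extra '
                         'stabilization criteria.')
```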
@tsalo tsalo added the discussion issues that still need to be discussed label Nov 29, 2018
@handwerkerd
Member

Just glancing at the code, it looks like --filecsdata and --strict were post-meica v2.5 additions, so whatever they did may have been removed when the code was revised back towards v2.5. I don't have time right now to dig into what information was saved with --filecsdata, but if we're rethinking how to record component selection information, I concur that it's better to focus on doing this right than on keeping a specific option label.

I haven't used --sourceTEs, but depending on how it's implemented, I could see it being a useful option. For something like BIDS-compliant data, where you don't need to give it every echo's file name, it would be good to have an option that lets you run the algorithm using just a subset of the acquired echoes. I don't know how this option is implemented in practice, but I think it's a reasonable bit of functionality to keep.

Given that denoiseTEs and stabilize both have opaque functionality based on their names and seem to be partially subsumed by other options, I don't have any particular attachment to them.

@tsalo
Member Author

tsalo commented Nov 30, 2018

@handwerkerd Thanks for your input. I've opened a new PR with relevant changes (#163), but will keep --sourceTEs untouched.

Also, given that we want to make MLE the default component selection method for TEDPCA, we should revisit kdaw and rdaw as well, since they're only used for the decision tree approach. I don't know if we want arguments for parameters of a non-default method that only power users will touch.

@tsalo
Member Author

tsalo commented Dec 1, 2018

It actually seems like --sourceTEs is not equivalent to inputting a subset of the echoes to the full pipeline. The first time it's used is in tedpca, so all echoes are used for T2* estimation and optimal combination, as well as to fit dependence models in component selection. When --sourceTEs is used, tedpca is performed on the concatenated data for just the selected echoes, and returns a dimensionally reduced version of that concatenated data. Then, tedica seems to be performed on the dimensionally reduced concatenated data, rather than on a dimensionally reduced version of the optimally combined data (default) or of the full concatenated data (when --sourceTEs is set to -1).
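To spell out my reading of the behavior, here's an illustrative sketch (the function, argument names, and stacking axis are my own shorthand, not the actual tedana code):

```python
import numpy as np

def select_pca_input(catd, optcom, source_tes=None):
    """Schematic of how --sourceTEs appears to choose the data fed to tedpca.

    catd : (n_samples, n_echos, n_vols) concatenated multi-echo data
    optcom : (n_samples, n_vols) optimally combined data
    source_tes : None for the default (optcom), -1 for all echoes,
                 or a list of echo indices to use
    """
    if source_tes is None:
        return optcom                       # default: optimally combined data
    if source_tes == -1:
        echo_idx = range(catd.shape[1])     # all echoes
    else:
        echo_idx = source_tes               # user-selected subset of echoes
    # Stack the chosen echoes into one matrix; tedpca reduces this input and
    # tedica then runs on the reduced version of it, not on reduced optcom.
    return np.vstack([catd[:, e, :] for e in echo_idx])
```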

It would be nice to know what the effect of running tedpca and tedica on concatenated data rather than on optimally combined data is. Does anyone know the conceptual basis for the former (given that the latter is the default)?


On another note, in terms of revisiting rdaw and kdaw, how important are these arguments when using the decision tree version of TEDPCA? @dowdlelt and @handwerkerd, I believe that you two know the most about these arguments.

If removing the arguments is no good, what about combining them into a single, comma-separated argument? At minimum, we can place those arguments in their own section of the CLI argument documentation, to make it clear that they only apply when the decision tree is used (a rough sketch of what I mean is below).
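In argparse terms, the grouping could look roughly like this (a sketch only; the default values of 10 and 1 are the ones mentioned later in this thread):

```python
from argparse import ArgumentParser

parser = ArgumentParser()
tree_opts = parser.add_argument_group(
    'decision tree arguments',
    'These options only have an effect when the decision-tree TEDPCA is used.')
tree_opts.add_argument('--kdaw', type=float, default=10.,
                       help='Dimensionality augmentation weight for Kappa.')
tree_opts.add_argument('--rdaw', type=float, default=1.,
                       help='Dimensionality augmentation weight for Rho.')
```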

@dowdlelt
Collaborator

dowdlelt commented Dec 2, 2018

In my experience, kdaw has a particularly dramatic impact on convergence. I primarily experimented with kdaw, given that it was easily modifiable when calling the function. In cases where the default kdaw resulted in far too many components (my data consist of >450 timepoints), I could reduce the value to reduce the number of components. Sometimes the problem was the inverse. I did not like this fiddliness, largely because I don't have a good understanding of what the parameters do, exactly. I also don't like subject-specific, arbitrary choices in data processing.

Is it correct that they determine the cutoff for thresholding the PCA maps, which in turn determines the number of PCA components to keep for ICA? I base that on commit comments from bitbucket here:
- Reduced default kdaw to 5, and rdaw to 0
- New rdaw setting of 0, which is a fixed F(1,ne-1) value corresponding to p<0.05

I am very much looking forward to not using them. That said, I think a single comma-separated parameter would be fine as something that can be added if one is going to all of the trouble of using the decision-tree TEDPCA instead of the new defaults.

@emdupre
Member

emdupre commented Dec 2, 2018

Sorry for being so long-delayed to this thread, and thanks for starting it @tsalo !! A few thoughts:

  • I am a big fan of cleaning up the workflow arguments, especially since many of them do not align with our milestone of transparency (as evidenced by the confusion even in these discussions). To recap what I think has been agreed on so far:

    • dropping the conv argument and using sklearn defaults
    • dropping the --strict argument, since we've reverted to the v2.5 decision tree
    • dropping the --filecsdata argument, which would explicitly write out the component selection data. The previous default was False, but I believe our stance is that the default should be True, so having both this argument and verbose is confusing

    It also sounds like we're converging on:

    • dropping the --kdaw and --rdaw flags, both of which were dimensionality augmentation weights for the two parameters (Kappa and Rho) and were very difficult to interpret in a systematic fashion
  • I'm revisiting the discussion in Should PCA component selection leverage multi-echo information? #101 and now I'm not sure that I'd like to keep the decision tree 😆 But assuming we do, I think I'd rather look at why --stabilize is used, since it only seems to add a few more rejection criteria. Personally, I'm not sure it would merit its own tree... I'd like to see it instead incorporated into an improved TEDPCA decision tree. Here, of course, the question is how to test those decisions :)

  • Re --sourceTEs, my understanding is something like this:

    • there is no benefit to dropping echos from the T2* map calculation, since you're penalizing yourself (and it may not even be possible to calculate a T2* map if you drop too many echos).
    • But, when you sub-select TEs for the component generation, then you're trying to find the dimensions that are shared across those components -- which you can evaluate based on thresholds from the T2* map.

    This brings up the question of why you'd be interested in dimensions shared across some echos and not others. I've never used this option myself, so I'd have to think on it a bit longer, but not sure if someone who has experience with the option (or a better intuition for it) wants to chime in !

@tsalo
Member Author

tsalo commented Dec 2, 2018

@dowdlelt Yes, I think that they operate like that. I believe that increasing kdaw (rdaw) should decrease the Kappa (Rho) threshold, which in turn will increase the number of significant components from the PCA to keep for ICA. This is based on the following:

```python
kappa_thr = np.average(sorted([fmin, getelbow(ct_df['kappa'], return_val=True) / 2, fmid]),
                       weights=[kdaw, 1, 1])
```

First, the three threshold sources (fmin, the Kappa elbow, and fmid) are sorted in ascending order, so the weight for averaging from kdaw corresponds to the smallest of the three. If kdaw is greater than 1, then the smallest value will have the highest weight, leading to a lower Kappa threshold.

Of course, -1 is a special value for both kdaw and rdaw, but I don't quite understand the math that's used in that case. I also noticed that using negative values for weights in numpy.average gives nonsensical results (as discussed here), so we should probably check the argument range in the function.
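A toy example (the threshold values are made up, not from any dataset) showing both the effect of kdaw on the weighted average and the negative-weight issue:

```python
import numpy as np

# Made-up stand-ins for fmin, the halved Kappa elbow, and fmid, in sorted order
thresholds = sorted([30.0, 45.0, 90.0])

for kdaw in [1, 5, 10]:
    kappa_thr = np.average(thresholds, weights=[kdaw, 1, 1])
    print(kdaw, kappa_thr)  # 55.0, ~40.7, 36.25: larger kdaw lowers the threshold

# Negative weights are accepted by np.average but can yield values outside the
# range of the inputs, which is why checking the argument range would help.
print(np.average(thresholds, weights=[-1, 1, 1]))  # 105.0, above every threshold
```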

Perhaps just collecting kdaw and rdaw in a decision tree-specific subgroup is enough.


@emdupre

  • I think that we need the decision tree option as a template for treating tedana as the development sandbox we want it to be. Besides, in the (albeit limited) test datasets we're using, MLE doesn't do anything. The original argument in Should PCA component selection leverage multi-echo information? #101 is that the decision tree is returning too many components, but I'm seeing the opposite.
    • In any case, I think we should wait until we've done some quantitative comparisons in the tedana-comparison repository (possibly associated with a paper corresponding to a major release of tedana) to drop the decision tree, since the decision tree is at least citable, unlike v3.2 of TEDICA.
  • With the way I've set up the code in [ENH] Split automatic dimensionality detection from decision tree in TEDPCA #164, it would be easy enough to drop the stabilize option.
  • To be honest, your description of sourceTEs is going over my head a bit. Do you know what one might get from using the concatenated data vs. the optimally combined data there (totally ignoring the possibility of selecting a subset of the echoes)?

@tsalo
Member Author

tsalo commented Dec 6, 2018

I think that everything discussed here so far has been handled by #163 and #164. The only other thing I was thinking we could work on is simplifying how manually accepted components are handled. For starters, I propose that we could move --manacc, --ctab, and --mix into a subgroup like "Arguments for re-running tedana" or something.

A more aggressive change (that I'm somewhat in favor of) would be to drop --ctab and --mix. Since the output files follow a naming convention, we know what the component table and mixing matrix files will be named when someone runs tedana twice on the same data. In cases where a user provides --manacc, tedana could just grab the component table and mixing matrix from the output directory, as long as they exist. I guess it just depends on what workflow people prefer. Would they rather run tedana on their input data with a label like "initial", manually select components, and then re-run with a label like "final" (which will generate some, but not all, of the tedana outputs), or would they rather just run tedana twice with the same label? We can always add relevant information to the logger to make it clear that files are being re-generated, so I don't see much of a cost to the latter approach.
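As a rough illustration of the latter approach (the helper and file names are hypothetical, not tedana's actual naming convention):

```python
import os.path as op

def infer_rerun_files(out_dir, label):
    """Guess the component table and mixing matrix paths from a previous run.

    The file names here are placeholders for whatever naming convention
    tedana settles on; this just shows the idea.
    """
    ctab = op.join(out_dir, 'comp_table_{0}.txt'.format(label))  # hypothetical name
    mmix = op.join(out_dir, 'mixing_{0}.tsv'.format(label))      # hypothetical name
    for path in (ctab, mmix):
        if not op.isfile(path):
            raise FileNotFoundError(
                '{0} not found; run tedana once without --manacc first.'.format(path))
    return ctab, mmix
```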

@jbteves
Collaborator

jbteves commented Jan 17, 2019

Hi all, I'm trying to track some of the changes made between the BitBucket 3.2.2 version and this one. Is the current version robust to kdaw and rdaw values? If not, see ME-ICA/me-ica#10: at some point the default kdaw and rdaw were changed between there and here. (This seemed like the best place to put this comment -- I'll move it if that's wrong.)

@emdupre
Member

emdupre commented Jan 17, 2019

Thanks for confirming ! We've removed the kdaw and rdaw flags in 69624dc, in line with our Transparent and Reproducible Processing milestone (since it was difficult for users to form opinions on how / when to adjust those values).

@tsalo
Member Author

tsalo commented Jan 17, 2019

The default values of 10 and 1 for kdaw and rdaw, respectively, are now hardcoded here when users use the kundu option for --tedpca (which is no longer the default). Those values appear to be the same as the ones used in the most recent version of ME-ICA, which was updated about one year ago.

Our rationale for hardcoding these values is, as @emdupre said, because they're difficult to understand for users and because we've changed the default tedpca method to MLE dimensionality estimation. That said, power users could go into the code and edit these values as needed.

@jbteves
Collaborator

jbteves commented Jan 17, 2019

Ah, I see -- the values I have are from a different branch than master, the v3.2 branch seen here. Has the project decided to retain only behavior from v3? I realize that the project has decided not to retain backwards compatibility with ME-ICA, but I thought it would be good to make a record of why this divergence exists.

@tsalo
Member Author

tsalo commented Jan 17, 2019

We don't support v3, although we may merge that back in in the future (as one option, not as the only method). We have shifted to only supporting v2.5 of the component selection algorithm. We have that information, and why we shifted, here in the FAQ section of the documentation site, although it's a fairly new addition.

@jbteves
Collaborator

jbteves commented Jan 17, 2019

Okay, I see, sorry about all that.

@emdupre
Member

emdupre commented Jan 17, 2019

Thanks for checking on it ! It's good to have more people thinking through these choices, too :)

@tsalo
Member Author

tsalo commented Feb 9, 2019

I think that we can close this issue now. We may ultimately want to discuss removing or reorganizing --manacc, --ctab, and --mix, but I think that would be better dealt with in a separate issue, to prevent bloat.

@tsalo tsalo closed this as completed Mar 17, 2019