Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Implement Pydra code-generators #2665

Draft
wants to merge 116 commits into
base: dev
Choose a base branch
from
Draft

Conversation

tclose
Copy link
Contributor

@tclose tclose commented Jun 29, 2023

Implementation of #2605

Tasks:

  • __print_pydra_usage__ for C++ commands.
  • __print_pydra_usage__ for Python commands
  • Script to loop through all cmds and print their "pydra usage" to *.py files
  • Fix up package TOML (maybe switch to hatchling?)
  • GitHub action to be run on new release to upload pydra-mrtrix3 package to PyPI
  • Tests to check generated code can print cmdlines properly (@Lestropie, do you have any ideas for this? I was thinking if there was a hidden "dry-run" option then

Blocked by:

@Lestropie
Copy link
Member

As mentioned in #2605, Python will require #1392 via ankitasanil#1 to have typed command-line parameters.
@ankitasanil: Still interest in completing the PR, or will @MRtrix3/mrtrix3-devs need to finish it off?

@Lestropie Lestropie marked this pull request as draft June 29, 2023 04:44
@tclose
Copy link
Contributor Author

tclose commented Jun 30, 2023

As mentioned in #2605, Python will require #1392 via ankitasanil#1 to have typed command-line parameters. @ankitasanil: Still interest in completing the PR, or will @MRtrix3/mrtrix3-devs need to finish it off?

I have written the Python version as best as I can with the current level of typing, but you are right that it will need more type information to get it to work in a guaranteed way.

On typing of files, how many different formats does MRtrix interact with (e.g. bvec, bval, response.txt, etc...) and would it be too much to ask to be able to give each format its own ArgType (like Image* and Track* do currently)? Because I have a PR that refactors Pydra's type-checking to be aware of file-formats (nipype/pydra#662), and it would be great to be able to specify them exactly instead of defaulting to a generic FileSet class.

@Lestropie
Copy link
Member

how many different (file) formats does MRtrix interact with

Assuming upstream exclusion of images & tractograms from that set, here's what comes to mind:

  • 1D & 2D numerical data input & output (currently ASCII only but also NPY after Npy file format support #2437)
  • TSFs
  • JSONs (in multiple contexts)
  • .dcm files
  • Other text files:
    • Lists of paths to other files (eg. stats command inputs)
    • Generic tabulated data (eg. parcellation lookup tables but others also)

As far as further custom types in user interfacing, I would argue that:

  • That could potentially be the case for TSFs / JSONs / .dcms, since those represent singular unambiguous cases of file extension / content of data / requisite parsing code / interpretation of data.
  • The specific examples you provide (bvec / bval / response functions) I'm more sceptical of, as they are just numerical data, albeit sometimes with expectations of dimensionality / size; they're just subsets of "Nd numerical data" to which certain interpretations are applied after the data are parsed.

Curious to hear others' thoughts.

@tclose
Copy link
Contributor Author

tclose commented Jun 30, 2023

how many different (file) formats does MRtrix interact with

Assuming upstream exclusion of images & tractograms from that set, here's what comes to mind:

  • 1D & 2D numerical data input & output (currently ASCII only but also NPY after Npy file format support #2437)

  • TSFs

  • JSONs (in multiple contexts)

  • .dcm files

  • Other text files:

    • Lists of paths to other files (eg. stats command inputs)
    • Generic tabulated data (eg. parcellation lookup tables but others also)

As far as further custom types in user interfacing, I would argue that:

  • That could potentially be the case for TSFs / JSONs / .dcms, since those represent singular unambiguous cases of file extension / content of data / requisite parsing code / interpretation of data.
  • The specific examples you provide (bvec / bval / response functions) I'm more sceptical of, as they are just numerical data, albeit sometimes with expectations of dimensionality / size; they're just subsets of "Nd numerical data" to which certain interpretations are applied after the data are parsed.

Curious to hear others' thoughts.

Yes, maybe file formats is a bit of a misnomer, data types would be more accurate/actually what is captured in Pydra. In any case, it would be very helpful on the Pydra code-gen front if these file-data-types could be specified at bit finer granularity

@tclose
Copy link
Contributor Author

tclose commented Jul 12, 2023

Ok, so this just needs some unit tests to test out the generated Pydra interfaces and to wait for #1392.

@Lestropie, on the unit tests, do you think it would be an issue if I added a --dry-run option to the general app options, which basically tests to see whether the provided options parse and then exits?

@tclose
Copy link
Contributor Author

tclose commented Jul 12, 2023

Ok, so this just needs some unit tests to test out the generated Pydra interfaces and to wait for #1392.

@Lestropie, on the unit tests, do you think it would be an issue if I added a --dry-run option to the general app options, which basically tests to see whether the provided options parse and then exits?

If you don't want to clutter the help menu further, maybe there could be special configure arg to insert it in?

@Lestropie
Copy link
Member

One note I might throw in here just because I encountered it while working on the code and it's of some limited relevance.

For a command-line option, in C++, you specify the option, and then you can specify some number of compulsory command-line arguments that must follow that option, and get consumed from the command-line by the parser immediately following the appearance of the option. Each of those arguments may have its own type.

In Python, this can't currently be done. In argparse, the option itself is assigned a type, as well as a number of arguments. Therefore you can't have >1 arguments following a command-line option where those arguments have different types to one another.

@tclose tclose force-pushed the pydra-usage branch 3 times, most recently from cb2e711 to 0dff23a Compare August 22, 2023 07:44
@daljit46
Copy link
Member

daljit46 commented Aug 24, 2023

Please note that this PR will need to conform to the formatting requirements introduced in #2652, thus all C++ files need to be formatted using clang-format (version 16). See here.

@tclose
Copy link
Contributor Author

tclose commented Aug 25, 2023

In Python, this can't currently be done. In argparse, the option itself is assigned a type, as well as a number of arguments. Therefore you can't have >1 arguments following a command-line option where those arguments have different types to one another.

I think the typical way to represent this is to use the typing.Tuple[MyType1, MyType2] form

@Lestropie
Copy link
Member

I think the typical way to represent this is to use the typing.Tuple[MyType1, MyType2] form

Having had a small taste of Pydra now, this might be better than doing the most permissible thing of just accepting any string in such a scenario, but doesn't fully address the issue to the extent that the MRtrix3 C++ interface does, at least to the best of my understanding.

Say for instance a command-line option takes as input a string and then an integer. Yes you could make that command-line option a typing.Tuple[str, int], but that would mean that the user could erroneously specify an integer and then a string, and that wouldn't be caught by the command-line parser.

PS. Relevance to #2580: making the argument types in C++ a bitwise flag indicating what set of types is permissible would be faithful to the typing.Tuple[MyType1, MyType2] usage I've seen.

@Lestropie Lestropie mentioned this pull request Feb 19, 2024
@Lestropie
Copy link
Member

@tclose: After discussion with @jdtournier, there is trepidation about embedding the Pydra interfaces within the repository.

On the plus side, it would enforce consistency, PR changeset would show interface changes that might otherwise be invisible to the CI checks of the online documentation, and be convenient from the perspective of workflow managers grabbing both the software and the interface to it.

On the minus side, it's implicitly taking on management responsibility for interfaces to an environment for which none of the core development team have experience; even if the interfaces were to be committed in a functional state, and the MRtrix3 commands were to not change, it would nevertheless be possible for there to be changes to Pydra that would necessitate revisions here.

What would it look like from the Pydra perspective if the MRtrix3 Pydra interface were to be provided as a stand-alone repository? Ie. What would be the magnitude of inconvenience? There's obviously the issue that on a tagged release of MRtrix3 you would need to update that external repository and generate a corresponding tag.

@tclose
Copy link
Contributor Author

tclose commented Feb 28, 2024

@Lestropie @jdtournier just to clarify that the interfaces are not being stored, instead generated from the MRtrix C++ & Python command declarations (see __print_usage_pydra__). Therefore, the interfaces should evolve with changes to the MRtrix commands as long as the command definition syntax stays stable.

It would be great if the print_usage_pydra special command (or whatever it is called) can live in the MRtrix source, as I'm not sure it will be easy to implement/maintain that in a separate code-base (if possible at all). However, the generation code and Python package can be stored in a separate repository if required. This is what has been typically done for other toolboxes.

We are hoping to move to a model where the Pydra interfaces could be integrated into the tool source repos to make it easier for them them stay in step with each other. Given it is my code, I'm happy to put my hand up to deal with any issues that might arise, as I would need to do this if it were stored in a separate repository. But in any case, totally up to you guys, I can integrate the changes into the existing nipype/pydra-mrtrix3 repo instead.

@jdtournier
Copy link
Member

Hey @tclose!
Just a few points/questions:

No problem at all from our point of view to include the code responsible for producing the interface definitions. That would indeed be really tricky to maintain outside of the core repo, but it's not something that I expect to change much once implemented, and it makes sense for it to be there. I would opt for a name such as __print_usage_json__ (assuming Pydra does expect JSON...) just to avoid giving the impression that this can only be used in Pydra - if it's JSON, there's no reason it can't be used in other contexts as well .

The bigger question from my point of view is whether to store the generated definitions in the core repo. It's trivial to generate them at build-time, and maybe make it part of the CMake install or something. But that doesn't require the generated interface definitions themselves to also be stored on the main repo. What I was trying to figure out was how you envisage these files to be used in practice. If they're stored alongside a full MRtrix install, it would be possible for an application like Pydra to query them at run-time and offer a version of the interface that is always guaranteed to be in lock-step with the version of MRtrix installed. If they're stored elsewhere, then there is always the risk of a mismatch between the version installed and the interface files, which could lead to odd errors. I expect with that kind of approach, you would need to host interface definitions for each version of MRtrix, along with the ability to determine the current version of MRtrix, and would only guarantee correct operation for official release versions. But the main reason I'm discussing this is because in the former case, I don't see a need to host these interface definitions on the main repo, while I might see a call for that with the latter approach (though I still think it would be cleaner from our point of view to have them hosted elsewhere, presumably somewhere more obviously related to the Pydra project).

What do you reckon?

@Lestropie
Copy link
Member

It might be the case that we need to consider separately:

  1. The implementation of __print_usage_pydra__
    (which without a doubt should be implemented within the main repository, even if it may one day need to be changed in response to Pydra changes)

  2. The storage of the outcomes of __print_usage_pydra__, which would extend Generate interfaces #2808, and potentially not require any more developer effort (eg. have a single script that generates both the interfaces and the online documentation), and CI / code review would potentially highlight CLI changes not exposed by help page changes.

  3. The additional content currently present in this PR.
    Current discussion with @jdtournier may have overlooked the fact that there's a lot of additional content here over and above the direct outputs from __print_usage_pydra__. Which if your concern is developer maintenance burden, this would exacerbate that quite a lot.

I wondered about an intermediate solution where we do 1 and 2 but not 3. However I don't think that that solves @jdtournier's concerns, and it might not make @tclose's handling any cleaner.

cmd/amp2response.cpp Outdated Show resolved Hide resolved
@tclose
Copy link
Contributor Author

tclose commented Feb 29, 2024

Ooops, I just realised that in between switching between the old build script and the cmake version I inadvertently committed my old pre-cmake build directory to the branch, which might have been confusing things. After speaking to Rob today we decided to strip everything out of this PR except to the print_usage_pydra special methods in favour of storing the package stub in https://github.com/nipype/pydra-mrtrix3. However, if you don't mind, I have kept in the GH action workflow which generates the interfaces and runs the tests with the PARSE_ONLY env var set.

The Pydra export is actually Python code rather than JSON so it will need to be specific to Pydra if that is ok. Also, the interface syntax is likely/hopefully going to change a fair bit in the near future if my proposal is accepted, so we can probably just leave this PR as a WIP until then (I can build and release the package on PyPI with this dev branch in the mean time) and the first cmake release.

@@ -57,7 +57,7 @@ void usage() {
ARGUMENTS
+ Argument ("amps", "the amplitudes image").type_image_in()
+ Argument ("mask", "the mask containing the voxels from which to estimate the response function").type_image_in()
+ Argument ("directions", "a 4D image containing the estimated fibre directions").type_image_in()
+ Argument ("directions_image", "a 4D image containing the estimated fibre directions").type_image_in()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this to avoid a clash with an option of the same name

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Hid / resolved prior post about the same issue)

Issue here is that the argument "directions" and the command-line option "directions" encode two very different pieces of information; so disambiguating by calling one an "image" may not be the optimal change. I'm hesitant to change the command-line option, as there are many commands relating to "amp" images that have an option "-directions" (even if it's not currently centralised with a const Option in core/ / src/). Conversely, the functionality of the positional argument here also appears in other commands; so renaming it to something different only here would also be slightly unusual.

A few possibilities come to mind:

  1. Across various commands (thinking dwi2response in particular), where an image encoding a unit 3-vector per voxel is used, adopt "orientations" rather than "directions".

  2. While it's technically applicable to the data type in terms of command naming convention, "peaks" would likely not be a good choice here, since can't have multiple vectors per voxel, and the data may have been derived using some approach other than ODF peak-finding.

  3. Rename here (and preferably in other comparable instances also) to "fibre_directions".

  4. Create a function that returns an Option, creating the current "-directions" command-line option with a custom help string per use (I'm pretty sure from experience it's the considerable difference in these help strings that has caused there to not be centralisation of this command-line option). Once this works, change the command-line option name to something like "-amp_directions".
    (We might need to think carefully about our use of the header field "Directions" also...)

@Lestropie
Copy link
Member

  • Add test data and tests for new interface generation capability (both C++ and Python)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants