
Conversation

@matham (Contributor) commented May 25, 2025

Description

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?

The main reason is that the default values for the split parameters don't always work and need to be tunable. I was trying to properly tune detection so I could test classification, and to do that I needed to also tune those unexposed parameters.

This is particularly important when lots of cells are concentrated in a single area: you need to set the standard deviation threshold low enough to catch all of them, but that merges them into clusters which then need to be split. The 3d filtering used while splitting a cluster requires slightly different values than the initial 3d filtering that created the cluster, and being able to run iterations of erosion on the volume can help split these clusters, so these values must be tunable.

What does this PR do?

  1. Added the split_ parameters to napari.

  2. Removed the split_soma_diameter because it is redundant with soma_spread_factor as explained here.

  3. Changed the split_ parameters to use default values in microns instead of voxels, so they're consistent with the other parameters. This doesn't change the numbers actually used, only the default values in the function parameters.

  4. Separated the classification and detection batch sizes because, e.g. on my machine, a batch of only about 10 detection planes fits in the GPU, whereas 128+ classification cubes fit.

  5. Added the same / similar docstrings to all the various main functions and to napari. Also re-ordered the docstrings so they match the parameter order.

  6. In napari, I re-ordered the detection parameters so they flow in the order in which filtering happens, i.e. 2d filtering -> 3d filtering -> cluster splitting.

  7. I added a file under benchmarks (filter_debug.py) that runs through the full filter stack and saves the output generated by each stage to disk, so that you can visualize what the stages do and how changing parameters affects them. It's hard to tune all these parameters when you only have the final output (candidate cells), which takes a while to produce and leaves you guessing at how the input parameters actually affect the outcome. A rough sketch of the idea is shown after this list.

    There's already something like this for saving the output of the 3d filtering step, but having the 2d step output in particular is helpful. It would be nice to visualize all these debug outputs directly in napari, but that would require a lot more work and may not be worth the complexity. Although I do wonder how most users approach this tuning stage without being able to see how the parameters change the output, and I suspect something like this would be invaluable when tuning for new samples.

    This should also be helpful for generating images for the docs, to show what each of the algorithm steps is doing.
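Roughly, the idea is something like the following minimal sketch. The wrapper is generic; the `stages` it runs would be wired up to cellfinder's actual filters (laplacian/gaussian enhancement, 2d filter, 3d ball filter), so the names and structure here are illustrative rather than the real filter_debug.py:

```python
from pathlib import Path

import numpy as np
import tifffile


def save_stage(output_dir: Path, name: str, volume: np.ndarray) -> None:
    # Write one stage's output to disk so it can be inspected in napari/Fiji.
    output_dir.mkdir(parents=True, exist_ok=True)
    tifffile.imwrite(output_dir / f"{name}.tif", volume)


def run_and_save_stages(signal: np.ndarray, stages, output_dir: Path) -> np.ndarray:
    """Run an ordered list of (name, callable) filter stages, saving each result.

    Each callable takes a volume and returns the filtered volume; the saved
    files ("input", "enhanced", "2d_filter", "3d_filter", ...) can then be
    compared to see what each stage and parameter change actually does.
    """
    save_stage(output_dir, "input", signal)
    current = signal
    for name, stage in stages:
        current = stage(current)
        save_stage(output_dir, name, current)
    return current
```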

References

#464 (comment)

How has this PR been tested?

Test suite, plus testing with some samples.

Is this a breaking change?

I did drop the split_soma_diameter parameter as explained. I also renamed batch_size to classification_batch_size and detection_batch_size. I'm not sure if workflows need to be updated? I'm assuming we need to similarly expose these parameters there. Is there a single place where they are added, or are there multiple places?

Does this PR require an update to the documentation?

These changes should be propagated to the public docs, but this PR doesn't add them there.

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality (unit & integration)
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

Here's some example output:

  • input: Input
  • enhanced: After the laplacian/gaussian filters
  • 2d_filter: After the 2d filter
  • 3d_filter: After the 3d filter
  • type: Colored based on cell, needs-to-split, and artifact
  • type_split: Same as above, except the split cells have a sphere colored in

@matham (Contributor, Author) commented May 26, 2025

We should hold off on merging this because I'm not actually sure splitting is the way to go.

I'm not sure why the splitting stuff was added originally. In my case I needed it because we have a whole range of cell intensities; you can kind of see it in the pic above. You want a threshold that captures even the weaker cells, so you pick a low one. But then, around clumps of very bright cells, the cells and the areas around them are all above the threshold, so the whole region looks like a single structure. You can see this happening above.

I tried using the splitting parameters to fix this, the idea being that after splitting these bright cells would get segmented. The problem is that it's not really working: splitting assumes the cells look like circles next to each other that can be eroded into individual cells, but that's not the case here. Instead the whole area is lit up, so when splitting we're just eroding the boundary around the whole cluster. I tried various parameters but ultimately it didn't work very well.

Additionally, because the number of splitting iterations defaults to 10, after 10 iterations the clusters are pretty much eroded away anyway, so I'm not sure this helped in the past for people using it naively. But I suppose if you have large, uniform cells it may actually have segmented them properly.
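To make the failure mode concrete, here is a toy sketch of erosion-based splitting (not cellfinder's actual implementation): it works when blobs are only joined by thin bridges, but on a uniformly bright region the erosion just shrinks the cluster from the outside.

```python
import numpy as np
from scipy import ndimage


def split_by_erosion(cluster: np.ndarray, n_iter: int = 10) -> np.ndarray:
    """Iteratively erode a binary cluster until it separates into components."""
    eroded = cluster.astype(bool)
    for _ in range(n_iter):
        labels, n_components = ndimage.label(eroded)
        if n_components > 1:
            return labels  # the cluster has split into separate cells
        eroded = ndimage.binary_erosion(eroded)
    # Never split: either it really was one cell, or a solid bright blob
    # that just shrank towards nothing.
    return ndimage.label(eroded)[0]
```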


My first thought was to add the 2d filtering step to the splitting process, so that it's not only doing 3d filtering (like now). With 2d filtering, the laplacian/gaussian filters would potentially separate the cells, so ball filtering would operate on actual cell edges rather than just the overall cluster edges.

But the problem is essentially the tension in the standard deviation (SD) parameter: you need to set it low enough above the mean to catch the dark cells, but then all the areas around the bright cells get clustered as one cell. If you instead set it high, you lose all the dark cells. Running it twice with different values would potentially lead to duplicates.

So, my thought was to make the mean/SD dynamic. That is, instead of a single mean/SD value for each plane, tile the plane and use the local mean/SD for the threshold. This way, bright areas will have a higher mean and so the bright cells will be separable, while in dark areas we'll have a lower mean and we'll still get the dark cells.

I actually needed something like this already and tried adding it, because we tile our acquisition and use lasers from each side. Due to microscope idiosyncrasies these lasers may have different power levels, so a single per-plane mean/SD didn't work when they were quite different: it would only find cells in the brighter half. But a purely local mean/SD didn't work either; we suddenly got a bunch of cells outside the brain because the local mean there was super low. So I gave up and assumed we'd need to re-acquire those samples.

But seeing that splitting was not working, I tried this version of a dynamic threshold: threshold = max(single per-plane mean/SD, tile mean/SD). This worked really well. Outside the brain, the threshold stayed above the dark background. In dark areas we similarly got the cells above this threshold (these are the cells you set the threshold for). In bright areas, the threshold became higher due to the max, preventing the giant clusters. With some ball filter / filter width tuning I got good results, as you can see in the examples below.


So, I'm thinking maybe leave the split parameters out of napari for now (but keep the changes to the docs etc. in the PR, just don't show them to users), and add a tiling parameter for the dynamic threshold. If it's zero (the default), behave as before. If it's set to a value, which will be a multiple of the soma size, use that to tile as described. For reference, I used a value of 50 for this, with 50% overlap during tiling so the mean/SD don't suddenly jump.
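A minimal sketch of the idea, using non-overlapping tiles for simplicity (the real version would use 50% overlapping tiles, and the names here are illustrative, not an actual cellfinder API):

```python
import numpy as np


def dynamic_threshold_mask(plane: np.ndarray, n_sds: float, tile_size: int) -> np.ndarray:
    """Per-pixel threshold = max(per-plane mean + n*SD, tile mean + n*SD)."""
    global_thresh = plane.mean() + n_sds * plane.std()
    thresh = np.full(plane.shape, global_thresh, dtype=float)

    for y in range(0, plane.shape[0], tile_size):
        for x in range(0, plane.shape[1], tile_size):
            tile = plane[y : y + tile_size, x : x + tile_size]
            local_thresh = tile.mean() + n_sds * tile.std()
            # The max keeps the global floor outside the brain and in dark
            # areas, while raising the bar inside bright clumps.
            thresh[y : y + tile_size, x : x + tile_size] = max(global_thresh, local_thresh)

    return plane > thresh
```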


Examples:

In the dark area both approaches are similar. In the bright areas, the dynamic threshold helps get more accurate cells.

  • dark_static: Dark area, static mean/SD
  • dark_dynamic: Dark area, dynamic mean/SD
  • bright_static1: Bright area 1, static mean/SD
  • bright_dynamic1: Bright area 1, dynamic mean/SD
  • bright_static2: Bright area 2, static mean/SD
  • bright_dynamic2: Bright area 2, dynamic mean/SD


The previous images used a threshold of 1. Changing it to 0.5 to get more of the dark cells gives the following, which show a similar pattern:

  • dark_static: Dark area, static mean/SD
  • dark_dynamic: Dark area, dynamic mean/SD
  • bright_static1: Bright area 1, static mean/SD
  • bright_dynamic1: Bright area 1, dynamic mean/SD
  • bright_static2: Bright area 2, static mean/SD
  • bright_dynamic2: Bright area 2, dynamic mean/SD

@adamltyson (Member) commented May 27, 2025

> Splitting assumes the cells look like circles next to each other that can be eroded into individual cells.

I think previously the assumption was that if entire cells were labelled, then cell dendrites would be what connects them to each other in the image, and these could be more easily eroded.

> Additionally, because the number of splitting iterations defaults to 10, after 10 iterations the clusters are pretty much eroded away anyway, so I'm not sure this helped in the past for people using it naively.

IIRC, the remaining cells are recorded at each iteration, so even if they get eroded to nothing, they still "count".

> But I suppose if you have large, uniform cells it may actually have segmented them properly.

Yep, I think this is much closer to the original cellfinder use cases.

Generally, I think a lot of assumptions are no longer valid now that cellfinder is being used for other types of data. Nuclear cfos LSFM data looks very different to whole-cell rabies tracing serial 2p data (the original application). We're not wedded to any of the current implementation though, so if there's a better way, that's great.

@matham (Contributor, Author) commented May 28, 2025

So what do you think about the global/local thresholding approach? I can convert this PR into that and keep the splitting parameters out of napari. Or, if splitting is useful in some applications and you'd like to expose it in napari, I can keep the PR as is and make a new PR with the threshold stuff.

As I described above, the global threshold would be used to find the darkest cells, and it can potentially even do what the bright-tiles algorithm is meant to do, since it eliminates any voxels below that threshold, which regions outside the brain are. The local tiled threshold would then help further define cells in bright regions. You'd need one, potentially two, new parameters: the tile size and a second threshold value (I'm not sure the latter would be needed, but perhaps being able to tune the tiled threshold separately would be helpful). To keep the old behavior you'd just use the global one by setting the tile size to zero.

I'm not suggesting dropping the bright-tiles algorithm currently used to determine in/out-of-brain areas, but this could be an alternate way of doing the same thing: excluding cell-looking things outside the brain.
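As a toy illustration of the out-of-brain point (made-up numbers, not real data): voxels outside the brain sit below the per-plane global threshold, so they are rejected before the local tiled threshold is even considered.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy plane: dark background outside the brain on the left, bright tissue on the right.
outside = rng.normal(50, 5, (100, 50))
inside = rng.normal(200, 20, (100, 50))
plane = np.hstack([outside, inside])

global_thresh = plane.mean() + 1.0 * plane.std()
print((outside > global_thresh).mean())  # ~0.0: out-of-brain voxels are rejected
print((inside > global_thresh).mean())   # a good fraction of tissue voxels pass
```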

@alessandrofelder (Member) left a comment

Thanks @matham - excited about these improvements.

I am finding it hard to wrap my head around all these at once, so I've opened #535 to help me structure my thoughts.

I've made some initial tiny comments while reading the code for the first time in more depth.

> So what do you think about the global/local thresholding approach?

Here I was wondering whether the global/local thresholding could be done as a preprocessing step?

Generally, we are happy with changes as long as we're sure cellfinder still works well on original application data.

n_free_cpus : int
    How many CPU cores to leave free.
voxel_sizes : 3-tuple of floats
    Size of your voxels in the z, y, and x dimensions.

Comment (Member): in microns? or in pixels?

    Size of your voxels in the z, y, and x dimensions.
network_voxel_sizes : 3-tuple of floats
    Size of the pre-trained network's voxels in the z, y, and x dimensions.
batch_size : int

Comment (Member): I think it would be good to have "sensible", conservative defaults (they don't need to be perfect), especially for the more computational options (maybe batch_size=1 and pin_memory=False?), because I think the prospect of having to check GPU/CPU memory and understand what pinning memory means in this context will scare off users.
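For concreteness, a hedged sketch of what such conservative defaults could look like if the batches are built with a torch DataLoader (this wrapper and its defaults are illustrative, not cellfinder's actual code):

```python
from torch.utils.data import DataLoader, Dataset


def make_loader(
    dataset: Dataset,
    batch_size: int = 1,       # conservative: fits on any GPU/CPU, just slower
    pin_memory: bool = False,  # conservative: avoids extra page-locked host RAM
    num_workers: int = 0,
) -> DataLoader:
    # pin_memory=True keeps batches in page-locked host memory so host-to-GPU
    # copies are faster; users with enough RAM and a GPU can enable it.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        pin_memory=pin_memory,
        num_workers=num_workers,
    )
```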

max_workers: int
    The number of sub-processes to use for data loading / processing.
    Defaults to 8.
pin_memory: bool

Comment (Member): should this be made a keyword argument?

@matham (Contributor, Author) commented Jun 6, 2025

This PR as is is definitely doing too much at once, mainly because I started using the branch to stage all the related work so I could test it more easily. So I will break it up into 3+ PRs:

  1. Make the docs consistent and add the split parameters everywhere (mostly main functions), except in napari so they're not exposed to end users there.
  2. Add benchmarks/filter_debug.py for testing the filters.
  3. Expose pin_memory, batch size, and co. with good defaults everywhere, and also expose them in napari.
  4. Make WIP PR for the local threshold changes.

Does this seem ok? I'd love to combine 1 and 3 because 3 will conflict with 1 in git, since they both touch similar lines of code. But I can also branch 3 off 1, so when 1 is merged it should be ok with 3 in terms of git?

@alessandrofelder (Member) commented:

Sounds like a good plan!

> But I can also branch 3 off 1, so when 1 is merged it should be ok with 3 in terms of git?

Yep - 1 should be straightforward to review, and I can prioritise it, so we avoid conflicts.

@matham (Contributor, Author) commented Jun 8, 2025

See #542, #544 and #545.
