Conversation

JamesWrigley
Member

This is for convenience when working with combined runs. The output looks like this:
[image: example output]

The bulk of the changes are to support reading the scantool settings from CONTROL data instead of RUN data.

@JamesWrigley JamesWrigley added the enhancement New feature or request label Apr 28, 2025
@JamesWrigley JamesWrigley self-assigned this Apr 28, 2025

codecov bot commented Apr 28, 2025

Codecov Report

Attention: Patch coverage is 90.90909% with 2 lines in your changes missing coverage. Please review.

Project coverage is 60.29%. Comparing base (f2ca821) to head (704af9c).

Files with missing lines Patch % Lines
src/extra/components/scantool.py 90.90% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #327      +/-   ##
==========================================
+ Coverage   60.18%   60.29%   +0.11%     
==========================================
  Files          30       30              
  Lines        4551     4564      +13     
==========================================
+ Hits         2739     2752      +13     
  Misses       1812     1812              


```diff
 self._active = self.source["isMoving"].ndarray().any()
-self._scan_type = values["scanEnv.scanType.value"]
+self._scan_type = get_value("scanEnv.scanType.value", is_str=True)
```
Collaborator

Would KeyData.as_single_value() not work here?

Member Author

That doesn't seem to work for strings, I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 run["FXE_DAQ_SCAN/MDL/KARABACON", "scanEnv.scanType.value"].as_single_value()

File ~/src/EXtra-data/extra_data/keydata.py:338, in KeyData.as_single_value(self, rtol, atol, reduce_by)
    336     value = reduce_by(data)
    337 elif isinstance(reduce_by, str) and hasattr(np, reduce_by):
--> 338     value = getattr(np, reduce_by)(data, axis=0)
    339 elif reduce_by == 'first':
    340     value = data[0]

File ~/.conda/envs/scratch/lib/python3.12/site-packages/numpy/lib/_function_base_impl.py:4001, in median(a, axis, out, overwrite_input, keepdims)
   3916 @array_function_dispatch(_median_dispatcher)
   3917 def median(a, axis=None, out=None, overwrite_input=False, keepdims=False):
   3918     """
   3919     Compute the median along the specified axis.
   3920 
   (...)   3999 
   4000     """
-> 4001     return _ureduce(a, func=_median, keepdims=keepdims, axis=axis, out=out,
   4002                     overwrite_input=overwrite_input)

File ~/.conda/envs/scratch/lib/python3.12/site-packages/numpy/lib/_function_base_impl.py:3894, in _ureduce(a, func, keepdims, **kwargs)
   3891             index_out = (0, ) * nd
   3892             kwargs['out'] = out[(Ellipsis, ) + index_out]
-> 3894 r = func(a, **kwargs)
   3896 if out is not None:
   3897     return out

File ~/.conda/envs/scratch/lib/python3.12/site-packages/numpy/lib/_function_base_impl.py:4053, in _median(a, axis, out, overwrite_input)
   4049 indexer = tuple(indexer)
   4051 # Use mean in both odd and even case to coerce data type,
   4052 # using out array if needed.
-> 4053 rout = mean(part[indexer], axis=axis, out=out)
   4054 if supports_nans and sz > 0:
   4055     # If nans are possible, warn and replace by nans like mean would.
   4056     rout = np.lib._utils_impl._median_nancheck(part, rout, axis)

File ~/.conda/envs/scratch/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:3860, in mean(a, axis, dtype, out, keepdims, where)
   3857     else:
   3858         return mean(axis=axis, dtype=dtype, out=out, **kwargs)
-> 3860 return _methods._mean(a, axis=axis, dtype=dtype,
   3861                       out=out, **kwargs)

File ~/.conda/envs/scratch/lib/python3.12/site-packages/numpy/_core/_methods.py:147, in _mean(a, axis, dtype, out, keepdims, where)
    145         ret = ret.dtype.type(ret / rcount)
    146 else:
--> 147     ret = ret / rcount
    149 return ret

TypeError: ufunc 'divide' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
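The root cause is visible at the bottom of the trace: `np.median` coerces the data through `mean()`, which divides the summed values by the element count, and division is undefined for string arrays. A minimal standalone reproduction (plain numpy, no EXtra-data needed; the string value is made up for illustration):

```python
import numpy as np

# median() partitions the data, then takes a mean() to coerce the dtype;
# mean() divides by the element count, which has no loop for string dtypes.
values = np.array(["Motor scan", "Motor scan", "Motor scan"])

try:
    np.median(values, axis=0)
except TypeError as exc:
    print("median on strings fails:", exc)

# A string-safe "single value" check can use uniqueness instead:
unique = np.unique(values)
if unique.size == 1:
    print("single value:", unique[0])
```

Depending on the numpy version the failure surfaces either at the summation or at the division step, but it is a `TypeError` (or its `UFuncTypeError` subclass) either way.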

@takluyver
Member

Are the data you're pooling together several similar scans, or one scan split over multiple runs? Or something else? Even with the warning, I think this could be misleading - e.g. if you stick two scans together, should the number of steps be the total of the steps in each scan?

The bulk of the changes are to support reading the scantool settings from CONTROL data instead of RUN data.

This feels somewhat awkward. Do we want a more general mechanism to get these kinds of values - either from RUN or CONTROL - when multiple runs are stuck together? Maybe with a check that the retrieved values are consistent across the pooled data? Or maybe just using .as_single_value() and fixing it for strings? (I think it predates the DAQ storing strings in CONTROL, so we probably didn't consider that)
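A consistency-checking getter of the kind floated here might look like the following sketch. The helper name is hypothetical and it operates on plain arrays; in EXtra-data terms the input would come from something like `run[source, key].ndarray()`:

```python
import numpy as np

def single_setting(values, what="setting"):
    """Hypothetical: collapse per-train readings of a setting (whether it
    came from RUN or CONTROL data) to one value, raising if the readings
    diverge across the pooled runs."""
    unique = np.unique(np.asarray(values))
    if unique.size != 1:
        raise ValueError(f"{what} diverges across the pooled data: {unique!r}")
    return unique[0].item()

print(single_setting(["Motor scan", "Motor scan"], what="scanType"))
# → Motor scan
```

Unlike a mean/median reduction, a uniqueness check works for strings and integers alike.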

@JamesWrigley
Member Author

Are the data you're pooling together several similar scans, or one scan split over multiple runs? Or something else? Even with the warning, I think this could be misleading - e.g. if you stick two scans together, should the number of steps be the total of the steps in each scan?

I think Ivette's use case was with several similar scans? (correct me if I'm wrong, @bermudei)
In either case, I would expect the Scan component to not know or care about the fact that the data is coming from multiple runs. If it's several similar scans then ideally it would combine the trains where the motor is at the same position (total number of steps $\approx$ number of steps in one run), and if it's one scan across multiple runs then I would expect the number of steps to be the total of the steps in each run.
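The "several similar scans" case above can be illustrated by binning trains on motor position (a numpy sketch with made-up positions; the real Scan component's binning is more involved):

```python
import numpy as np

# Two similar scans concatenated: each visits positions 0.0 and 0.1.
positions = np.array([0.0, 0.0, 0.1, 0.1,   # scan 1
                      0.0, 0.0, 0.1, 0.1])  # scan 2

# Group trains sharing (approximately) the same motor position, rounding
# to a tolerance so repeated visits to a position fall into one step.
steps, inverse = np.unique(np.round(positions, decimals=3), return_inverse=True)
trains_per_step = [np.flatnonzero(inverse == i) for i in range(steps.size)]

print(steps.size)  # → 2: ≈ the steps of one scan, not the total of both
```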

This feels somewhat awkward. Do we want a more general mechanism to get these kinds of values - either from RUN or CONTROL - when multiple runs are stuck together? Maybe with a check that the retrieved values are consistent across the pooled data? Or maybe just using .as_single_value() and fixing it for strings? (I think it predates the DAQ storing strings in CONTROL, so we probably didn't consider that)

Maybe? I don't know if it's worth it, since AFAIK this is the only component where we had to look at RUN values, but if you think it'd be better I can have a go at implementing it.

@bermudei
Collaborator

bermudei commented Apr 29, 2025

Well, in the example I did yesterday for users there were two similar scan runs (r152 and r153 in p7948), see below.

But I think they also did slightly different ones, so James' comment above should work for both cases. The reason why they needed this is statistics: they couldn't really see any signal in a single run. It was tough :(

@philsmt
Collaborator

philsmt commented Apr 29, 2025

As Thomas puts it, string values have only been in CONTROL for a few months now. I made a draft on how we might be able to fix this here: European-XFEL/EXtra-data#623

Concerning this PR, I'm hesitant to simply use the first value without any further check. This seems like baking dangerous assumptions into a component that could break in mysterious ways. I agree that when used properly this does the right thing, but that's not the situation I'm worried about. I'm afraid we risk undermining the automagic behaviour we want to enable with components.

I fully agree with James that EXtra-data fundamentally motivates treating multi-run data the same way as a single run. Code-wise, however, I would prefer to move any decision we make on how to treat this case upstream to EXtra-data, as it will come up in other components as well. I've already had cases where AdqRawChannel broke for multi-run DataCollections for the same reason. The .is_single_run() mechanism back then was meant as a first solution to stop it breaking in exactly the mysterious ways mentioned above.

Maybe it's time to properly handle this instead?

@JamesWrigley
Member Author

That's fair, though some properties like the acquisition time are stored as arrays, so as_single_value() would also need to be able to handle those. Alternatively, we could save the 'bounds' of each DataCollection that was unioned? That way we could check the settings at the first train of each sub-DataCollection, which would avoid having to load the data for all trains from disk. The downside is that it's not as robust as as_single_value() 🤔
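The 'bounds' idea could be sketched as follows. Here `bounds` and `value_at_train` are illustrative stand-ins, not real EXtra-data API:

```python
def setting_from_bounds(bounds, value_at_train):
    """Hypothetical: `bounds` is a list of (first_tid, last_tid) pairs, one
    per constituent run of a union; `value_at_train` returns the setting at
    a given train ID. Only the first train of each sub-DataCollection is
    sampled, so no full per-train read from disk is needed."""
    values = [value_at_train(first) for first, _ in bounds]
    if len(set(values)) != 1:
        raise ValueError(f"setting differs between unioned runs: {values!r}")
    return values[0]

# Fake per-train data for two runs with identical settings:
fake = {1000: "Motor scan", 2000: "Motor scan"}
print(setting_from_bounds([(1000, 1499), (2000, 2499)], fake.__getitem__))
# → Motor scan
```

As noted, this only checks one train per run, so a setting changed mid-run would slip through where as_single_value() would catch it.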

@philsmt
Collaborator

philsmt commented Apr 29, 2025

I was thinking in the direction of loading a RUN value for each file and doing conflict resolution then. That should be cheap enough; a typical sequence file covers ~50 s of measurement time.
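Per-file conflict resolution might be sketched like this (an assumed approach for illustration, not existing EXtra-data code; the function name and policies are hypothetical):

```python
from collections import Counter

def resolve_run_value(per_file_values, policy="strict"):
    """Hypothetical: given the RUN value read from each file of a merged
    collection, either require all files to agree ("strict") or take the
    most common value ("majority")."""
    counts = Counter(per_file_values)
    if len(counts) == 1:
        return next(iter(counts))
    if policy == "majority":
        return counts.most_common(1)[0][0]
    raise ValueError(f"conflicting RUN values across files: {dict(counts)}")

print(resolve_run_value(["Motor scan", "Motor scan"]))  # → Motor scan
```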

@tmichela
Member

Could an alternative be that the Scantool component accepts a list of runs rather than a union, such that you can read the RUN values for each and validate that they do not diverge?
Maybe that's something we could even upstream to EXtra-data (expose RUN data which is identical in each of the merged DataCollections, and raise if we access one that diverges).

@JamesWrigley
Member Author

JamesWrigley commented Apr 30, 2025

Isn't a list of runs what a union is supposed to be, though? Conceptually, I think what people would want in these situations is a single object that represents all the runs, which they can .select() on etc.

Maybe that's something we could even upstream to EXtra-data (expose RUN data which is indentical in each of the merged DataCollection / raise if we acccess one that diverge).

That could work 👍 And it would simplify this PR since we could stick to reading RUN data.

@tmichela
Member

Isn't a list of runs what a union is supposed to be though?

Conceptually yes, but technically a union merges DataCollections, which in the end results in a single object with no knowledge of what a run is, and by default it ends up ignoring RUN data if the unioned DataCollections are not from a single run. The suggestion I made merely helps work around this limitation, though I prefer the second option I mentioned.
