Conversation

JamesWrigley
Member

This is for convenience when working with combined runs. The output looks like this:
[image: example output]

The bulk of the changes are to support reading the scantool settings from CONTROL data instead of RUN data.

@JamesWrigley JamesWrigley added the enhancement New feature or request label Apr 28, 2025
@JamesWrigley JamesWrigley self-assigned this Apr 28, 2025

codecov bot commented Apr 28, 2025

Codecov Report

Attention: Patch coverage is 90.90909% with 2 lines in your changes missing coverage. Please review.

Project coverage is 60.29%. Comparing base (f2ca821) to head (704af9c).

Files with missing lines Patch % Lines
src/extra/components/scantool.py 90.90% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #327      +/-   ##
==========================================
+ Coverage   60.18%   60.29%   +0.11%     
==========================================
  Files          30       30              
  Lines        4551     4564      +13     
==========================================
+ Hits         2739     2752      +13     
  Misses       1812     1812              


```diff
 self._active = self.source["isMoving"].ndarray().any()
-self._scan_type = values["scanEnv.scanType.value"]
+self._scan_type = get_value("scanEnv.scanType.value", is_str=True)
```
Collaborator

Would KeyData.as_single_value() not work here?

Member Author

That doesn't seem to work for strings, I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 run["FXE_DAQ_SCAN/MDL/KARABACON", "scanEnv.scanType.value"].as_single_value()

File ~/src/EXtra-data/extra_data/keydata.py:338, in KeyData.as_single_value(self, rtol, atol, reduce_by)
    336     value = reduce_by(data)
    337 elif isinstance(reduce_by, str) and hasattr(np, reduce_by):
--> 338     value = getattr(np, reduce_by)(data, axis=0)
    339 elif reduce_by == 'first':
    340     value = data[0]

File ~/.conda/envs/scratch/lib/python3.12/site-packages/numpy/lib/_function_base_impl.py:4001, in median(a, axis, out, overwrite_input, keepdims)
   3916 @array_function_dispatch(_median_dispatcher)
   3917 def median(a, axis=None, out=None, overwrite_input=False, keepdims=False):
   3918     """
   3919     Compute the median along the specified axis.
   3920 
   (...)   3999 
   4000     """
-> 4001     return _ureduce(a, func=_median, keepdims=keepdims, axis=axis, out=out,
   4002                     overwrite_input=overwrite_input)

File ~/.conda/envs/scratch/lib/python3.12/site-packages/numpy/lib/_function_base_impl.py:3894, in _ureduce(a, func, keepdims, **kwargs)
   3891             index_out = (0, ) * nd
   3892             kwargs['out'] = out[(Ellipsis, ) + index_out]
-> 3894 r = func(a, **kwargs)
   3896 if out is not None:
   3897     return out

File ~/.conda/envs/scratch/lib/python3.12/site-packages/numpy/lib/_function_base_impl.py:4053, in _median(a, axis, out, overwrite_input)
   4049 indexer = tuple(indexer)
   4051 # Use mean in both odd and even case to coerce data type,
   4052 # using out array if needed.
-> 4053 rout = mean(part[indexer], axis=axis, out=out)
   4054 if supports_nans and sz > 0:
   4055     # If nans are possible, warn and replace by nans like mean would.
   4056     rout = np.lib._utils_impl._median_nancheck(part, rout, axis)

File ~/.conda/envs/scratch/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:3860, in mean(a, axis, dtype, out, keepdims, where)
   3857     else:
   3858         return mean(axis=axis, dtype=dtype, out=out, **kwargs)
-> 3860 return _methods._mean(a, axis=axis, dtype=dtype,
   3861                       out=out, **kwargs)

File ~/.conda/envs/scratch/lib/python3.12/site-packages/numpy/_core/_methods.py:147, in _mean(a, axis, dtype, out, keepdims, where)
    145         ret = ret.dtype.type(ret / rcount)
    146 else:
--> 147     ret = ret / rcount
    149 return ret

TypeError: ufunc 'divide' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
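The root cause is visible at the bottom of the trace: `np.median` coerces the data through `mean()`, which divides the summed values by the element count, and division is undefined for string arrays. A minimal standalone reproduction (plain numpy, no EXtra-data needed; the string value is made up for illustration):

```python
import numpy as np

# median() partitions the data, then takes a mean() to coerce the dtype;
# mean() divides by the element count, which has no loop for string dtypes.
values = np.array(["Motor scan", "Motor scan", "Motor scan"])

try:
    np.median(values, axis=0)
except TypeError as exc:
    print("median on strings fails:", exc)

# A string-safe "single value" check can use uniqueness instead:
unique = np.unique(values)
if unique.size == 1:
    print("single value:", unique[0])
```

Depending on the numpy version the failure surfaces either at the summation or at the division step, but it is a `TypeError` (or its `UFuncTypeError` subclass) either way.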

@takluyver
Member

Are the data you're pooling together several similar scans, or one scan split over multiple runs? Or something else? Even with the warning, I think this could be misleading - e.g. if you stick two scans together, should the number of steps be the total of the steps in each scan?

The bulk of the changes are to support reading the scantool settings from CONTROL data instead of RUN data.

This feels somewhat awkward. Do we want a more general mechanism to get these kinds of values - either from RUN or CONTROL - when multiple runs are stuck together? Maybe with a check that the retrieved values are consistent across the pooled data? Or maybe just using .as_single_value() and fixing it for strings? (I think it predates the DAQ storing strings in CONTROL, so we probably didn't consider that)
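A consistency-checking getter of the kind floated here might look like the following sketch. The helper name is hypothetical and it operates on plain arrays; in EXtra-data terms the input would come from something like `run[source, key].ndarray()`:

```python
import numpy as np

def single_setting(values, what="setting"):
    """Hypothetical: collapse per-train readings of a setting (whether it
    came from RUN or CONTROL data) to one value, raising if the readings
    diverge across the pooled runs."""
    unique = np.unique(np.asarray(values))
    if unique.size != 1:
        raise ValueError(f"{what} diverges across the pooled data: {unique!r}")
    return unique[0].item()

print(single_setting(["Motor scan", "Motor scan"], what="scanType"))
# → Motor scan
```

Unlike a mean/median reduction, a uniqueness check works for strings and integers alike.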

@JamesWrigley
Member Author

Are the data you're pooling together several similar scans, or one scan split over multiple runs? Or something else? Even with the warning, I think this could be misleading - e.g. if you stick two scans together, should the number of steps be the total of the steps in each scan?

I think Ivette's use case was with several similar scans? (correct me if I'm wrong, @bermudei)
In either case, I would expect the Scan component to not know or care about the fact that the data is coming from multiple runs. If it's several similar scans then ideally it would combine the trains where the motor is at the same position (total number of steps $\approx$ number of steps in one run), and if it's one scan across multiple runs then I would expect the number of steps to be the total of the steps in each run.
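The "several similar scans" case above can be illustrated by binning trains on motor position (a numpy sketch with made-up positions; the real Scan component's binning is more involved):

```python
import numpy as np

# Two similar scans concatenated: each visits positions 0.0 and 0.1.
positions = np.array([0.0, 0.0, 0.1, 0.1,   # scan 1
                      0.0, 0.0, 0.1, 0.1])  # scan 2

# Group trains sharing (approximately) the same motor position, rounding
# to a tolerance so repeated visits to a position fall into one step.
steps, inverse = np.unique(np.round(positions, decimals=3), return_inverse=True)
trains_per_step = [np.flatnonzero(inverse == i) for i in range(steps.size)]

print(steps.size)  # → 2: ≈ the steps of one scan, not the total of both
```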

This feels somewhat awkward. Do we want a more general mechanism to get these kinds of values - either from RUN or CONTROL - when multiple runs are stuck together? Maybe with a check that the retrieved values are consistent across the pooled data? Or maybe just using .as_single_value() and fixing it for strings? (I think it predates the DAQ storing strings in CONTROL, so we probably didn't consider that)

Maybe? I don't know if it's worth it, since AFAIK this is the only component where we had to look at RUN values, but if you think it'd be better I can have a go at implementing it.

@bermudei
Collaborator

bermudei commented Apr 29, 2025

Well, in the example I did yesterday for users there were two similar scan runs (r152 and r153 in p7948), see below.

But I think they also did slightly different ones, so James' comment above should work for both cases. The reason why they needed this is statistics: they couldn't really see any signal in a single run. It was tough :(

@philsmt
Collaborator

philsmt commented Apr 29, 2025

As Thomas puts it, string values have only been in CONTROL for a few months now. I made a draft on how we might be able to fix this here: European-XFEL/EXtra-data#623

Concerning this PR, I'm hesitant to simply use the first value without any further check. This seems like baking dangerous assumptions into a component that could break in mysterious ways. I agree that when used properly this does the right thing, but that's not the situation I'm worried about. I'm afraid we risk undermining the automagic behaviour we want to enable with components.

I fully agree with James that EXtra-data fundamentally motivates treating multi-run data the same way as a single run. Code-wise, however, I would prefer to move any decision we make on how to treat this case upstream to EXtra-data, as it will come up in other components as well. I've already had cases where AdqRawChannel broke for multi-run DataCollections for the same reason. The .is_single_run() mechanism back then was meant as a first solution to stop it breaking in exactly the mysterious ways mentioned above.

Maybe it's time to properly handle this instead?

@JamesWrigley
Member Author

That's fair, though some properties like the acquisition time are stored as arrays, so as_single_value() would also need to be able to handle those. Alternatively, we could save the 'bounds' of each DataCollection that was unioned? That way we could check the settings at the first train of each sub-DataCollection, which would avoid having to load the data for all trains from disk. The downside is that it's not as robust as as_single_value() 🤔
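The 'bounds' idea could be sketched as follows. Here `bounds` and `value_at_train` are illustrative stand-ins, not real EXtra-data API:

```python
def setting_from_bounds(bounds, value_at_train):
    """Hypothetical: `bounds` is a list of (first_tid, last_tid) pairs, one
    per constituent run of a union; `value_at_train` returns the setting at
    a given train ID. Only the first train of each sub-DataCollection is
    sampled, so no full per-train read from disk is needed."""
    values = [value_at_train(first) for first, _ in bounds]
    if len(set(values)) != 1:
        raise ValueError(f"setting differs between unioned runs: {values!r}")
    return values[0]

# Fake per-train data for two runs with identical settings:
fake = {1000: "Motor scan", 2000: "Motor scan"}
print(setting_from_bounds([(1000, 1499), (2000, 2499)], fake.__getitem__))
# → Motor scan
```

As noted, this only checks one train per run, so a setting changed mid-run would slip through where as_single_value() would catch it.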

@philsmt
Collaborator

philsmt commented Apr 29, 2025

I was thinking in the direction of loading a RUN value for each file and doing conflict resolution then. That should be cheap enough; a typical sequence file covers ~50 s of measurement time.
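Per-file conflict resolution might be sketched like this (an assumed approach for illustration, not existing EXtra-data code; the function name and policies are hypothetical):

```python
from collections import Counter

def resolve_run_value(per_file_values, policy="strict"):
    """Hypothetical: given the RUN value read from each file of a merged
    collection, either require all files to agree ("strict") or take the
    most common value ("majority")."""
    counts = Counter(per_file_values)
    if len(counts) == 1:
        return next(iter(counts))
    if policy == "majority":
        return counts.most_common(1)[0][0]
    raise ValueError(f"conflicting RUN values across files: {dict(counts)}")

print(resolve_run_value(["Motor scan", "Motor scan"]))  # → Motor scan
```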

@tmichela
Member

Could an alternative be that the Scantool component accepts a list of runs rather than a union, such that you can read the RUN values for each and validate that they do not diverge?
Maybe that's something we could even upstream to EXtra-data (expose RUN data which is identical in each of the merged DataCollections, and raise if we access one that diverges).

@JamesWrigley
Member Author

JamesWrigley commented Apr 30, 2025

Isn't a list of runs what a union is supposed to be, though? Conceptually, I think what people would want in these situations is a single object that represents all the runs, which they can .select() on etc.

Maybe that's something we could even upstream to EXtra-data (expose RUN data which is indentical in each of the merged DataCollection / raise if we acccess one that diverge).

That could work 👍 And it would simplify this PR since we could stick to reading RUN data.

@tmichela
Member

Isn't a list of runs what a union is supposed to be though?

Conceptually yes, but technically a union merges DataCollections, which in the end results in a single object with no knowledge of what a run is, and by default it ends up ignoring RUN data if the unioned DataCollections are not from a single run. The suggestion I made merely helps work around this limitation, though I prefer the second option I mentioned.
