Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems with Durin plugin from XDS #30

Open
JunAishima opened this issue Nov 6, 2023 · 16 comments
Open

Problems with Durin plugin from XDS #30

JunAishima opened this issue Nov 6, 2023 · 16 comments

Comments

@JunAishima
Copy link

We are having some issues with XDS using Durin where the plugin appears to have problems with memory allocation. This doesn't happen with all datasets. If it will be helpful, I can get permission to share the dataset offline.

  • Eiger2 9M
  • RHEL 8
  • I believe the Durin is from the v2023-10 tagged version
  • COLSPOT step of XDS causes this problem
  • does not depend on image number

malloc(): unaligned tcache chunk detected
free(): invalid pointer
Aborted (core dumped)

@JunAishima
Copy link
Author

Graeme suggested offline that I could try to check the last release - I downloaded the old plugin and tried, and got different issues. Hopefully this helps though.

Old plugin:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
xds_par 0000000000561823 Unknown Unknown Unknown
libpthread-2.28.s 00007F806560ECF0 Unknown Unknown Unknown
libc-2.28.so 00007F80650C6E85 IO_default_xsput Unknown Unknown
libc-2.28.so 00007F8065099AEB IO_vfprintf Unknown Unknown
libc-2.28.so 00007F80650BBB15 vsprintf Unknown Unknown
libc-2.28.so 00007F80650A2078 sprintf Unknown Unknown
durin-plugin.so 00007F80646F8922 get_dectris_eiger Unknown Unknown
durin-plugin.so 00007F80646F7BCA create_dataset_de Unknown Unknown
durin-plugin.so 00007F80646F99F6 get_detector_info Unknown Unknown
durin-plugin.so 00007F80646F7321 plugin_open Unknown Unknown
xds_par 0000000000417307 generic_data_plug 301 generic_data_plugin.f90
xds_par 0000000000430F2F generic_getfrm
25757 MAIN_XDS.f90
xds_par 000000000042F0B3 getfrm
25859 MAIN_XDS.f90
xds_par 000000000042BC06 parser_ 13513 MAIN_XDS.f90
xds_par 000000000041AC1B xds_ 21547 MAIN_XDS.f90
xds_par 0000000000418661 MAIN__ 1 MAIN_XDS.f90
xds_par 0000000000415822 Unknown Unknown Unknown
libc-2.28.so 00007F806506DD85 __libc_start_main Unknown Unknown
xds_par 0000000000415729 Unknown Unknown Unknown

New plugin:
free(): invalid pointer
Aborted (core dumped)

@graeme-winter
Copy link
Collaborator

Thanks for that @JunAishima - I presume any data set should trigger the problem even blanks? A test example would really help for debugging. The stack trace you sent is from XDS not Durin

@c-mita
Copy link
Contributor

c-mita commented Nov 8, 2023

I would wager the error is coming from inside durin:

libc-2.28.so 00007F80650A2078 sprintf Unknown Unknown
durin-plugin.so 00007F80646F8922 get_dectris_eiger Unknown Unknown
durin-plugin.so 00007F80646F7BCA create_dataset_de Unknown Unknown
durin-plugin.so 00007F80646F99F6 get_detector_info Unknown Unknown
durin-plugin.so 00007F80646F7321 plugin_open Unknown Unknown

It's hard to tell which function exactly; the names appeared to be shortened and there are several starting with get_dectris_eiger (and they all use sprintf).

Can you output the symbol table for durin-plugin.so (something like objdump -TC durin-plugin.so) and identify which function exactly is causing the problem? (may need to work out the offset the plugin was loaded at?)

I would guess this is just a case of sprintf just running past the end of an input buffer. I thought I was more careful than that, but...

I think the code should just be updated to use snprintf. I think when I wrote this I was unnecessarily targeting a C89...

@graeme-winter
Copy link
Collaborator

Thank you @c-mita - given example data this would be much easier to debug

That said - @JunAishima - if you could attach output of h5ls -rv that would be maybe helpful, so can look for a long string / likely overrun

@JunAishima
Copy link
Author

h5ls_out_5111.txt

What is the best way to send a dataset to you? I got the ok for that

@graeme-winter
Copy link
Collaborator

@JunAishima thank you I see the data arriving as we discussed out of band. The h5ls gave me no clues: I will now see if I can reproduce your problems.

@graeme-winter
Copy link
Collaborator

OK, thank you for sharing the data

I note that

/entry/data/data_000001  External Link {NYX_B006_5111_data_000001.h5//entry/data/data}
/entry/data/data_000002  External Link {NYX_B006_5111_data_000002.h5//entry/data/data}
/entry/data/data_000003  External Link {NYX_B006_5111_data_000003.h5//entry/data/data}
/entry/data/data_000004  External Link {NYX_B006_5111_data_000004.h5//entry/data/data}

There are 4 entries here but you only sent 3 data files. The 3rd data set contains only a few images but is properly formatted

Grey-Area Downloads :) $ h5ls -r NYX_B006_5111_data_000003.h5 
/                        Group
/entry                   Group
/entry/data              Group
/entry/data/data         Dataset {8/Inf, 3262, 3108}

I will test in a second (downloading the big data files now) but I suspect that this inconsistency is what is causing durin some problems. We may need to figure out a workaround.

If you only process the first 2000 images in XDS does everything work?

@graeme-winter
Copy link
Collaborator

I get a segmentation fault under macOS as well, so we have a real problem here. Now investigating to work out precisely what the problem is. I suspect just processing 2,000 frames won't help

@graeme-winter
Copy link
Collaborator

graeme-winter commented Nov 9, 2023

Error for me (surprisingly) at

err = H5Dread(ds_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buffer);

reading the pixel_mask

@graeme-winter
Copy link
Collaborator

(gdb) p data_desc->dims[1] 
$2 = 0
(gdb) p data_desc->dims[2] 
$3 = 0

OK, so we have malloc'd no bytes? That won't end well

  • I am surprised that malloc(0) does not give an error
  • I do not know why the dimensions are zero

But this would trigger an error when one comes to actually write some data in

@JunAishima
Copy link
Author

This is an Eiger2 - I see what appear to be the dimensions of the image in the HDF file at /entry/instrument/detector/module/data_size. Another set of fields that should contain the dimensions are /entry/instrument/detector/detectorSpecific/x_pixels_in_detector (also y_pixels...)

@c-mita
Copy link
Contributor

c-mita commented Nov 9, 2023

I am surprised that malloc(0) does not give an error

malloc(0) returns a pointer that should not dereferenced (usually NULL).

The missing external file might be causing the problem.

When calculating the dataset dimensions for a Dectris dataset, it walks over all numbered datasets in /entry/data in order, counting the length of the overall dataset (the bounds set by XDS don't matter at this point). During this walk, it overwrites the x-y dimensions from previous datasets with whatever the last one it read was.

I would expect H5DOpen2 to fail when trying to open a dataset, but maybe not? Maybe it just gets an empty descriptor? And hence the last thing written to the x and y dimensions is 0?

@JunAishima
Copy link
Author

aah, I see that now. it looks like we have four data files defined (probably defined during the collection), and only 3 data files written out. Not sure how they precisely define the experiment, but it would help if it was being defined correctly. This could still be derailed by aborting a collection (not sure if they did that here).

The plugin should be able to handle this case properly as well, so sounds like there could be improvements on both sides.

graeme-winter added a commit that referenced this issue Nov 16, 2023
@graeme-winter
Copy link
Collaborator

Working on this properly now, I think it is a little more subtle than perhaps it at first looks. Short version - this is an unhandled error from

  if (retval < 0) {
    free(frame_counts);
  } else {
    memcpy(desc->dims, dims, 3 * sizeof(*dims));
    desc->data_width = data_width;
    eiger_desc->n_data_blocks = n_datas;
    eiger_desc->block_sizes = frame_counts;
  }
  return retval;

where the return status is not checked. I think if I patch it to not do the free but to reshape this we can make it work as is, which is inelegant but more useful (probably) than actually error'ing

graeme-winter added a commit that referenced this issue Nov 16, 2023
Work with what we have
@graeme-winter
Copy link
Collaborator

@JunAishima after some hacking I have something which appears to (at least partially) work - the indexing fails but that is probably a geometry error - would you like to have a go?

durin-plugin.zip

Currently working on the fix-30 branch, will need to do some additional tidying as there is an unhandled route which I think we probably want to handle (but maybe no?)

@graeme-winter
Copy link
Collaborator

@JunAishima sorry for the slow pick-up - but did you get a chance to test? I was tidying up unmerged branches

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants