Add HDF5 RegionReferences for efficiently windowing waveform segments from continuous data #48
Conversation
Resolve Python 2/3 inconsistencies causing tests to fail. Merge remote-tracking branch 'origin/dev/waveform_references' into dev/waveform_references
Force-pushed from 706ec22 to 7f64a33
Make I/O of masked arrays of float/integer data backwards compatible with older ASDF versions. Ignore some test files.
Fix broken identity test
Using this branch to merge a few other dev branches. This branch is not ready to be merged into master yet; closing for now.
Again, sorry for taking so long! What's the status of this? I've read a bit through the code you've written (even though you've closed it), and using region references is a nice idea! One might have to be a bit careful, as ASDF files can be modified: if, for example, the sampling rate of a trace is changed, the reference would have to be deleted/invalidated/recalculated.

I'm not sure if you've seen these: http://seismicdata.github.io/pyasdf/asdf_data_set.html#pyasdf.asdf_data_set.ASDFDataSet.get_waveforms and http://seismicdata.github.io/pyasdf/large_continuous_datasets.html

So a slower (but I don't know how much slower) version of what you did might be achievable by storing a given set of parameters under a name, e.g. instead of associating a region reference with a name, store the parameters of the corresponding get_waveforms() call. This could probably also be somehow integrated into the get_waveforms() interface.

Let me know what you think!
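For illustration only, here is a rough sketch of that named-parameter-set idea; the NamedWindows data type, file name, paths, and station codes are invented, and this is not existing pyasdf functionality, just the documented add_auxiliary_data() / get_waveforms() calls used to store and replay a window definition:

```python
import numpy as np
import pyasdf
from obspy import UTCDateTime

ds = pyasdf.ASDFDataSet("archive.h5")  # hypothetical archive file

# Instead of a region reference, store the parameters of a get_waveforms()
# call under a name, using a tiny placeholder array as the payload.
params = {
    "network": "XX", "station": "ABC", "location": "",
    "channel": "HHZ", "tag": "raw_recording",
    "starttime": str(UTCDateTime(2015, 1, 1, 0, 0, 0)),
    "endtime": str(UTCDateTime(2015, 1, 1, 0, 1, 0)),
}
ds.add_auxiliary_data(
    data=np.zeros(1), data_type="NamedWindows",
    path="event_000001/XX_ABC_HHZ", parameters=params)

# Later: replay the stored call to re-extract the same window.
p = ds.auxiliary_data.NamedWindows.event_000001.XX_ABC_HHZ.parameters
st = ds.get_waveforms(
    network=p["network"], station=p["station"], location=p["location"],
    channel=p["channel"], tag=p["tag"],
    starttime=UTCDateTime(p["starttime"]), endtime=UTCDateTime(p["endtime"]))
```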
Hey Lion, no worries about the delay! Thanks for circling back to this. I will reply to your comment from PR #49 (Add I/O for masked Traces) here as well, since both pull requests were part of my solution to a single problem.
Objective:
The Short Version:
The Long Version: I derived an earthquake catalog containing ~160,000 events, each of which I wanted to correlate with its 200 nearest neighbours for double-difference relocation. I was given a code to do this correlation, but it required waveforms to be stored in a single miniSEED file per event. This granularity made the problem scale poorly: when I tried to run hundreds of parallel threads, the High Performance Computing Center (HPCC) quickly informed me that I was overwhelming shared I/O resources.

So I used ASDF to solve the problem. First, I tried dumping all of my miniSEED event files into a single ASDF file, but retrieving waveforms was slow because of the high granularity. So I tried working with continuous waveforms using a few different permutations of storage configurations and access methods. Using RegionReferences was my first idea, but the problem is that RegionReferences can't be combined with ExternalLinks. I wanted to use ExternalLinks so that I could split my (12.5 TB) dataset into multiple files but maintain a single access point to all my data.

The solution I settled on was to break my 12.5 TB, 9-year dataset into yearly ASDF files, create ExternalLinks to each of these in a single "head" file, and access the event-segmented waveforms using precomputed lookup addresses (see the sketch below). Using precomputed addresses for event-segmented waveforms was necessary in my case; hundreds of millions of calls to get_waveforms() would have been too slow. The HPCC didn't give me trouble when I correlated my dataset using this configuration and >1000 CPUs, and the wall-time was tenable (~1 week for ~30,000,000 event-pairs).

NOTE: Masked trace I/O was not a critical component of this method, but it meant that I could store continuous waveforms without ballooning the number of Datasets in my ASDF file when encountering gappy data.
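For concreteness, here is a rough sketch of that head-file layout in plain h5py; the file names, group paths, and sample indices are made up for illustration and this is not the actual code used for the correlation run:

```python
import h5py

# Build a small "head" file that external-links one file per year, so a
# single file handle exposes the whole multi-file archive.
years = [2009, 2010, 2011]  # a subset of the 9-year archive
with h5py.File("head.h5", "w") as head:
    waveforms = head.require_group("Waveforms")
    for year in years:
        # Link the waveform group of each yearly ASDF (HDF5) file
        # (assumes the archive_<year>.h5 files already exist on disk).
        waveforms[str(year)] = h5py.ExternalLink(
            "archive_{}.h5".format(year), "/Waveforms")

# Precomputed lookup "addresses": a dataset path within the head file plus
# start/end sample indices for each event window.
lookup = {
    "event_000001": ("Waveforms/2009/XX.ABC/XX.ABC..HHZ__raw", 3600000, 3606000),
}

with h5py.File("head.h5", "r") as head:
    path, i0, i1 = lookup["event_000001"]
    segment = head[path][i0:i1]  # reads only the windowed samples
```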
I have added a new data group, /References, for fast access to referenced regions of continuous waveforms. The intended use is to efficiently create segmented waveforms by windowing continuous data, without duplicating the waveforms themselves.
An example use case: one has a continuous waveform archive and wishes to process event-segmented waveforms based on some event catalog. Re-running the same processing for a different event catalog would typically require re-extracting the waveforms, which is likely to duplicate a large amount of data and can be time-consuming. Using HDF5 region references allows the user to simply create a new set of pointers to the relevant data, saving time and storage.
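For concreteness, a minimal h5py sketch of the windowing mechanism; the file name, dataset paths, and the layout of the References group below are only illustrative and not necessarily the exact layout this pull request creates:

```python
import h5py
import numpy as np

with h5py.File("continuous.h5", "w") as f:
    # Stand-in continuous waveform: one hour of 100 Hz samples.
    data = f.create_dataset("Waveforms/XX.ABC/HHZ", data=np.zeros(360000))

    # Region reference spanning a 60 s event window (samples 100000:106000).
    regref = data.regionref[100000:106000]

    # Store the reference under a /References group (names are illustrative).
    ref_dtype = h5py.special_dtype(ref=h5py.RegionReference)
    refs = f.create_dataset("References/event_000001/XX.ABC.HHZ",
                            shape=(1,), dtype=ref_dtype)
    refs[0] = regref

with h5py.File("continuous.h5", "r") as f:
    ref = f["References/event_000001/XX.ABC.HHZ"][0]
    window = f[ref][ref]  # dereference: read only the windowed samples
```

Re-running for a different event catalog then only requires writing a new set of references, not copying any waveform samples.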