
Conversation

@malcolmw

I have added a new data group, /References, for fast access to referenced regions of continuous waveforms. The intended use is to efficiently create segmented waveforms by windowing continuous data, without duplicating the underlying samples.

An example use case: one has a continuous waveform archive and wishes to process event-segmented waveforms based on some event catalog. Re-running the same processing for a different event catalog would typically require re-extracting the waveforms, which duplicates a lot of data and can be time consuming. HDF5 region references let the user simply create a new set of pointers to the relevant data, saving both time and storage.
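As a minimal sketch of the mechanism in plain h5py (the file name, dataset paths, and window indices here are hypothetical):

```python
import h5py
import numpy as np

with h5py.File("archive.h5", "w") as f:
    # One day of 100 Hz continuous data (synthetic, for illustration).
    continuous = f.create_dataset("Waveforms/XX.STA1/HHZ",
                                  data=np.random.randn(8640000))

    # A region reference to one event window; no samples are copied.
    ref = continuous.regionref[360000:366000]
    ref_dtype = h5py.special_dtype(ref=h5py.RegionReference)
    refs = f.create_dataset("References/event_0001", (1,), dtype=ref_dtype)
    refs[0] = ref

with h5py.File("archive.h5", "r") as f:
    ref = f["References/event_0001"][0]
    source = f[ref]       # the dataset the reference points into
    window = source[ref]  # reads only the referenced samples
```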

@coveralls

coveralls commented Apr 27, 2018

Coverage increased (+0.01%) to 89.069% when pulling 50b5a66 on malcolmw:dev/waveform_references into fb88b35 on SeismicData:master.

Resolve Python 2/3 inconsistencies causing tests to fail.


Merge remote-tracking branch 'origin/dev/waveform_references' into dev/waveform_references
@malcolmw force-pushed the dev/waveform_references branch from 706ec22 to 7f64a33 on April 27, 2018 16:15
@malcolmw
Author

malcolmw commented May 22, 2018

Using this branch to merge a few other dev branches. This branch is not ready to be merged into master yet, closing for now.

@malcolmw closed this May 22, 2018
@krischer
Member

Again sorry for taking so long! What's the status of this?

I've read a bit through the code you've written (even though you've closed it) and using region references is a nice idea! One might have to be a bit careful, as ASDF files can be modified: if, for example, the sampling rate of a trace is changed, the reference would have to be deleted, invalidated, or recalculated.

I'm not sure if you've seen this, but pyasdf already contains a function that does something semantically fairly similar, except that it does not use precomputed references but gets everything on the fly. It still only extracts the data samples one cares about and does not read data outside the given start and end time:

http://seismicdata.github.io/pyasdf/asdf_data_set.html#pyasdf.asdf_data_set.ASDFDataSet.get_waveforms and http://seismicdata.github.io/pyasdf/large_continuous_datasets.html
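For reference, a minimal sketch of the on-the-fly approach (the file name, network, station, and tag values are placeholders):

```python
import obspy
import pyasdf

ds = pyasdf.ASDFDataSet("continuous.h5")
# Extracts only the samples between starttime and endtime.
st = ds.get_waveforms(network="XX", station="STA1", location="",
                      channel="HHZ", tag="raw_recording",
                      starttime=obspy.UTCDateTime(2018, 1, 1, 0, 0, 0),
                      endtime=obspy.UTCDateTime(2018, 1, 1, 0, 1, 0))
```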

So a slower (though I don't know how much slower) version of what you did might be achievable by storing a given set of parameters under a name, e.g. instead of associating a region reference with a name, store the parameters of the get_waveforms() method.

This could probably also somehow be integrated into the .ifilter() function, which is a lot more powerful for data-selection purposes: http://seismicdata.github.io/pyasdf/querying_data.html
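A hypothetical sketch of such a query, based on the querying docs (the tag attribute on the yielded objects is an assumption):

```python
import pyasdf

ds = pyasdf.ASDFDataSet("continuous.h5")
# Wildcards and comparisons compose into a single waveform query.
for station in ds.ifilter(ds.q.channel == "*Z",
                          ds.q.starttime >= "2018-01-01"):
    st = station.raw_recording  # stream of the matching waveforms
```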

Let me know what you think!

@malcolmw
Author

malcolmw commented Sep 12, 2018

Hey, Lion,

No worries about the delay! Thanks for circling back to this. I will reply to your comment from PR #49 (Add I/O for masked Traces) here as well, since both pull requests were part of my solution to a single problem.

Objective:

Create a very efficient method to repeatedly extract segmented waveforms from continuous waveforms.

The Short Version:

Having a mechanism to quickly extract waveform segments from continuous waveforms is a very useful feature. To achieve this functionality for my own use case, I developed a method that stored precomputed lookup addresses (Dataset handles and sample-offset indices), but I think it could be made more robust and general, and I would love to discuss it.

The Long Version:

I derived an earthquake catalog containing ~160,000 events, each of which I wanted to correlate with its 200 nearest neighbours for double-difference relocation. I was given code to do this correlation, but it required waveforms to be stored in a single miniSEED file per event. This granularity made the problem scale poorly: when I tried to run hundreds of parallel threads, the High Performance Computing Center (HPCC) quickly informed me that I was overwhelming shared I/O resources. So I turned to ASDF to solve the problem.

First, I tried dumping all of my miniSEED event files into a single ASDF file, but retrieving waveforms was slow because of the high granularity. So I experimented with continuous waveforms, using a few different combinations of storage configurations and access methods.

Using RegionReferences was my first idea, but the problem is that RegionReferences can't be combined with ExternalLinks. I wanted to use ExternalLinks so that I could split my (12.5 TB) dataset into multiple files but maintain a single access point to all my data.

The solution I settled on was to break my 12.5 TB, 9-year dataset into yearly ASDF files, create ExternalLinks to each of these in a single "head" file, and access the event-segmented waveforms using precomputed lookup addresses. Precomputed addresses for the event-segmented waveforms were necessary in my case; hundreds of millions of calls to get_waveforms() would have been too slow.
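A minimal sketch of that layout in plain h5py, with hypothetical file names, dataset paths, and sample offsets:

```python
import h5py

# "Head" file linking the yearly archives via ExternalLinks.
with h5py.File("head.h5", "w") as head:
    for year in range(2008, 2017):
        head["waveforms_%d" % year] = h5py.ExternalLink(
            "archive_%d.h5" % year, "/Waveforms")

# Precomputed lookup address: (path through the head file,
# start sample, end sample). Values here are made up.
lookup = {"event_000001": ("waveforms_2010/XX.STA1/HHZ", 123456, 129456)}

with h5py.File("head.h5", "r") as head:
    path, first, last = lookup["event_000001"]
    window = head[path][first:last]  # one direct read per event window
```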

The HPCC didn't give me trouble when I correlated my dataset using this configuration and >1000 CPUs, and the wall time was manageable (~1 week for ~30,000,000 event pairs).

NOTE: Masked trace I/O was not a critical component of this method, but it meant that I could store continuous waveforms (with gaps) without ballooning the number of Datasets in my ASDF file when encountering gappy data.
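(For illustration, assuming a hypothetical gappy miniSEED file, this is how such masked traces arise in ObsPy:)

```python
import obspy

st = obspy.read("gappy_day.mseed")  # hypothetical file containing gaps
st.merge(method=0)  # no fill_value, so gaps become masked samples
# Each channel is now a single Trace whose .data is a numpy masked array.
```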
