
Conversation

@malcolmw

I have added a new data group, /References, for fast access to referenced regions of continuous waveforms. The intended use is to efficiently create segmented waveforms by windowing continuous data, without duplicating the underlying samples.

An example use case: one has a continuous waveform archive and wishes to process event-segmented waveforms based on some event catalog. Re-running the same processing for a different event catalog would typically require re-extracting the waveforms, which duplicates a lot of data and can be time consuming. HDF5 region references let the user simply create a new set of pointers to the relevant data, saving both time and storage.
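As a minimal sketch of the mechanism in plain h5py (the file name, dataset paths, and window indices here are hypothetical):

```python
import h5py
import numpy as np

with h5py.File("archive.h5", "w") as f:
    # One day of 100 Hz continuous data (synthetic, for illustration).
    continuous = f.create_dataset("Waveforms/XX.STA1/HHZ",
                                  data=np.random.randn(8640000))

    # A region reference to one event window; no samples are copied.
    ref = continuous.regionref[360000:366000]
    ref_dtype = h5py.special_dtype(ref=h5py.RegionReference)
    refs = f.create_dataset("References/event_0001", (1,), dtype=ref_dtype)
    refs[0] = ref

with h5py.File("archive.h5", "r") as f:
    ref = f["References/event_0001"][0]
    source = f[ref]       # the dataset the reference points into
    window = source[ref]  # reads only the referenced samples
```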

@coveralls

coveralls commented Apr 27, 2018

Coverage increased (+0.01%) to 89.069% when pulling 50b5a66 on malcolmw:dev/waveform_references into fb88b35 on SeismicData:master.

Resolve Python 2/3 inconsistencies causing tests to fail.


Merge remote-tracking branch 'origin/dev/waveform_references' into dev/waveform_references
@malcolmw force-pushed the dev/waveform_references branch from 706ec22 to 7f64a33 on April 27, 2018 16:15
@malcolmw
Author

malcolmw commented May 22, 2018

Using this branch to merge a few other dev branches. This branch is not ready to be merged into master yet, closing for now.

@malcolmw closed this May 22, 2018
@krischer
Member

Again sorry for taking so long! What's the status of this?

I've read a bit through the code you've written (even though you've closed it) and using region references is a nice idea! One might have to be a bit careful, as ASDF files can be modified: if, for example, the sampling rate of a trace is changed, the reference would have to be deleted, invalidated, or recalculated.

I'm not sure if you've seen this, but pyasdf already contains a function that does something semantically fairly similar, except that it does not use precomputed references but gets everything on the fly. It still only extracts the data samples one cares about and does not read data outside the given start and end time:

http://seismicdata.github.io/pyasdf/asdf_data_set.html#pyasdf.asdf_data_set.ASDFDataSet.get_waveforms and http://seismicdata.github.io/pyasdf/large_continuous_datasets.html
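For reference, a minimal sketch of the on-the-fly approach (the file name, network, station, and tag values are placeholders):

```python
import obspy
import pyasdf

ds = pyasdf.ASDFDataSet("continuous.h5")
# Extracts only the samples between starttime and endtime.
st = ds.get_waveforms(network="XX", station="STA1", location="",
                      channel="HHZ", tag="raw_recording",
                      starttime=obspy.UTCDateTime(2018, 1, 1, 0, 0, 0),
                      endtime=obspy.UTCDateTime(2018, 1, 1, 0, 1, 0))
```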

So a slower (though I don't know how much slower) version of what you did might be achievable by storing a given set of parameters under a name, e.g. instead of associating a region reference with a name, store the parameters of the get_waveforms() method.

This could probably also somehow be integrated into the .ifilter() function, which is a lot more powerful for data-selection purposes: http://seismicdata.github.io/pyasdf/querying_data.html
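A hypothetical sketch of such a query, based on the querying docs (the tag attribute on the yielded objects is an assumption):

```python
import pyasdf

ds = pyasdf.ASDFDataSet("continuous.h5")
# Wildcards and comparisons compose into a single waveform query.
for station in ds.ifilter(ds.q.channel == "*Z",
                          ds.q.starttime >= "2018-01-01"):
    st = station.raw_recording  # stream of the matching waveforms
```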

Let me know what you think!

@malcolmw
Author

malcolmw commented Sep 12, 2018

Hey, Lion,

No worries about the delay! Thanks for circling back to this. I will reply to your comment from PR #49 (Add I/O for masked Traces) here as well, since both pull requests were part of my solution to a single problem.

Objective:

Create a very efficient method to repeatedly extract segmented waveforms from continuous waveforms.

The Short Version:

Having a mechanism to quickly extract waveform segments from continuous waveforms is a very useful feature. To achieve this functionality for my own use case, I developed a method that stored precomputed lookup addresses (Dataset handles and sample-offset indices), but I think it could be made more robust and general, and I would love to discuss it.

The Long Version:

I derived an earthquake catalog containing ~160,000 events, each of which I wanted to correlate with its 200 nearest neighbours for double-difference relocation. I was given code to do this correlation, but it required waveforms to be stored in a single miniSEED file per event. This granularity made the problem scale poorly: when I tried to run hundreds of parallel threads, the High Performance Computing Center (HPCC) quickly informed me that I was overwhelming shared I/O resources. So I turned to ASDF to solve the problem.

First, I tried dumping all of my miniSEED event files into a single ASDF file, but retrieving waveforms was slow because of the high granularity. So I experimented with continuous waveforms, using a few different combinations of storage configurations and access methods.

Using RegionReferences was my first idea, but the problem is that RegionReferences can't be combined with ExternalLinks. I wanted to use ExternalLinks so that I could split my (12.5 TB) dataset into multiple files but maintain a single access point to all my data.

The solution I settled on was to break my 12.5 TB, 9-year dataset into yearly ASDF files, create ExternalLinks to each of these in a single "head" file, and access the event-segmented waveforms using precomputed lookup addresses. Precomputed addresses for the event-segmented waveforms were necessary in my case; hundreds of millions of calls to get_waveforms() would have been too slow.
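A minimal sketch of that layout in plain h5py, with hypothetical file names, dataset paths, and sample offsets:

```python
import h5py

# "Head" file linking the yearly archives via ExternalLinks.
with h5py.File("head.h5", "w") as head:
    for year in range(2008, 2017):
        head["waveforms_%d" % year] = h5py.ExternalLink(
            "archive_%d.h5" % year, "/Waveforms")

# Precomputed lookup address: (path through the head file,
# start sample, end sample). Values here are made up.
lookup = {"event_000001": ("waveforms_2010/XX.STA1/HHZ", 123456, 129456)}

with h5py.File("head.h5", "r") as head:
    path, first, last = lookup["event_000001"]
    window = head[path][first:last]  # one direct read per event window
```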

The HPCC didn't give me trouble when I correlated my dataset using this configuration and >1000 CPUs, and the wall time was manageable (~1 week for ~30,000,000 event pairs).

NOTE: Masked trace I/O was not a critical component of this method, but it meant that I could store continuous waveforms (with gaps) without ballooning the number of Datasets in my ASDF file when encountering gappy data.
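(For illustration, assuming a hypothetical gappy miniSEED file, this is how such masked traces arise in ObsPy:)

```python
import obspy

st = obspy.read("gappy_day.mseed")  # hypothetical file containing gaps
st.merge(method=0)  # no fill_value, so gaps become masked samples
# Each channel is now a single Trace whose .data is a numpy masked array.
```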
