
WFCatalog metadata dependency #23

Open
jbienkowski opened this issue Feb 26, 2021 · 8 comments

@jbienkowski
Member

jbienkowski commented Feb 26, 2021

Currently, WFCatalog does not depend on station metadata: it calculates metrics for acquired data even if some channels are not defined in StationXML. In those cases users can retrieve the metrics, but cannot download the data itself via the FDSNWS-Dataselect web service, which strongly depends on metadata.

Possible solutions:

  1. Exclude channels that are not defined in StationXML from WFCatalog collector processing
  2. Calculate the metrics anyway, but apply filtering on the web service side
  3. Cross-check against metadata in the downstream product (apparently the original approach)
  4. Add a metadata query parameter with default value true to the WFCatalog implementation, which would still allow retrieval of all available metrics
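Option 4 could be sketched roughly as follows. This is a hypothetical illustration, not the actual WFCatalog API: `filter_metrics`, the `"nslc"` key, and the `known_channels` set are all made-up names.

```python
# Hypothetical sketch of option 4: a "metadata" query parameter
# (default: true) that controls whether metrics for channels missing
# from StationXML are filtered out of the response.

def filter_metrics(metrics, known_channels, metadata=True):
    """Return all metrics, or only those whose channel exists in StationXML.

    metrics        -- list of dicts with an "nslc" key, e.g. "NL.HGN.02.BHZ"
    known_channels -- set of NSLC codes defined in StationXML
    metadata       -- if True (the default), hide channels without metadata
    """
    if not metadata:
        return metrics
    return [m for m in metrics if m["nslc"] in known_channels]
```

With `metadata=false` a user would still be able to retrieve all available metrics, while the default keeps the service consistent with FDSNWS-Dataselect.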
@damb
Contributor

damb commented Feb 26, 2021

@jbienkowski, StationXML metadata validation may be optionally enabled or disabled, so it is up to the user of the software whether waveform catalog metadata is served or not. If the validation configuration changes, the data needs to be reprocessed.

Question: Is reprocessing a crucial issue, i.e. performance-wise? The workflow could be something like:

  • check if waveform metadata is available
  • if not, then generate the waveform metadata
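The two steps above could be sketched like this; the dict-backed `db` and the `compute_metrics` callback are hypothetical stand-ins for the collector's database lookup and processing step.

```python
# Minimal sketch of the check-then-generate workflow described above.

def ensure_metadata(file, db, compute_metrics):
    """(Re)process a file only when its waveform metadata is missing."""
    # check if waveform metadata is available
    if file in db:
        return db[file], False          # cache hit, no reprocessing
    # if not, then generate the waveform metadata
    db[file] = compute_metrics(file)
    return db[file], True               # freshly generated
```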

Note that this change requires adjusting the collector's delete facilities, too.

Calculate the metrics anyway, but apply filtering on the web service side

I'm not sure I would implement this approach. Imagine a request which queries the entire waveform metadata inventory; filtering then becomes costly. I'm aware that there are service-level configuration parameters such as

```
"MAXIMUM_POST_BYTES": 1e4,
"MAXIMUM_GET_BYTES": 1e4,
"MAXIMUM_SEGMENTS": 50,
```

and

```
"MAXIMUM_BYTES_RETURNED": 10e6,
```

available.

@damb
Contributor

damb commented Feb 26, 2021

@jbienkowski, have you already seen this

```
"FILTERS": {
  "WHITE": ["*"],
  "BLACK": []
}
```

configuration option, which enables file-based filtering while collecting?

See also

```python
def _filterFiles(self):
    """
    WFCatalogCollector._filterFiles
    > Check if we wish to update the documents
    > If not updating, documents that already exist in the database are skipped
    > Documents are identified by filename
    """
    # Validate the white and black list
    self._validateFilters()
    self.files = [f for f in self.files if self._passFilter(os.path.basename(f))]

    # Return immediately if deleting
    if self.args['delete']:
        return

    # Get the new files from the directory that are not in the database
    new_files = [file for file in self.files if self._isNewDocument(file)]
    self.log.info("Discovered %d new file(s) for processing" % (len(new_files)))

    # If we are updating, remove old documents and add changed documents to the process list
    if self.args['update']:
        changed_files = self._getChangedFiles()
        if self.args['force']:
            self.log.info("Forcing update of %d file(s) in database" % (len(changed_files)))
        else:
            self.log.info("Discovered %d file(s) with changed checksum in database" % (len(changed_files)))
    else:
        changed_files = []

    # Files to process is new + changed (when updating)
    self.files = set(new_files + changed_files)
    self.log.info("Begin processing of %d file(s)" % (len(self.files)))
    self.totalFiles = len(self.files)
    if self.totalFiles == 0:
        self.log.info("No files for processing: doing nothing.")
        sys.exit(0)
```
and
```python
def _passFilter(self, filename):
    """
    WFCatalogCollector._passFilter
    > Checks if filename matches a white/black list
    > The blacklist has precedence over the whitelist
    """
    for white_filter in CONFIG['FILTERS']['WHITE']:
        # Match in the whitelist
        if fnmatch.fnmatch(filename, white_filter):
            # Check if overruled by blacklist
            for black_filter in CONFIG['FILTERS']['BLACK']:
                # Overruled, file is blacklisted
                if fnmatch.fnmatch(filename, black_filter):
                    return False
            # Not overruled, file is whitelisted
            return True
    # Default to false, not whitelisted so ignore
    return False
```

@Jollyfant
Collaborator

We originally decided not to restrict processing to only the channels present in the metadata, because a lot of nodes wanted to have their full archive processed, not just what is exposed through FDSNWS. I guess updating the white list is too much manual labor. It is probably better to add another option: add the FDSNWS response [net.sta.loc.cha] to a hashmap and do a lookup on whether to skip or not.
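The hashmap lookup could look roughly like this: parse an FDSNWS-Station text-format response (`format=text&level=channel`) into a set, then test candidate channels with an O(1) membership lookup. The response body below is an illustrative stub, not real inventory.

```python
# Sketch of the suggested lookup against an FDSNWS-Station text response.
response = """#Network|Station|Location|Channel|Latitude|Longitude
NL|HGN|02|BHZ|50.76|5.93
NL|HGN|02|BHN|50.76|5.93"""

known = set()
for line in response.splitlines():
    if line.startswith("#"):  # skip the header row
        continue
    net, sta, loc, cha = line.split("|")[:4]
    known.add(f"{net}.{sta}.{loc}.{cha}")

def should_skip(nslc):
    """True if the channel is not exposed through FDSNWS."""
    return nslc not in known
```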

@jschaeff
Contributor

I would be in favor of processing everything and filtering the output.
This is the same strategy as for data management.

This way, as soon as the metadata is available, wfcatalog can spit all the information out, and there is no need to start looking for all the data to index each time some metadata is submitted. Of course, it should be implemented without making an fdsnws-station call for every wfcatalog request.

@damb
Contributor

damb commented Mar 27, 2021

Of course, it should be implemented without making an fdsnws-station call for every wfcatalog request.

@jschaeff, I get your points. However, this approach implies:

  • Modifying the currently used DB schema, i.e. introducing a field to store the restrictedStatus StationXML attribute. As a consequence, this attribute is stored redundantly.
  • Keeping track of changes to the introduced restrictedStatus attribute. Note that it might change later, basically at any point in time.
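The two implications above could be illustrated as follows; the document shape and field names are hypothetical, not the actual WFCatalog schema.

```python
# Illustration of the schema change: a (hypothetical) daily document
# gains a redundant "restricted_status" field copied from StationXML,
# which must then be kept in sync when the attribute changes upstream.

doc = {
    "fileId": "NL.HGN.02.BHZ.D.2021.057",
    "net": "NL", "sta": "HGN", "loc": "02", "cha": "BHZ",
    "restricted_status": "open",  # copied from StationXML -> redundant
}

def sync_restricted_status(doc, stationxml_value):
    """Track upstream changes; returns True if the document was updated."""
    if doc["restricted_status"] != stationxml_value:
        doc["restricted_status"] = stationxml_value
        return True
    return False
```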

@jschaeff
Contributor

jschaeff commented Mar 29, 2021

I often hit the wall of not knowing when the metadata changes.
We miss a datestamp on the StationXML format; that would be useful in a lot of use cases.
Or there could be an RSS feed service provided by all EIDA nodes publishing metadata changes. Or a websocket system.
But this is a bit off topic, although it would help wfcatalog keep track of metadata changes.

Could wfcatalog manage a cache of the StationXML metadata for each network it knows about (or just the part it needs)? The cache could be refreshed at an arbitrary frequency or manually.
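Such a cache could be sketched as a small TTL-based wrapper; `fetch_inventory` is a hypothetical stand-in for an fdsnws-station request, and the class name and TTL value are illustrative.

```python
import time

# Sketch of a per-network StationXML inventory cache, refreshed when
# older than a configurable TTL, or manually via refresh().
class InventoryCache:
    def __init__(self, fetch_inventory, ttl_seconds=3600):
        self._fetch = fetch_inventory
        self._ttl = ttl_seconds
        self._entries = {}  # network -> (fetched_at, inventory)

    def get(self, network, now=None):
        now = time.time() if now is None else now
        entry = self._entries.get(network)
        if entry is None or now - entry[0] > self._ttl:
            # stale or missing: refetch and timestamp the entry
            self._entries[network] = (now, self._fetch(network))
        return self._entries[network][1]

    def refresh(self, network):
        # manual refresh, independent of the TTL
        self._entries.pop(network, None)
```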

@damb
Contributor

damb commented Mar 29, 2021

Basicaly, it's about storing a dictionary NSLC:boolean and the°+°

Unfortunately, that's not enough. The restrictedStatus references a ChannelEpoch, such that the startDate and endDate need to be part of the dict key. However, this epoch information might change, too. Besides, it is not strictly defined how the restrictedStatus attribute is inherited by child nodes.
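To make the point concrete, a minimal sketch of an epoch-aware mapping, with illustrative NSLC codes, dates, and status values:

```python
# Why NSLC alone is not enough: restrictedStatus belongs to a channel
# epoch, so the key must include the epoch boundaries.
restricted = {
    # (NSLC, startDate, endDate) -> restrictedStatus
    ("NL.HGN.02.BHZ", "2001-06-06", "2009-01-01"): "open",
    ("NL.HGN.02.BHZ", "2009-01-01", None): "closed",
}

def status_at(nslc, date):
    """Look up the restrictedStatus valid for a channel at a given date."""
    for (code, start, end), value in restricted.items():
        if code == nslc and start <= date and (end is None or date < end):
            return value
    return None  # no matching epoch
```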

We miss a datestamp on the StationXML format; that would be useful in a lot of use cases.

Versioning most probably requires more than just a simple time stamp.

Could wfcatalog manage a cache of the StationXML metadata for each network it knows about (or just the part it needs)? The cache could be refreshed at an arbitrary frequency or manually.

OT: Interestingly, not caching StationXML metadata was a requirement when designing eidaws-federator. So why should it be possible when implementing fdsnws-availability based on the eidaws-wfcatalog backend?

@jschaeff
Contributor

Basicaly, it's about storing a dictionary NSLC:boolean and the°+°

Unfortunately, that's not enough. The restrictedStatus references a ChannelEpoch, such that the startDate and endDate need to be part of the dict key. However, this epoch information might change, too. Besides, it is not strictly defined how the restrictedStatus attribute is inherited by child nodes.

Sorry, GitHub sent my comment with some keyboard shortcut I hit...
Yes, the restriction is valid for a time period. Good point.
