
WFCatalog metadata dependency #23

Open
jbienkowski opened this issue Feb 26, 2021 · 8 comments

@jbienkowski
Member

jbienkowski commented Feb 26, 2021

Currently, WFCatalog does not depend on station metadata: it calculates metrics for acquired data even if some channels are not defined in StationXML. In those cases users can retrieve the metrics, but cannot download the data itself via the FDSNWS-Dataselect web service, which strongly depends on metadata.

Possible solutions:

  1. Exclude channels that are not defined in StationXML from WFCatalog collector processing
  2. Calculate the metrics anyway, but apply filtering on the web service side
  3. Cross-check against metadata in the downstream product (apparently the original approach)
  4. Add a metadata query parameter with default value true to the WFCatalog implementation, which would still allow retrieval of all available metrics
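Option 4 could be sketched roughly as follows. This is a hypothetical illustration, not the actual WFCatalog API: `filter_metrics`, the `"nslc"` key, and the `known_channels` set are all made-up names.

```python
# Hypothetical sketch of option 4: a "metadata" query parameter
# (default: true) that controls whether metrics for channels missing
# from StationXML are filtered out of the response.

def filter_metrics(metrics, known_channels, metadata=True):
    """Return all metrics, or only those whose channel exists in StationXML.

    metrics        -- list of dicts with an "nslc" key, e.g. "NL.HGN.02.BHZ"
    known_channels -- set of NSLC codes defined in StationXML
    metadata       -- if True (the default), hide channels without metadata
    """
    if not metadata:
        return metrics
    return [m for m in metrics if m["nslc"] in known_channels]
```

With `metadata=false` a user would still be able to retrieve all available metrics, while the default keeps the service consistent with FDSNWS-Dataselect.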
@damb
Contributor

damb commented Feb 26, 2021

@jbienkowski, StationXML metadata validation may be optionally enabled or disabled, so it is up to the user of the software whether waveform catalog metadata is served or not. If the validation configuration changes, the data needs to be reprocessed.

Question: Is reprocessing a crucial issue, i.e. performance-wise? The workflow could be something like:

  • check if waveform metadata is available
  • if not, then generate the waveform metadata
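The two steps above could be sketched like this; the dict-backed `db` and the `compute_metrics` callback are hypothetical stand-ins for the collector's database lookup and processing step.

```python
# Minimal sketch of the check-then-generate workflow described above.

def ensure_metadata(file, db, compute_metrics):
    """(Re)process a file only when its waveform metadata is missing."""
    # check if waveform metadata is available
    if file in db:
        return db[file], False          # cache hit, no reprocessing
    # if not, then generate the waveform metadata
    db[file] = compute_metrics(file)
    return db[file], True               # freshly generated
```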

Note that this change requires adjusting the collector's delete facilities, too.

Calculate the metrics anyway, but apply filtering on the web service side

I'm not sure I would implement this approach. Imagine a request which queries the entire waveform metadata inventory; filtering then becomes costly. I'm aware that there are service-level configuration parameters such as

```
"MAXIMUM_POST_BYTES": 1e4,
"MAXIMUM_GET_BYTES": 1e4,
"MAXIMUM_SEGMENTS": 50,
```

and

```
"MAXIMUM_BYTES_RETURNED": 10e6,
```

available.

@damb
Contributor

damb commented Feb 26, 2021

@jbienkowski, have you already seen this

```
"FILTERS": {
  "WHITE": ["*"],
  "BLACK": []
}
```

configuration option, which enables file-based filtering while collecting?

See also

```python
def _filterFiles(self):
    """
    WFCatalogCollector._filterFiles
    > Check if we wish to update the documents
    > If not updating, documents that already exist in the database are skipped
    > Documents are identified by filename
    """
    # Validate the white and black list
    self._validateFilters()
    self.files = [f for f in self.files if self._passFilter(os.path.basename(f))]

    # Return immediately if deleting
    if self.args['delete']:
        return

    # Get the new files from the directory that are not in the database
    new_files = [file for file in self.files if self._isNewDocument(file)]
    self.log.info("Discovered %d new file(s) for processing" % (len(new_files)))

    # If we are updating, remove old documents and add changed documents to the process list
    if self.args['update']:
        changed_files = self._getChangedFiles()
        if self.args['force']:
            self.log.info("Forcing update of %d file(s) in database" % (len(changed_files)))
        else:
            self.log.info("Discovered %d file(s) with changed checksum in database" % (len(changed_files)))
    else:
        changed_files = []

    # Files to process is new + changed (when updating)
    self.files = set(new_files + changed_files)
    self.log.info("Begin processing of %d file(s)" % (len(self.files)))
    self.totalFiles = len(self.files)
    if self.totalFiles == 0:
        self.log.info("No files for processing: doing nothing.")
        sys.exit(0)
```
and
```python
def _passFilter(self, filename):
    """
    WFCatalogCollector._passFilter
    > Checks if filename matches a white/black list
    > The blacklist has precedence over the whitelist
    """
    for white_filter in CONFIG['FILTERS']['WHITE']:
        # Match in the whitelist
        if fnmatch.fnmatch(filename, white_filter):
            # Check if overruled by blacklist
            for black_filter in CONFIG['FILTERS']['BLACK']:
                # Overruled, file is blacklisted
                if fnmatch.fnmatch(filename, black_filter):
                    return False
            # Not overruled, file is whitelisted
            return True
    # Default to false, not whitelisted so ignore
    return False
```

@Jollyfant
Collaborator

We originally decided not to restrict processing to only the channels present in the metadata, because a lot of nodes wanted to have their full archive processed, not just what is exposed through FDSNWS. I guess updating the white list is too much manual labor. It is probably better to add another option: add the FDSNWS response [net.sta.loc.cha] to a hashmap and do a lookup on whether to skip or not.
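The hashmap lookup could look roughly like this: parse an FDSNWS-Station text-format response (`format=text&level=channel`) into a set, then test candidate channels with an O(1) membership lookup. The response body below is an illustrative stub, not real inventory.

```python
# Sketch of the suggested lookup against an FDSNWS-Station text response.
response = """#Network|Station|Location|Channel|Latitude|Longitude
NL|HGN|02|BHZ|50.76|5.93
NL|HGN|02|BHN|50.76|5.93"""

known = set()
for line in response.splitlines():
    if line.startswith("#"):  # skip the header row
        continue
    net, sta, loc, cha = line.split("|")[:4]
    known.add(f"{net}.{sta}.{loc}.{cha}")

def should_skip(nslc):
    """True if the channel is not exposed through FDSNWS."""
    return nslc not in known
```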

@jschaeff
Contributor

I would be in favor of processing everything and filtering the output.
This is the same strategy as for data management.

This way, as soon as the metadata is available, wfcatalog can spit all the information out, and there is no need to start looking for all the data to index each time some metadata is submitted. Of course, it should be implemented without making an fdsnws-station call for every wfcatalog request.

@damb
Contributor

damb commented Mar 27, 2021

Of course, it should be implemented without making an fdsnws-station call for every wfcatalog request.

@jschaeff, I get your points. However, this approach implies:

  • Modifying the currently used DB schema, i.e. introducing a field to store the restrictedStatus StationXML attribute. As a consequence, this attribute is stored redundantly.
  • Keeping track of changes to the introduced restrictedStatus attribute. Note that it might change later, basically at any point in time.
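The two implications above could be illustrated as follows; the document shape and field names are hypothetical, not the actual WFCatalog schema.

```python
# Illustration of the schema change: a (hypothetical) daily document
# gains a redundant "restricted_status" field copied from StationXML,
# which must then be kept in sync when the attribute changes upstream.

doc = {
    "fileId": "NL.HGN.02.BHZ.D.2021.057",
    "net": "NL", "sta": "HGN", "loc": "02", "cha": "BHZ",
    "restricted_status": "open",  # copied from StationXML -> redundant
}

def sync_restricted_status(doc, stationxml_value):
    """Track upstream changes; returns True if the document was updated."""
    if doc["restricted_status"] != stationxml_value:
        doc["restricted_status"] = stationxml_value
        return True
    return False
```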

@jschaeff
Contributor

jschaeff commented Mar 29, 2021

I often hit the wall of not knowing when the metadata changes.
We miss a datestamp on the StationXML format; that would be useful in a lot of use cases.
Or there could be an RSS feed service provided by all EIDA nodes publishing metadata changes. Or a websocket system.
But this is a bit off topic, although it would help wfcatalog keep track of metadata changes.

Could wfcatalog manage a cache of the StationXML metadata for each network it knows about (or just the part it needs)? The cache could be refreshed at an arbitrary frequency or manually.
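Such a cache could be sketched as a small TTL-based wrapper; `fetch_inventory` is a hypothetical stand-in for an fdsnws-station request, and the class name and TTL value are illustrative.

```python
import time

# Sketch of a per-network StationXML inventory cache, refreshed when
# older than a configurable TTL, or manually via refresh().
class InventoryCache:
    def __init__(self, fetch_inventory, ttl_seconds=3600):
        self._fetch = fetch_inventory
        self._ttl = ttl_seconds
        self._entries = {}  # network -> (fetched_at, inventory)

    def get(self, network, now=None):
        now = time.time() if now is None else now
        entry = self._entries.get(network)
        if entry is None or now - entry[0] > self._ttl:
            # stale or missing: refetch and timestamp the entry
            self._entries[network] = (now, self._fetch(network))
        return self._entries[network][1]

    def refresh(self, network):
        # manual refresh, independent of the TTL
        self._entries.pop(network, None)
```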

@damb
Contributor

damb commented Mar 29, 2021

Basicaly, it's about storing a dictionary NSLC:boolean and the°+°

Unfortunately, that's not enough. The restrictedStatus references a ChannelEpoch, such that the startDate and endDate need to be part of the dict key. However, this epoch information might change, too. Besides, it is not strictly defined how the restrictedStatus attribute is inherited by child nodes.
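To make the point concrete, a minimal sketch of an epoch-aware mapping, with illustrative NSLC codes, dates, and status values:

```python
# Why NSLC alone is not enough: restrictedStatus belongs to a channel
# epoch, so the key must include the epoch boundaries.
restricted = {
    # (NSLC, startDate, endDate) -> restrictedStatus
    ("NL.HGN.02.BHZ", "2001-06-06", "2009-01-01"): "open",
    ("NL.HGN.02.BHZ", "2009-01-01", None): "closed",
}

def status_at(nslc, date):
    """Look up the restrictedStatus valid for a channel at a given date."""
    for (code, start, end), value in restricted.items():
        if code == nslc and start <= date and (end is None or date < end):
            return value
    return None  # no matching epoch
```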

We miss a datestamp on the StationXML format; that would be useful in a lot of use cases.

Versioning most probably requires more than just a simple time stamp.

Could wfcatalog manage a cache of the StationXML metadata for each network it knows about (or just the part it needs)? The cache could be refreshed at an arbitrary frequency or manually.

OT: Interestingly, not caching StationXML metadata was a requirement when designing eidaws-federator. So why should it be possible when implementing fdsnws-availability based on the eidaws-wfcatalog backend?

@jschaeff
Contributor

Basicaly, it's about storing a dictionary NSLC:boolean and the°+°

Unfortunately, that's not enough. The restrictedStatus references a ChannelEpoch, such that the startDate and endDate need to be part of the dict key. However, this epoch information might change, too. Besides, it is not strictly defined how the restrictedStatus attribute is inherited by child nodes.

Sorry, GitHub sent my comment with some keyboard shortcut I hit...
Yes, the restriction is valid for a time period. Good point.
