
Corrupted cached band mask #7

Closed
fbradascio opened this issue Dec 18, 2019 · 5 comments
Labels
bug Something isn't working

@fbradascio
Contributor

Describe the bug
Cached band mask seems to be corrupted

To Reproduce
Steps to reproduce the behavior:

  1. Run any analysis on the cluster with a new catalogue before running it locally
  2. It will crash
  3. Run it again locally and you will see the same error

Expected behavior
The band mask should also be produced correctly when running on the cluster

Additional context
ERROR MESSAGE:
    inj = self.mh.get_injector(season)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/minimisation.py", line 273, in get_injector
    self._injectors[season_name] = self.add_injector(self.seasons[season_name], self.sources)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/minimisation.py", line 1004, in add_injector
    return season.make_injector(sources, **self.inj_dict)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/data/__init__.py", line 272, in make_injector
    return MCInjector.create(self, sources, **inj_kwargs)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 201, in create
    return cls.subclasses[inj_name](season, sources, **inj_dict)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 407, in __init__
    MCInjector.__init__(self, season, sources, **kwargs)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 231, in __init__
    self.n_exp = self.calculate_n_exp()
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 430, in calculate_n_exp
    self.n_exp[i]["n_exp"] = self.calculate_n_exp_single(source)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 283, in calculate_n_exp_single
    return np.sum(self.calculate_single_source(source, 1.)["ow"])
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 275, in calculate_single_source
    source_mc, omega, band_mask = self.select_mc_band(source)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 250, in select_mc_band
    band_mask = self.get_band_mask(source, min_dec, max_dec)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 474, in get_band_mask
    self.load_band_mask(mask_index[0])
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 463, in load_band_mask
    self.band_mask_cache = sparse.load_npz(path)
  File "/afs/ifh.de/user/b/bradascf/.local/lib/python3.6/site-packages/scipy/sparse/_matrix_io.py", line 133, in load_npz
    matrix_format = loaded['format']
  File "/afs/ifh.de/user/b/bradascf/.local/lib/python3.6/site-packages/numpy/lib/npyio.py", line 255, in __getitem__
    bytes = self.zip.open(key)
  File "/cvmfs/icecube.opensciencegrid.org/py3-v4/RHEL_7_x86_64/lib/python3.6/zipfile.py", line 1373, in open
    raise BadZipFile("Bad magic number for file header")
zipfile.BadZipFile: Bad magic number for file header
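
For context, this class of failure can be reproduced standalone by corrupting a saved .npz (a minimal sketch with a hypothetical file name, not flarestack code); depending on where the damage lands, the load fails with a BadZipFile or a decompression error:

import zlib
from zipfile import BadZipFile

import numpy as np
from scipy import sparse

# Hypothetical stand-in for flarestack's cached band mask file.
path = "band_mask_cache.npz"
mask = sparse.csr_matrix(np.random.rand(5, 1000) > 0.9)
sparse.save_npz(path, mask)

# Clobber part of the archive, as an interrupted or concurrent write might.
with open(path, "r+b") as f:
    f.seek(30)
    f.write(b"\x00" * 64)

try:
    sparse.load_npz(path)
except (BadZipFile, zlib.error, ValueError, OSError) as err:
    print(f"cache unreadable: {err!r}")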

@robertdstein robertdstein added the bug Something isn't working label Dec 18, 2019
@robertdstein
Member

robertdstein commented Dec 18, 2019

This problem probably arises from many jobs trying to write the same zip file on the cluster, which is bad behaviour. Regardless of the reason, we can make sure that the script at least works when run again locally after the failure on the cluster.

The way to do that would be to wrap the loading of the band mask in a try/except BadZipFile statement, catching this exception and instead re-making the band mask from scratch.

Specifically, this could be done for:

File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 474, in get_band_mask
self.load_band_mask(mask_index[0])

That could be replaced with:

try:
    self.load_band_mask(mask_index[0])
except BadZipFile:
    self.make_injection_band_mask()
    self.load_band_mask(mask_index[0])
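
For illustration, a self-contained version of that load-or-rebuild pattern could look like the sketch below; rebuild_mask() and cache_path are hypothetical stand-ins for make_injection_band_mask() and the real cache path, and the from zipfile import BadZipFile line is the import the snippet above leaves implicit.

from zipfile import BadZipFile

import numpy as np
from scipy import sparse


def rebuild_mask(cache_path):
    # Placeholder computation: in flarestack this would be the real band mask.
    mask = sparse.csr_matrix(np.random.rand(5, 1000) > 0.9)
    sparse.save_npz(cache_path, mask)
    return mask


def load_or_rebuild(cache_path):
    try:
        return sparse.load_npz(cache_path)
    except (FileNotFoundError, BadZipFile):
        # Cache missing or corrupted: regenerate it, then load the fresh copy.
        rebuild_mask(cache_path)
        return sparse.load_npz(cache_path)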

Stopping the error from happening on the cluster in the first place would be an additional and better fix for this specific problem, but this interim change should be easy.

@sathanas31
Contributor

The bug persists, I'm afraid...

To reproduce

  1. Have a catalog with 10 sources and reduce the chunk size to 10, so there is just 1 chunk per catalog
  2. Run 10 jobs on the DESY cluster
  3. Band masks are created in the cache dir but cannot be loaded

Traceback error

File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 513, in get_band_mask
    self.load_band_mask(mask_index[0])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 495, in load_band_mask
    self.band_mask_cache = sparse.load_npz(path)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack_venv/lib/python3.10/site-packages/scipy/sparse/_matrix_io.py", line 144, in load_npz
    return cls((loaded['data'], loaded['indices'], loaded['indptr']), shape=loaded['shape'])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack_venv/lib/python3.10/site-packages/numpy/lib/npyio.py", line 249, in __getitem__
    magic = bytes.read(len(format.MAGIC_PREFIX))
  File "/cvmfs/icecube.opensciencegrid.org/py3-v4.2.1/RHEL_7_x86_64/lib/python3.10/zipfile.py", line 923, in read
    data = self._read1(n)
  File "/cvmfs/icecube.opensciencegrid.org/py3-v4.2.1/RHEL_7_x86_64/lib/python3.10/zipfile.py", line 999, in _read1
    data = self._decompressor.decompress(data, n)
zlib.error: Error -3 while decompressing data: invalid stored block lengths

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/multiprocess_wrapper.py", line 162, in <module>
    run_multiprocess(n_cpu=cfg.n_cpu, mh_dict=mh_dict)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/multiprocess_wrapper.py", line 141, in run_multiprocess
    with MultiProcessor(n_cpu=n_cpu, mh_dict=mh_dict) as r:
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/multiprocess_wrapper.py", line 57, in __init__
    inj = self.mh.get_injector(season)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/minimisation.py", line 310, in get_injector
    self._injectors[season_name] = self.add_injector(
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/minimisation.py", line 1267, in add_injector
    return season.make_injector(sources, **self.inj_dict)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/data/__init__.py", line 322, in make_injector
    return MCInjector.create(self, sources, **inj_kwargs)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 203, in create
    return BaseInjector.subclasses[inj_name](season, sources, **inj_dict)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 427, in __init__
    MCInjector.__init__(self, season, sources, **kwargs)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 240, in __init__
    self.n_exp = self.calculate_n_exp()
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 457, in calculate_n_exp
    self.n_exp[i]["n_exp"] = self.calculate_n_exp_single(source)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 295, in calculate_n_exp_single
    return np.sum(self.calculate_single_source(source, 1.0)["ow"])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 288, in calculate_single_source
    source_mc, omega, band_mask = self.select_mc_band(source)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 261, in select_mc_band
    band_mask = self.get_band_mask(source, min_dec, max_dec)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 517, in get_band_mask
    self.load_band_mask(mask_index[0])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 493, in load_band_mask
    del self.band_mask_cache
AttributeError: band_mask_cache

Expected behaviour
After masks are created by make_injection_band_mask() they can be loaded by load_band_mask() while jobs are running.

Additional info

It goes into the except clause you added, the band masks are created, and I checked that they can be loaded externally, which they can. Not surprisingly, if I run the analysis again for the same catalog, the masks are loaded fine.

It may be that while the jobs are running, masks are written by one job while others try to read them simultaneously, hence the error.
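
If that is the cause, one possible mitigation (sketch only, names are illustrative rather than flarestack API) is to write each mask to a temporary file in the cache directory and atomically rename it into place, so readers never see a half-written .npz:

import os
import tempfile

from scipy import sparse


def save_band_mask_atomically(mask, cache_path):
    cache_dir = os.path.dirname(cache_path) or "."
    # Write to a temporary file on the same filesystem first...
    fd, tmp_path = tempfile.mkstemp(suffix=".npz", dir=cache_dir)
    os.close(fd)
    try:
        sparse.save_npz(tmp_path, mask)
        # ...then atomically swap it into place (atomic on POSIX filesystems).
        os.replace(tmp_path, cache_path)
    except BaseException:
        os.remove(tmp_path)
        raise

On AFS-backed directories the rename may still not be fully atomic across client nodes, so this would only reduce, not necessarily eliminate, the race.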

@mlincett
Collaborator

mlincett commented Sep 1, 2023

Thanks for updating the report @sathanas31. A couple of questions, since I am not too familiar with this part of the code:

Would it be possible to run a minimal number of trials locally before launching the jobs, and would that prevent any further issues when running on the cluster? If so, I think this would be the best workaround for the time being.

I think ultimately we should decouple any creation of cached files from the actual minimization process (see also #247).

@sathanas31
Contributor

Yes, running 1 trial locally just to get the band masks written is the way to do it.
I have a wrapper script that does just that, which we tested with @JannisNe yesterday and it works. I'm streamlining it today, and one thought was to have it in the submitter module as a step prior to running trials on the cluster, e.g. in the submit method:

if self.use_cluster:
    if self.mh_dict["mh_name"] == "large_catalogue":
        # <run script locally to write the band masks>
        self.submit_cluster(mh_dict)
    else:
        self.submit_cluster(mh_dict)

Another thought is to change the submissions to DAGMan workflows: first run the script that only writes the band masks, then run the trials in however many jobs are specified, with everything running on the cluster. This would require changing the HTCondorSubmitter a bit, and it would be a solid fix, but honestly it wouldn't be faster than running the first step locally.
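
For concreteness, the DAGMan variant might boil down to generating something like the following (submit-file names here are hypothetical placeholders, not existing flarestack files):

def write_dag(dag_path, n_trials):
    # One "prepare" job writes the band masks; all trial jobs depend on it.
    lines = ["JOB prepare prepare_band_masks.sub"]
    for i in range(n_trials):
        lines.append(f"JOB trial{i} trial.sub")
        lines.append(f'VARS trial{i} trial_index="{i}"')
    children = " ".join(f"trial{i}" for i in range(n_trials))
    lines.append(f"PARENT prepare CHILD {children}")
    with open(dag_path, "w") as f:
        f.write("\n".join(lines) + "\n")


write_dag("band_mask_trials.dag", n_trials=10)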
Any ideas/suggestions how to move forward with this?

@mlincett
Collaborator

mlincett commented Sep 1, 2023


I prefer to run all the "preparatory" phases locally for the sake of easier control and (easier) debugging, so I like the idea of the Submitter class taking care of this. If we need/want to use a dagman we can change the implementation at a later stage.

As soon as you have a working implementation feel free to submit a PR :)
