Corrupted cached band mask #7
This problem probably arises from many jobs trying to write the same zip file on the cluster, which is bad behaviour. Regardless of the reason, we can make sure that the script at least works when run again locally after the failure on the cluster. The way to do that would be a try/except BadZipFile statement that catches this exception when loading the band mask and instead re-makes the band mask from scratch. Specifically, this could be done for: File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 474, in get_band_mask That could be replaced with:
Stopping the error on the cluster would be an additional and better fix for this specific problem, but this interim fix should be easy.
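A minimal sketch of the try/except idea described above, not the actual flarestack patch. The names `load_or_rebuild`, `load` and `rebuild` are hypothetical; in this codebase `load` would presumably be `sparse.load_npz` (the call that fails with BadZipFile) and `rebuild` whatever routine makes the band mask from scratch.

```python
import os
import zipfile
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_rebuild(
    path: str,
    load: Callable[[str], T],
    rebuild: Callable[[str], T],
) -> T:
    """Return load(path); if the cached zip file is corrupted, rebuild it.

    `load` and `rebuild` are placeholders (assumptions) for the real
    cache reader and the routine that regenerates the cache.
    """
    try:
        return load(path)
    except zipfile.BadZipFile:
        # The cached file is unreadable: remove it so that it cannot be
        # picked up again, then regenerate the content from scratch.
        if os.path.exists(path):
            os.remove(path)
        return rebuild(path)
```

This only makes a later run recover from a corrupted cache. It does not stop several jobs from writing the same file at the same time.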
The bug persists, I'm afraid...
To reproduce
Traceback error
Expected behaviour
Additional info
It goes into the
It may be that while the jobs are running, masks are written by one job while others try to read them simultaneously, hence the error.
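If the cause is indeed a job reading a mask while another job is still writing it, one common mitigation (not something proposed in this thread) is to write the file under a temporary name and rename it into place only once it is complete. The sketch below assumes this pattern; `atomic_write` and `write` are hypothetical names. `os.replace` is atomic on a local POSIX filesystem, but the guarantees on a network filesystem such as AFS may be weaker.

```python
import os
import tempfile
from typing import Callable


def atomic_write(path: str, write: Callable[[str], None]) -> None:
    """Call write(tmp_path) on a temporary file, then rename it to `path`.

    Readers see either the old complete file or the new complete file,
    never a partially written one (on filesystems with atomic rename).
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Keep the original extension: some writers (e.g. numpy's savez)
    # append one to the file name if it is missing.
    suffix = os.path.splitext(path)[1]
    # Create the temporary file in the same directory so that the final
    # rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=suffix)
    os.close(fd)
    try:
        write(tmp_path)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if writing or renaming failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this helper, two jobs writing the same mask would each produce a complete file and the last rename would win, rather than interleaving their output in one file.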
Thanks for updating the report @sathanas31. A couple of questions, since I am not too familiar with this part of the code: Is it possible to run a minimal number of trials locally before launching the jobs, and would that prevent any further issue when running on the cluster? If so, I think this would be the best workaround for the time being. I think ultimately we should decouple any creation of cached files from the actual minimization process (see also #247).
Yes, running 1 trial locally just to get the band masks written is the way to do it.
Another thought is to change the submits to DAGMans: first run a script that only writes the band masks, then run the trials in however many jobs are specified, with everything running on the cluster. This will require changing a bit the
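A rough sketch of what such a two-stage DAG could look like, assuming HTCondor DAGMan's `JOB` and `PARENT ... CHILD` syntax. The function `make_dag`, the node names and the submit file names are all hypothetical and not part of flarestack.

```python
def make_dag(prepare_submit: str, trial_submit: str, n_trial_jobs: int) -> str:
    """Build a DAGMan description: one preparation node, then N trial nodes.

    The preparation node would write the band masks; the trial nodes only
    start once it has finished, so they should find complete cache files.
    """
    if n_trial_jobs < 1:
        raise ValueError("n_trial_jobs must be at least 1")

    lines = [f"JOB prepare {prepare_submit}"]
    trial_names = [f"trial{i}" for i in range(n_trial_jobs)]
    for name in trial_names:
        lines.append(f"JOB {name} {trial_submit}")
    # All trial nodes depend on the single preparation node.
    lines.append("PARENT prepare CHILD " + " ".join(trial_names))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical submit file names, for illustration only.
    print(make_dag("prepare_masks.sub", "run_trials.sub", 3))
```

The resulting text would be saved to a file and submitted with `condor_submit_dag`.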
I prefer to run all the "preparatory" phases locally for the sake of easier control and (easier) debugging, so I like the idea of the
As soon as you have a working implementation, feel free to submit a PR :)
Describe the bug
Cached band mask seems to be corrupted
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The band mask should also be produced on the cluster.
Additional context
ERROR MESSAGE:
inj = self.mh.get_injector(season)
File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/minimisation.py", line 273, in get_injector
self._injectors[season_name] = self.add_injector(self.seasons[season_name], self.sources)
File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/minimisation.py", line 1004, in add_injector
return season.make_injector(sources, **self.inj_dict)
File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/data/__init__.py", line 272, in make_injector
return MCInjector.create(self, sources, **inj_kwargs)
File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 201, in create
return cls.subclasses[inj_name](season, sources, **inj_dict)
File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 407, in __init__
MCInjector.__init__(self, season, sources, **kwargs)
File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 231, in __init__
self.n_exp = self.calculate_n_exp()
File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 430, in calculate_n_exp
self.n_exp[i]["n_exp"] = self.calculate_n_exp_single(source)
File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 283, in calculate_n_exp_single
return np.sum(self.calculate_single_source(source, 1.)["ow"])
File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 275, in calculate_single_source
source_mc, omega, band_mask = self.select_mc_band(source)
File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 250, in select_mc_band
band_mask = self.get_band_mask(source, min_dec, max_dec)
File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 474, in get_band_mask
self.load_band_mask(mask_index[0])
File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 463, in load_band_mask
self.band_mask_cache = sparse.load_npz(path)
File "/afs/ifh.de/user/b/bradascf/.local/lib/python3.6/site-packages/scipy/sparse/_matrix_io.py", line 133, in load_npz
matrix_format = loaded['format']
File "/afs/ifh.de/user/b/bradascf/.local/lib/python3.6/site-packages/numpy/lib/npyio.py", line 255, in __getitem__
bytes = self.zip.open(key)
File "/cvmfs/icecube.opensciencegrid.org/py3-v4/RHEL_7_x86_64/lib/python3.6/zipfile.py", line 1373, in open
raise BadZipFile("Bad magic number for file header")
zipfile.BadZipFile: Bad magic number for file header