
Corrupted cached band mask #7

Closed
fbradascio opened this issue Dec 18, 2019 · 5 comments
Labels
bug Something isn't working

@fbradascio
Contributor

Describe the bug
Cached band mask seems to be corrupted

To Reproduce
Steps to reproduce the behavior:

  1. Run any analysis on the cluster with a new catalogue before running it locally
  2. It will crash
  3. Run it again locally and you will see the same error

Expected behavior
The band mask should also be produced correctly when running on the cluster

Additional context
ERROR MESSAGE:
    inj = self.mh.get_injector(season)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/minimisation.py", line 273, in get_injector
    self._injectors[season_name] = self.add_injector(self.seasons[season_name], self.sources)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/minimisation.py", line 1004, in add_injector
    return season.make_injector(sources, **self.inj_dict)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/data/__init__.py", line 272, in make_injector
    return MCInjector.create(self, sources, **inj_kwargs)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 201, in create
    return cls.subclasses[inj_name](season, sources, **inj_dict)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 407, in __init__
    MCInjector.__init__(self, season, sources, **kwargs)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 231, in __init__
    self.n_exp = self.calculate_n_exp()
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 430, in calculate_n_exp
    self.n_exp[i]["n_exp"] = self.calculate_n_exp_single(source)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 283, in calculate_n_exp_single
    return np.sum(self.calculate_single_source(source, 1.)["ow"])
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 275, in calculate_single_source
    source_mc, omega, band_mask = self.select_mc_band(source)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 250, in select_mc_band
    band_mask = self.get_band_mask(source, min_dec, max_dec)
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 474, in get_band_mask
    self.load_band_mask(mask_index[0])
  File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 463, in load_band_mask
    self.band_mask_cache = sparse.load_npz(path)
  File "/afs/ifh.de/user/b/bradascf/.local/lib/python3.6/site-packages/scipy/sparse/_matrix_io.py", line 133, in load_npz
    matrix_format = loaded['format']
  File "/afs/ifh.de/user/b/bradascf/.local/lib/python3.6/site-packages/numpy/lib/npyio.py", line 255, in __getitem__
    bytes = self.zip.open(key)
  File "/cvmfs/icecube.opensciencegrid.org/py3-v4/RHEL_7_x86_64/lib/python3.6/zipfile.py", line 1373, in open
    raise BadZipFile("Bad magic number for file header")
zipfile.BadZipFile: Bad magic number for file header
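
For context, this class of failure can be reproduced standalone by corrupting a saved .npz (a minimal sketch with a hypothetical file name, not flarestack code); depending on where the damage lands, the load fails with a BadZipFile or a decompression error:

import zlib
from zipfile import BadZipFile

import numpy as np
from scipy import sparse

# Hypothetical stand-in for flarestack's cached band mask file.
path = "band_mask_cache.npz"
mask = sparse.csr_matrix(np.random.rand(5, 1000) > 0.9)
sparse.save_npz(path, mask)

# Clobber part of the archive, as an interrupted or concurrent write might.
with open(path, "r+b") as f:
    f.seek(30)
    f.write(b"\x00" * 64)

try:
    sparse.load_npz(path)
except (BadZipFile, zlib.error, ValueError, OSError) as err:
    print(f"cache unreadable: {err!r}")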

@robertdstein robertdstein added the bug Something isn't working label Dec 18, 2019
@robertdstein
Member

robertdstein commented Dec 18, 2019

This problem probably arises from many jobs trying to write the same zip file on the cluster, which is bad behaviour. Regardless of the reason, we can make sure that the script at least works when run again locally after the failure on the cluster.

The way to do that would be to wrap the loading of the band mask in a try/except BadZipFile statement, catching this exception and instead re-making the band mask from scratch.

Specifically, this could be done for:

File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 474, in get_band_mask
self.load_band_mask(mask_index[0])

That could be replaced with:

try:
    self.load_band_mask(mask_index[0])
except BadZipFile:
    self.make_injection_band_mask()
    self.load_band_mask(mask_index[0])
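
For illustration, a self-contained version of that load-or-rebuild pattern could look like the sketch below; rebuild_mask() and cache_path are hypothetical stand-ins for make_injection_band_mask() and the real cache path, and the from zipfile import BadZipFile line is the import the snippet above leaves implicit.

from zipfile import BadZipFile

import numpy as np
from scipy import sparse


def rebuild_mask(cache_path):
    # Placeholder computation: in flarestack this would be the real band mask.
    mask = sparse.csr_matrix(np.random.rand(5, 1000) > 0.9)
    sparse.save_npz(cache_path, mask)
    return mask


def load_or_rebuild(cache_path):
    try:
        return sparse.load_npz(cache_path)
    except (FileNotFoundError, BadZipFile):
        # Cache missing or corrupted: regenerate it, then load the fresh copy.
        rebuild_mask(cache_path)
        return sparse.load_npz(cache_path)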

Stopping the error from happening on the cluster in the first place would be an additional and better fix for this specific problem, but this interim change should be easy.

@sathanas31
Contributor

The bug persists, I'm afraid...

To reproduce

  1. Have a catalog with 10 sources and reduce the chunk size to 10, so there is just 1 chunk per catalog
  2. Run 10 jobs on the DESY cluster
  3. Band masks are created in the cache dir but cannot be loaded

Traceback error

File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 513, in get_band_mask
    self.load_band_mask(mask_index[0])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 495, in load_band_mask
    self.band_mask_cache = sparse.load_npz(path)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack_venv/lib/python3.10/site-packages/scipy/sparse/_matrix_io.py", line 144, in load_npz
    return cls((loaded['data'], loaded['indices'], loaded['indptr']), shape=loaded['shape'])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack_venv/lib/python3.10/site-packages/numpy/lib/npyio.py", line 249, in __getitem__
    magic = bytes.read(len(format.MAGIC_PREFIX))
  File "/cvmfs/icecube.opensciencegrid.org/py3-v4.2.1/RHEL_7_x86_64/lib/python3.10/zipfile.py", line 923, in read
    data = self._read1(n)
  File "/cvmfs/icecube.opensciencegrid.org/py3-v4.2.1/RHEL_7_x86_64/lib/python3.10/zipfile.py", line 999, in _read1
    data = self._decompressor.decompress(data, n)
zlib.error: Error -3 while decompressing data: invalid stored block lengths

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/multiprocess_wrapper.py", line 162, in <module>
    run_multiprocess(n_cpu=cfg.n_cpu, mh_dict=mh_dict)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/multiprocess_wrapper.py", line 141, in run_multiprocess
    with MultiProcessor(n_cpu=n_cpu, mh_dict=mh_dict) as r:
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/multiprocess_wrapper.py", line 57, in __init__
    inj = self.mh.get_injector(season)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/minimisation.py", line 310, in get_injector
    self._injectors[season_name] = self.add_injector(
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/minimisation.py", line 1267, in add_injector
    return season.make_injector(sources, **self.inj_dict)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/data/__init__.py", line 322, in make_injector
    return MCInjector.create(self, sources, **inj_kwargs)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 203, in create
    return BaseInjector.subclasses[inj_name](season, sources, **inj_dict)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 427, in __init__
    MCInjector.__init__(self, season, sources, **kwargs)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 240, in __init__
    self.n_exp = self.calculate_n_exp()
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 457, in calculate_n_exp
    self.n_exp[i]["n_exp"] = self.calculate_n_exp_single(source)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 295, in calculate_n_exp_single
    return np.sum(self.calculate_single_source(source, 1.0)["ow"])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 288, in calculate_single_source
    source_mc, omega, band_mask = self.select_mc_band(source)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 261, in select_mc_band
    band_mask = self.get_band_mask(source, min_dec, max_dec)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 517, in get_band_mask
    self.load_band_mask(mask_index[0])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 493, in load_band_mask
    del self.band_mask_cache
AttributeError: band_mask_cache

Expected behaviour
After masks are created by make_injection_band_mask() they can be loaded by load_band_mask() while jobs are running.

Additional info

It goes into the except clause you added, the band masks are created, and I checked that they can be loaded externally, which they can. Not surprisingly, if I run the analysis again for the same catalog, the masks are loaded fine.

It may be that while the jobs are running, masks are written by one job while others try to read them simultaneously, hence the error.
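
If that is the cause, one possible mitigation (sketch only, names are illustrative rather than flarestack API) is to write each mask to a temporary file in the cache directory and atomically rename it into place, so readers never see a half-written .npz:

import os
import tempfile

from scipy import sparse


def save_band_mask_atomically(mask, cache_path):
    cache_dir = os.path.dirname(cache_path) or "."
    # Write to a temporary file on the same filesystem first...
    fd, tmp_path = tempfile.mkstemp(suffix=".npz", dir=cache_dir)
    os.close(fd)
    try:
        sparse.save_npz(tmp_path, mask)
        # ...then atomically swap it into place (atomic on POSIX filesystems).
        os.replace(tmp_path, cache_path)
    except BaseException:
        os.remove(tmp_path)
        raise

On AFS-backed directories the rename may still not be fully atomic across client nodes, so this would only reduce, not necessarily eliminate, the race.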

@mlincett
Collaborator

mlincett commented Sep 1, 2023

Thanks for updating the report @sathanas31. A couple of questions, since I am not too familiar with this part of the code:

Would it be possible to run a minimal number of trials locally before launching the jobs, and would that prevent any further issues when running on the cluster? If so, I think this would be the best workaround for the time being.

I think ultimately we should decouple any creation of cached files from the actual minimization process (see also #247).

@sathanas31
Contributor

Yes, running 1 trial locally just to get the band masks written is the way to do it.
I have a wrapper script that does just that, which we tested with @JannisNe yesterday and it works. I'm streamlining it today, and one thought was to have it in the submitter module as a step prior to running trials on the cluster, e.g. in the submit method:

if self.use_cluster:
    if self.mh_dict["mh_name"] == "large_catalogue":
        # <run script locally to write the band masks>
        self.submit_cluster(mh_dict)
    else:
        self.submit_cluster(mh_dict)

Another thought is to change the submissions to DAGMan workflows: first run the script that only writes the band masks, then run the trials in however many jobs are specified, with everything running on the cluster. This would require changing the HTCondorSubmitter a bit, and it would be a solid fix, but honestly it wouldn't be faster than running the first step locally.
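
For concreteness, the DAGMan variant might boil down to generating something like the following (submit-file names here are hypothetical placeholders, not existing flarestack files):

def write_dag(dag_path, n_trials):
    # One "prepare" job writes the band masks; all trial jobs depend on it.
    lines = ["JOB prepare prepare_band_masks.sub"]
    for i in range(n_trials):
        lines.append(f"JOB trial{i} trial.sub")
        lines.append(f'VARS trial{i} trial_index="{i}"')
    children = " ".join(f"trial{i}" for i in range(n_trials))
    lines.append(f"PARENT prepare CHILD {children}")
    with open(dag_path, "w") as f:
        f.write("\n".join(lines) + "\n")


write_dag("band_mask_trials.dag", n_trials=10)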
Any ideas/suggestions how to move forward with this?

@mlincett
Collaborator

mlincett commented Sep 1, 2023


I prefer to run all the "preparatory" phases locally for the sake of easier control and (easier) debugging, so I like the idea of the Submitter class taking care of this. If we need/want to use a dagman we can change the implementation at a later stage.

As soon as you have a working implementation feel free to submit a PR :)
