ValueError: backend_options should be an instance of coiled.BackendOptions #756

Closed
crusaderky opened this issue Mar 31, 2023 · 15 comments

@crusaderky
Contributor

I'm unable to run any A/B tests on Python 3.10 or 3.11.
coiled.Cluster fails with ValueError: backend_options should be an instance of coiled.BackendOptions.
e.g. https://github.com/coiled/coiled-runtime/actions/runs/4574323397
This only happens on A/B tests, and only on Python 3.10 and 3.11.
Regular PR/overnight tests on 3.10, jupyter notebooks on 3.10, and 3.8/3.9 A/B tests are fine.

Everything uses coiled-0.5.9.

CC @fjetter @shughes-uk @ntabris

@dchudz
Collaborator

dchudz commented Mar 31, 2023

@crusaderky we can improve the error message here, but to immediately tell you what's wrong I'd need some logging or a minimal reproducer

What is the value of **cluster_kwargs["small_cluster"] in the failing example?

@dchudz dchudz self-assigned this Mar 31, 2023
@jrbourbeau
Member

I was seeing this over in #752 yesterday, but now no longer am (at least the latest CI run passed). Was this fixed on the coiled side, or is this just a happy accident?

@dchudz
Collaborator

dchudz commented Apr 1, 2023

I'm not aware of any relevant coiled-side changes.

@crusaderky
Contributor Author

crusaderky commented Apr 8, 2023

These are the kwargs:

  package_sync: true
  wait_for_workers: true
  scheduler_vm_types: [m6i.large]
  backend_options:
    send_prometheus_metrics: true
    spot: true
    spot_on_demand_fallback: true
    multizone: true
  n_workers: 10
  worker_vm_types: [m6i.large]  # 2CPU, 8GiB

@dchudz
Collaborator

dchudz commented Apr 13, 2023

Thanks. We'll try to get to this either way (especially if it recurs).

But a minimal reproducer (small Python code I can run) would have us on it quicker.
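
For reference, a minimal reproducer could look roughly like the sketch below. It is not a confirmed reproduction: it only passes the kwargs posted above to coiled.Cluster, and it assumes the coiled package is installed and authenticated. The names CLUSTER_KWARGS and reproduce are hypothetical, chosen for illustration.

```python
# Sketch only: the kwargs are copied from the YAML posted earlier in this thread.
CLUSTER_KWARGS = {
    "package_sync": True,
    "wait_for_workers": True,
    "scheduler_vm_types": ["m6i.large"],
    "backend_options": {
        "send_prometheus_metrics": True,
        "spot": True,
        "spot_on_demand_fallback": True,
        "multizone": True,
    },
    "n_workers": 10,
    "worker_vm_types": ["m6i.large"],  # 2CPU, 8GiB
}


def reproduce(kwargs=CLUSTER_KWARGS):
    """Try to start a cluster with the given kwargs (hypothetical helper).

    coiled is imported lazily so that the kwargs can be inspected on a
    machine that does not have coiled installed.
    """
    import coiled

    cluster = coiled.Cluster(**kwargs)  # the reported ValueError is raised here
    cluster.close()


if __name__ == "__main__":
    reproduce()
```

Running this in the same environment as the failing A/B job (Python 3.10 or 3.11) would show whether the error depends on the kwargs alone or on something else in that environment.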

@jrbourbeau
Member

@crusaderky have you seen this error recently? If not, thoughts on closing for now? We can always re-open if needed

@crusaderky
Contributor Author

crusaderky commented Apr 13, 2023

@shughes-uk
Contributor

Can you print the backend_options setting in those tests? Really need to understand the shape of the object being passed in
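
One way to capture that shape is to print the fully qualified type of every value in the kwargs, not just its repr. The sketch below uses only the standard library; the helper names qualified_type and describe are hypothetical, and the sample dict is a shortened stand-in for the real kwargs.

```python
def qualified_type(obj):
    """Return the module-qualified class name of obj, e.g. 'builtins.dict'."""
    cls = type(obj)
    return f"{cls.__module__}.{cls.__qualname__}"


def describe(obj, path="kwargs"):
    """Yield (path, type name) for obj and for everything nested inside it."""
    yield path, qualified_type(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from describe(value, f"{path}[{key!r}]")
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            yield from describe(value, f"{path}[{index}]")


# Shortened stand-in for the kwargs passed to coiled.Cluster in the tests.
sample_kwargs = {
    "backend_options": {"spot": True, "multizone": True},
    "n_workers": 10,
    "worker_vm_types": ["m6i.large"],
}

for path, type_name in describe(sample_kwargs):
    print(f"{path}: {type_name}")
```

Printing the module alongside the class name matters here because two classes with the same name from different modules look identical in a plain repr.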

@crusaderky
Contributor Author

These are the kwargs:

  package_sync: true
  wait_for_workers: true
  scheduler_vm_types: [m6i.large]
  backend_options:
    send_prometheus_metrics: true
    spot: true
    spot_on_demand_fallback: true
    multizone: true
  n_workers: 10
  worker_vm_types: [m6i.large]  # 2CPU, 8GiB

@shughes-uk
Contributor

These are the kwargs:

  package_sync: true
  wait_for_workers: true
  scheduler_vm_types: [m6i.large]
  backend_options:
    send_prometheus_metrics: true
    spot: true
    spot_on_demand_fallback: true
    multizone: true
  n_workers: 10
  worker_vm_types: [m6i.large]  # 2CPU, 8GiB

This is YAML; I specifically need some Python type information.

@crusaderky
Contributor Author

crusaderky commented Apr 14, 2023

This is YAML; I specifically need some Python type information.

Here it is:
https://github.com/coiled/benchmarks/actions/runs/4700058695

Works: ubuntu-latest-AB_baseline-1
Fails: ubuntu-latest-AB_py310-1
Fails: ubuntu-latest-AB_py311-1

In each artifact zip you will find

  1. Pickle and YAML dumps of all parameters passed to Coiled. They are dumped immediately before being passed to the Coiled call that fails; e.g.
  • cluster_kwargs.small_cluster.array.pickle
  • cluster_kwargs.small_cluster.array.yaml
>>> pickle.load(open("ubuntu-latest-AB_py310-1/cluster_kwargs.small_cluster.array.pickle", "rb"))
{'name': 'array-8feeff47',
 'environ': {'DASK_COILED__TOKEN': "edit: apologies for leaking this"},
 'tags': {'GITHUB_JOB': 'tests',
  'GITHUB_REF': 'refs/heads/guido/AB_crash',
  'GITHUB_RUN_ATTEMPT': '1',
  'GITHUB_RUN_ID': '4700058695',
  'GITHUB_RUN_NUMBER': '881',
  'GITHUB_SHA': 'b898ebb29464ff45a770f2c6b7f821558e7f1ca6'},
 'package_sync': True,
 'wait_for_workers': True,
 'scheduler_vm_types': ['m6i.large'],
 'backend_options': {'send_prometheus_metrics': True,
  'spot': True,
  'spot_on_demand_fallback': True,
  'multizone': True},
 'n_workers': 10,
 'worker_vm_types': ['m6i.large']}
  2. Output of mamba env export: mamba_env_export.yml
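
A dump like the one above can be produced and read back with the standard pickle module. The sketch below is an illustration, not the code used in the CI job: the helper names are hypothetical, and masking the 'environ' values before writing is an added assumption, motivated by the fact that the environment entry can carry an API token.

```python
import pickle
import tempfile
from pathlib import Path


def redact_environ(kwargs):
    """Return a shallow copy of kwargs with every 'environ' value masked."""
    cleaned = dict(kwargs)
    if "environ" in cleaned:
        cleaned["environ"] = {key: "<redacted>" for key in cleaned["environ"]}
    return cleaned


def dump_kwargs(kwargs, path):
    """Pickle the redacted kwargs to path."""
    with open(path, "wb") as f:
        pickle.dump(redact_environ(kwargs), f)


def load_kwargs(path):
    """Load kwargs previously written by dump_kwargs."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Shortened, made-up stand-in for the real kwargs.
example = {
    "environ": {"DASK_COILED__TOKEN": "not-a-real-token"},
    "backend_options": {"spot": True, "multizone": True},
    "n_workers": 10,
}

with tempfile.TemporaryDirectory() as tmp:
    dump_path = Path(tmp) / "cluster_kwargs.pickle"
    dump_kwargs(example, dump_path)
    loaded = load_kwargs(dump_path)

# Unlike YAML, the pickle preserves the exact Python types.
print(type(loaded["backend_options"]))
```

Only load pickle files from a trusted source, since unpickling can execute arbitrary code.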

[EDIT] apologies for leaking the token. It's still present in the artifacts so I'm afraid it will need to be regenerated.

FYI the dump is downstream of #794 and #795.

@ncclementi
Contributor

@crusaderky just a heads up, with @ntabris we just regenerated and changed the token in the secrets. If this is needed again due to the artifacts let us know.

@ntabris
Member

ntabris commented Apr 14, 2023

I tried those kwargs and it worked fine.

Does the error happen consistently or sporadically?

@crusaderky
Contributor Author

Does the error happen consistently or sporadically?

It's reproducible 100% of the time.

@crusaderky
Contributor Author

Closed by #793
