Intermittent timeouts from MSA server #664

Open
rachitk opened this issue Nov 18, 2024 · 17 comments

Comments

@rachitk

rachitk commented Nov 18, 2024

Thank you so much again for making this resource available!

This is basically a duplicate of #646 but updated since the issues are much more intermittent.

Expected Behavior

Consistently receive MSA responses.

Current Behavior

Intermittent timeouts when trying to query the MSA server - sometimes retrying with the same sequence will work.

Steps to Reproduce (for bugs)

When running colabfold_batch, I sometimes get errors like the one below, seemingly at random; other runs complete without any error.

ColabFold Output (for bugs)

Log

2024-11-18 06:09:46,837 Running colabfold 1.5.5
2024-11-18 06:09:47,152 Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA Interpreter
2024-11-18 06:09:47,152 Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
2024-11-18 06:09:50,052 Running on GPU
2024-11-18 06:09:50,823 Found 9 citations for tools or databases
2024-11-18 06:09:50,823 Query 1/1: 8e5ddc8d59f7cf19a6b897cb9437b63768321b969766cd60e4c9b7b5d6d24545 (length 291)
2024-11-18 06:09:51,799 Sleeping for 5s. Reason: PENDING
2024-11-18 06:09:57,537 Sleeping for 5s. Reason: RUNNING
2024-11-18 06:10:03,265 Sleeping for 5s. Reason: RUNNING
2024-11-18 06:10:08,996 Sleeping for 10s. Reason: RUNNING
2024-11-18 06:10:19,734 Sleeping for 9s. Reason: RUNNING
2024-11-18 06:10:29,686 Sleeping for 5s. Reason: RUNNING
2024-11-18 06:10:35,429 Sleeping for 9s. Reason: RUNNING
2024-11-18 06:10:45,134 Sleeping for 8s. Reason: RUNNING
2024-11-18 06:10:53,860 Sleeping for 9s. Reason: RUNNING
2024-11-18 06:11:03,595 Sleeping for 6s. Reason: RUNNING
2024-11-18 06:12:48,912 Error while fetching result from MSA server. Retrying... (1/5)
2024-11-18 06:12:48,913 Error: HTTPSConnectionPool(host='api.colabfold.com', port=443): Read timed out.
2024-11-18 06:13:03,404 Could not get MSA/templates for 8e5ddc8d59f7cf19a6b897cb9437b63768321b969766cd60e4c9b7b5d6d24545: HTTPSConnectionPool(host='api.colabfold.com', port=443): Read timed out.
Traceback (most recent call last):
  File "/usr/local/envs/colabfold/lib/python3.9/site-packages/urllib3/response.py", line 712, in _error_catcher
    yield
  File "/usr/local/envs/colabfold/lib/python3.9/site-packages/urllib3/response.py", line 812, in _raw_read
    data = self._fp_read(amt) if not fp_closed else b""
  File "/usr/local/envs/colabfold/lib/python3.9/site-packages/urllib3/response.py", line 797, in _fp_read
    return self._fp.read(amt) if amt is not None else self._fp.read()
  File "/usr/local/envs/colabfold/lib/python3.9/http/client.py", line 463, in read
    n = self.readinto(b)
  File "/usr/local/envs/colabfold/lib/python3.9/http/client.py", line 497, in readinto
    return self._readinto_chunked(b)
  File "/usr/local/envs/colabfold/lib/python3.9/http/client.py", line 597, in _readinto_chunked
    n = self._safe_readinto(mvb)
  File "/usr/local/envs/colabfold/lib/python3.9/http/client.py", line 642, in _safe_readinto
    n = self.fp.readinto(mvb)
  File "/usr/local/envs/colabfold/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/envs/colabfold/lib/python3.9/ssl.py", line 1275, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/envs/colabfold/lib/python3.9/ssl.py", line 1133, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/envs/colabfold/lib/python3.9/site-packages/colabfold/batch.py", line 1453, in run
    = get_msa_and_templates(jobname, query_sequence, a3m_lines, result_dir, msa_mode, use_templates,
  File "/usr/local/envs/colabfold/lib/python3.9/site-packages/colabfold/batch.py", line 765, in get_msa_and_templates
    a3m_lines_mmseqs2, template_paths = run_mmseqs2(
  File "/usr/local/envs/colabfold/lib/python3.9/site-packages/colabfold/colabfold.py", line 294, in run_mmseqs2
    tar.extractall(path=TMPL_PATH)
  File "/usr/local/envs/colabfold/lib/python3.9/tarfile.py", line 2250, in extractall
    self._extract_one(tarinfo, path, set_attrs=not tarinfo.isdir(),
  File "/usr/local/envs/colabfold/lib/python3.9/tarfile.py", line 2313, in _extract_one
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
  File "/usr/local/envs/colabfold/lib/python3.9/tarfile.py", line 2396, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/usr/local/envs/colabfold/lib/python3.9/tarfile.py", line 2449, in makefile
    copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
  File "/usr/local/envs/colabfold/lib/python3.9/tarfile.py", line 251, in copyfileobj
    buf = src.read(bufsize)
  File "/usr/local/envs/colabfold/lib/python3.9/tarfile.py", line 525, in read
    buf = self._read(size)
  File "/usr/local/envs/colabfold/lib/python3.9/tarfile.py", line 543, in _read
    buf = self.fileobj.read(self.bufsize)
  File "/usr/local/envs/colabfold/lib/python3.9/site-packages/urllib3/response.py", line 877, in read
    data = self._raw_read(amt)
  File "/usr/local/envs/colabfold/lib/python3.9/site-packages/urllib3/response.py", line 833, in _raw_read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/local/envs/colabfold/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/envs/colabfold/lib/python3.9/site-packages/urllib3/response.py", line 717, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.") from e  # type: ignore[arg-type]
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='api.colabfold.com', port=443): Read timed out.
2024-11-18 06:13:03,424 Done

Context

curl https://api.colabfold.com/queue returns

{"queued":0}
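The same queue check can be scripted so it runs before each submission. This is a minimal sketch that assumes the endpoint keeps returning a JSON object with a single `queued` field, as in the one response shown above; the function names and the timeout value are illustrative and not part of ColabFold.

```python
import json
from urllib.request import urlopen

QUEUE_URL = "https://api.colabfold.com/queue"


def parse_queue(body):
    """Extract the queue length from a response body like '{"queued":0}'."""
    return int(json.loads(body)["queued"])


def fetch_queue_length(url=QUEUE_URL, timeout_s=10.0):
    """Query the queue endpoint and return the number of queued jobs.

    The timeout is an arbitrary choice for this sketch.
    """
    with urlopen(url, timeout=timeout_s) as resp:
        return parse_queue(resp.read().decode("utf-8"))
```

An empty queue combined with read timeouts, as reported here, suggests the failures happen while the result is being transferred rather than while waiting for a job slot.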

Your Environment

I am currently running ColabFold locally using the most recent docker container (1.5.5) (https://github.com/sokrypton/ColabFold/wiki/Running-ColabFold-in-Docker) on a cluster system running CentOS7, with a Tesla T4 GPU. The cluster does have access to the internet.
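Because resubmitting the same sequence sometimes succeeds, an outer retry loop is one possible stopgap. ColabFold already retries internally, as the `(1/5)` in the log shows, so the sketch below only illustrates the general pattern of retrying with exponential backoff. The helper name, the default delays, and the exception types are assumptions; in particular, `retry_on` would need to be set to whatever exception classes the HTTP client in use actually raises.

```python
import time


def retry_with_backoff(fetch, attempts=5, base_delay_s=5.0,
                       sleep=time.sleep,
                       retry_on=(TimeoutError, ConnectionError)):
    """Call fetch() until it succeeds or the attempts are used up.

    The wait doubles after each failure: base, 2*base, 4*base, ...
    The last error is re-raised if every attempt fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on as err:
            last_error = err
            if attempt < attempts:
                sleep(base_delay_s * 2 ** (attempt - 1))
    raise last_error
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.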

@LJStewart5

This may be a related issue. I am running ColabFold from my personal Google account.

Here is an example of the log output:

Downloading alphafold2_multimer_v3 weights to .: 100%|██████████| 3.82G/3.82G [00:29<00:00, 138MB/s]
2024-11-20 13:37:52,666 Running on GPU
2024-11-20 13:37:53,084 Found 9 citations for tools or databases
2024-11-20 13:37:53,084 Query 1/1: ELANE_ECP_2noleaderspdb100_d59cd (length 351)
COMPLETE: 100%|██████████| 300/300 [elapsed: 02:04 remaining: 00:00]
2024-11-20 13:39:57,884 Error while fetching result from MSA server. Retrying... (1/5)
2024-11-20 13:39:57,889 Error: HTTPSConnectionPool(host='api.colabfold.com', port=443): Read timed out.
COMPLETE: 100%|██████████| 300/300 [elapsed: 04:35 remaining: 00:00]
2024-11-20 13:45:16,445 Could not get MSA/templates for ELANE_ECP_2noleaderspdb100_d59cd: HTTPSConnectionPool(host='api.colabfold.com', port=443): Read timed out.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 748, in _error_catcher
yield
File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 873, in _raw_read
data = self._fp_read(amt, read1=read1) if not fp_closed else b""
File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 856, in _fp_read
return self._fp.read(amt) if amt is not None else self._fp.read()
File "/usr/lib/python3.10/http/client.py", line 460, in read
return self._read_chunked(amt)
File "/usr/lib/python3.10/http/client.py", line 588, in _read_chunked
value.append(self._safe_read(amt))
File "/usr/lib/python3.10/http/client.py", line 631, in _safe_read
data = self.fp.read(amt)
File "/usr/lib/python3.10/socket.py", line 705, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.10/ssl.py", line 1303, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.10/ssl.py", line 1159, in read
return self._sslobj.read(len, buffer)
TimeoutError: The read operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/content/colabfold/batch.py", line 1465, in run
= get_msa_and_templates(jobname, query_sequence, a3m_lines, result_dir, msa_mode, use_templates,
File "/content/colabfold/batch.py", line 776, in get_msa_and_templates
a3m_lines_mmseqs2, template_paths = run_mmseqs2(
File "/content/colabfold/colabfold.py", line 295, in run_mmseqs2
tar.extractall(path=TMPL_PATH)
File "/usr/lib/python3.10/tarfile.py", line 2286, in extractall
self._extract_one(tarinfo, path, set_attrs=not tarinfo.isdir(),
File "/usr/lib/python3.10/tarfile.py", line 2349, in _extract_one
self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
File "/usr/lib/python3.10/tarfile.py", line 2432, in _extract_member
self.makefile(tarinfo, targetpath)
File "/usr/lib/python3.10/tarfile.py", line 2485, in makefile
copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
File "/usr/lib/python3.10/tarfile.py", line 252, in copyfileobj
buf = src.read(bufsize)
File "/usr/lib/python3.10/tarfile.py", line 526, in read
buf = self._read(size)
File "/usr/lib/python3.10/tarfile.py", line 544, in _read
buf = self.fileobj.read(self.bufsize)
File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 949, in read
data = self._raw_read(amt)
File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 872, in _raw_read
with self._error_catcher():
File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 753, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.") from e # type: ignore[arg-type]
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='api.colabfold.com', port=443): Read timed out.

@milot-mirdita
Collaborator

Can you please try --host-url "https://api-105.colabfold.com"

This is exactly the same MSA Server, only the route to the server is different. Please report back if this is also resulting in issues or not.
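For scripted runs, the `--host-url` option can be set when the command line is assembled. The sketch below only builds the argument list; the helper name and the example paths are hypothetical, and the default shown is the standard `api.colabfold.com` host.

```python
import shlex

DEFAULT_HOST_URL = "https://api.colabfold.com"


def build_colabfold_command(input_path, result_dir, host_url=DEFAULT_HOST_URL):
    """Return the colabfold_batch argument list with an explicit MSA host."""
    return ["colabfold_batch", "--host-url", host_url, input_path, result_dir]


# Hypothetical input file and output directory, for illustration only.
cmd = build_colabfold_command("query.fasta", "results")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the job.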

@rachitk
Author

rachitk commented Nov 20, 2024

Thank you for the help!

I've tried running colabfold_batch with the above host-url; unfortunately, it seems like the same issue is still present (irregular timeouts, seemingly inconsistent even for the same sequence submission).

@LJStewart5

Thanks for sharing. I got a 404 error on https://api-105.colabfold.com. I'm guessing the Colab code needs to be modified to use this server, but I could not find where to set it in the code. I'm not a programmer at all.

@rachitk
Author

rachitk commented Nov 20, 2024

After running on "https://api-105.colabfold.com/" for a bit longer, I'm noticing that the timeout frequency is much higher (~90% of requests fail now, compared to what roughly seems like ~50% for the original server).

@Hidenori-Matsui

Thanks for sharing. I have been experiencing the same problem since last Friday and have still not been able to resolve it. If you have any solutions, I would appreciate it if you could let us know.

2024-11-21 08:57:23,690 Error while fetching result from MSA server. Retrying... (1/5)
2024-11-21 08:57:23,692 Error: Response ended prematurely
2024-11-21 08:59:35,920 Error while fetching result from MSA server. Retrying... (2/5)
2024-11-21 08:59:35,921 Error: HTTPSConnectionPool(host='api.colabfold.com', port=443): Read timed out.
2024-11-21 09:03:12,537 Error while fetching result from MSA server. Retrying... (3/5)
2024-11-21 09:03:12,538 Error: Response ended prematurely

@milot-mirdita
Collaborator

I tried to deploy another workaround. Please let me know if there are still issues with the default api.colabfold.com. Please don't use the api-105 URL above.

@rachitk
Author

rachitk commented Nov 22, 2024

After running a bit more on the default server, it seems to be a bit more consistent now, though I do still have intermittent timeouts/failures. I would say about 80-90% of requests succeed now using the default api.colabfold.com host.

@LJStewart5

I'm not sure this will help. The error below occurs only when I select PDB100 for template_mode. The problem is that it crashes every run I try. As a result, I can't reproduce a PPI prediction that was highly confident before whatever change happened circa Nov 11-20.

IndexError Traceback (most recent call last)
in <cell line: 12>()
10 show_mainchains = False #@param {type:"boolean"}
11
---> 12 tag = results["rank"][0][rank_num - 1]
13 jobname_prefix = ".custom" if msa_mode == "custom" else ""
14 pdb_filename = f"{jobname}/{jobname}{jobname_prefix}unrelaxed{tag}.pdb"

IndexError: list index out of range

@milot-mirdita
Collaborator

The IT team didn't get back to us before the weekend. I hope we can do something about this issue on Monday.

@rachitk
Author

rachitk commented Nov 26, 2024

As an update to this: the server seems to be timing out a lot less often now (no failures in about 10 attempts so far), though it can take a while for it to return results (over 10 minutes in one case). This is definitely preferable to the previous case where it would fail intermittently, though it does seem to be a bit slower than it used to be.

@milot-mirdita
Collaborator

I deployed a new workaround on Sunday. I don't expect any failures anymore. The reduced speed is a bit surprising though. Is it the download speed or job throughput?

I reduced the job "token recovery" to one new job token restored every 100 seconds, compared to 90 seconds before.
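The "token recovery" described here resembles a token-bucket rate limiter. The sketch below is a generic illustration of that idea, not the server's actual implementation: the class name, the bucket capacity, and the exact refill logic are assumptions, and only the 100-second interval comes from the comment above.

```python
class TokenBucket:
    """Each job costs one token; one token is restored per refill interval."""

    def __init__(self, capacity, refill_interval_s, now=0.0):
        self.capacity = capacity                  # assumed value, not from the thread
        self.refill_interval_s = refill_interval_s
        self.tokens = capacity
        self.last_refill = now

    def _refill(self, now):
        # Restore one token for every full interval elapsed since the last refill.
        restored = int((now - self.last_refill) // self.refill_interval_s)
        if restored > 0:
            self.tokens = min(self.capacity, self.tokens + restored)
            self.last_refill += restored * self.refill_interval_s

    def try_acquire(self, now):
        """Return True and spend a token if one is available at time `now`."""
        self._refill(now)
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False


# Hypothetical capacity of 2 with the 100 s recovery interval mentioned above.
bucket = TokenBucket(capacity=2, refill_interval_s=100)
```

Under this model, raising the interval from 90 to 100 seconds lowers the sustained submission rate per user but does not change how long an accepted job takes to run.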

@LJStewart5

Working now! Thanks

@rachitk
Author

rachitk commented Nov 26, 2024

> I deployed a new workaround on Sunday. I don't expect any failures anymore. The reduced speed is a bit surprising though. Is it the download speed or job throughput?
>
> I reduced the job "token recovery" to one new job token restored every 100 seconds, compared to 90 seconds before.

It seems like it's mostly the speed, though I'm not totally sure. Logs seem to imply that a query was sent and that the server is processing for quite a bit of time (though I'm not sure how to interpret the pending/running commands here). This takes longer than the job token being restored, though it may be a combination of factors.

Example below:

2024-11-25 23:37:36,771 Running colabfold 1.5.5
2024-11-25 23:37:37,097 Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA Interpreter
2024-11-25 23:37:37,097 Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
2024-11-25 23:37:40,135 Running on GPU
2024-11-25 23:37:40,956 Found 9 citations for tools or databases
2024-11-25 23:37:40,957 Query 1/1: 59fddda31a2d51d73ea75c4f964d376d23d009d9cad15e4a7fc94419dcc25927 (length 611)
2024-11-25 23:37:41,408 Sleeping for 10s. Reason: PENDING
2024-11-25 23:37:51,846 Sleeping for 9s. Reason: RUNNING

--snipped out many "RUNNING" lines for brevity--

2024-11-25 23:45:31,755 Sleeping for 9s. Reason: RUNNING
2024-11-25 23:46:04,690 Sequence 0 found templates: ['6hbu_A', '7oji_B', '7oj8_A', '8bht_B', '6ffc_A', '8bi0_B', '6feq_A', '6hco_B', '7nez_B', '6vxh_B', '5nj3_B', '7r8e_A', '7r8d_B', '7jr7_B', '7r8c_A', '7oz1_A', '7fdv_A', '7p06_A', '7p06_A', '7p03_A']
2024-11-25 23:46:08,626 Setting max_seq=512, max_extra_seq=5120
2024-11-25 23:48:29,534 alphafold2_ptm_model_1_seed_009 recycle=0 pLDDT=77.7 pTM=0.767

@milot-mirdita
Collaborator

Thanks for the eagle eyes @rachitk
I accidentally reduced the number of CPU threads per job from 2 to 1. This only affects predictions with multiple chains. I'll restore the old behavior tomorrow.

@milot-mirdita
Collaborator

Ah wait, this is a monomer. This shouldn't be affected.

This is indeed surprisingly slow. I would have expected something on the order of ~1 minute to process this (total time spent in RUNNING).
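To separate server-side processing time from download time, the timestamps in a ColabFold log can be compared directly. The sketch below assumes the log line format seen in this thread (`YYYY-MM-DD HH:MM:SS,mmm message`) and treats the span between the first and last "Sleeping" lines as polling time; the function name and this heuristic are illustrative only.

```python
import re
from datetime import datetime

LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (.*)$")
POLL = re.compile(r"Sleeping for \d+s\. Reason: (PENDING|RUNNING)")


def polling_summary(lines):
    """Return (polling_seconds, tail_seconds) for one query, or None.

    polling_seconds: time from the first to the last "Sleeping" line.
    tail_seconds: time from the last "Sleeping" line to the next log line,
    which includes the final sleep and the result download (None if absent).
    """
    first_poll = last_poll = next_event = None
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip snipped or non-timestamped lines
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if POLL.search(match.group(2)):
            if first_poll is None:
                first_poll = ts
            last_poll = ts
            next_event = None
        elif last_poll is not None and next_event is None:
            next_event = ts
    if first_poll is None:
        return None
    polling = (last_poll - first_poll).total_seconds()
    tail = (next_event - last_poll).total_seconds() if next_event else None
    return polling, tail
```

Applied to the excerpt above, this gives roughly 470 seconds of polling and about 33 seconds between the last poll and the template line, which points at job processing rather than download speed.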

@rachitk
Author

rachitk commented Dec 2, 2024

Sorry for the delay in responding - was out for the holidays in the US. I've submitted a few new (monomer) jobs since the last comment here and it seems that none of them have failed and that most of the queries return within 2-3 minutes (usually <1 minute) now. Generally, the performance seems to be back to normal. Happy to run any additional tests if needed, but I think the underlying issues (intermittent failures and later slower responses) have largely been resolved from my tests.

Thank you so much for all the help and for deploying a fix so quickly re: the intermittent failures!
