
Commit aef4f0c

split out troubleshooting guide
1 parent 2aa3e00 commit aef4f0c

File tree

6 files changed (+279, -249 lines)


R/pool.r

+2
@@ -5,6 +5,8 @@ loadModule("cmq_master", TRUE) # CMQMaster C++ class
 #' Provides the basic functions needed to communicate between machines
 #' This should abstract most functions of rZMQ so the scheduler
 #' implementations can rely on the higher level functionality
+#'
+#' @keywords internal
 Pool = R6::R6Class("Pool",
     public = list(
         initialize = function(addr=sample(host()), reuse=TRUE) {

README.md

+5 -6
@@ -53,8 +53,7 @@ remotes::install_github('mschubert/clustermq')
 
 > [!TIP]
 > For installation problems, see the
-> [Troubleshooting](https://mschubert.github.io/clustermq/articles/userguide.html#trouble-install)
-> section of the User Guide
+> [Troubleshooting guide](https://mschubert.github.io/clustermq/articles/troubleshooting.html#install)
 
 Schedulers
 ----------
@@ -77,8 +76,8 @@ schedulers](https://mschubert.github.io/clustermq/articles/userguide.html#config
 *needs* `options(clustermq.scheduler="ssh", clustermq.ssh.host=<yourhost>)`
 
 > [!TIP]
-> You may need to [adjust the default templates](https://mschubert.github.io/clustermq/articles/userguide.html#configuration)
-> for the scheduler interface to work properly
+> Follow the links above to configure your scheduler in case it is not working
+> out of the box
 
 Usage
 -----
@@ -180,8 +179,8 @@ sizes. These include, but are not limited to:
 
 > [!TIP]
 > For any questions and issues, please check the
-> [User Guide](https://mschubert.github.io/clustermq/articles/userguide.html) and in particular the
-> [Troubleshooting section](https://mschubert.github.io/clustermq/articles/userguide.html#troubleshooting) first
+> [User](https://mschubert.github.io/clustermq/articles/userguide.html) and the
+> [Troubleshooting guide](https://mschubert.github.io/clustermq/articles/troubleshooting.html) first
 
 Citation
 --------

_pkgdown.yml

+2 -1
@@ -13,6 +13,8 @@ navbar:
   href: articles/userguide.html
 - text: Technical Documentation
   href: articles/technicaldocs.html
+- text: Troubleshooting
+  href: articles/troubleshooting.html
 - text: Reference
   href: reference/index.html
 - text: Changelog
@@ -32,7 +34,6 @@ reference:
 - title: Manage worker pools
   contents:
     - workers
-    - Pool
 - title: "`foreach` support"
   contents:
     - register_dopar_cmq

man/Pool.Rd

+1
Some generated files are not rendered by default.

vignettes/troubleshooting.Rmd

+262
@@ -0,0 +1,262 @@
---
title: "Troubleshooting"
output:
  rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Troubleshooting}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{css echo=FALSE}
img {
    border: 0px !important;
    margin: 2em 2em 2em 2em !important;
}
code {
    border: 0px !important;
}
```

```{r echo=FALSE, results="hide"}
knitr::opts_chunk$set(
    cache = FALSE,
    echo = TRUE,
    collapse = TRUE,
    comment = "#>"
)
options(clustermq.scheduler = "local")
suppressPackageStartupMessages(library(clustermq))
```

## Installation errors {#install}

To compile this package a fully C++11 compliant compiler is required. This is
[implicit for CRAN packages](https://www.tidyverse.org/blog/2023/03/cran-checks-compiled-code/)
since `R=3.6.2` and is hence not listed in _SystemRequirements_. If you
encounter an error saying that no matching function call to
`zmq::message_t::message_t(std::string&)` exists, your compiler does not
(fully) support this.

```{sh eval=FALSE}
In file included from CMQMaster.cpp:2:0:
CMQMaster.h: In member function ‘void CMQMaster::proxy_submit_cmd(SEXP, int)’:
CMQMaster.h:146:40: error: no matching function for call to ‘zmq::message_t::message_t(std::string&)’
mp.push_back(zmq::message_t(cur));
```

This happens for instance for old versions of the `gcc` compiler (default on
most Linux distributions). You can check your version in the terminal using:

```{sh eval=FALSE}
# the minimum required gcc version is 5.5 for full C++11 support (3.3 for clang)
cc --version
```

In this case, it is _very_ likely that your HPC system already has a newer
compiler installed that you need to add to your `$PATH` or load as a module.
Once this is set, you can install the package from R *that was started in a
terminal that has this module/path active*.
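
For example, on many clusters a newer compiler is available through the module
system; a minimal sketch (the module name is a placeholder, check `module avail`
on your system):

```{sh eval=FALSE}
# placeholder module name: use whatever newer gcc your cluster provides
module load gcc/11.2.0
gcc --version   # should now report >= 5.5

# install from the same terminal so R picks up the newer compiler
R --vanilla -e 'install.packages("clustermq")'
```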

## Job submission fails with template error {#template}

If you test your setup with a simple `Q` call, but get an error like the
following:

```{r eval=FALSE}
> clustermq::Q(identity, x=1, n_jobs=1)
Submitting 1 worker jobs (ID: cmq6053) ...
Unable to run job: unknown resource "m_mem_free"
Exiting.

Your filled job submission template was:
"""
#$ -N cmq6053
#$ -j y
#$ -o /dev/null
#$ -cwd
#$ -V
#$ -t 1-1
#$ -pe smp 1
#$ -l m_mem_free=1073741824

ulimit -v $(( 1024 * 4096 ))
CMQ_AUTH=xxxx R --no-save --no-restore -e 'clustermq:::worker("tcp://10.0.0.100:6053")'
"""

see: https://mschubert.github.io/clustermq/articles/userguide.html#trouble-template

Error in initialize(...) : Job submission failed with error code 1
In addition: Warning message:
In system2("qsub", input = filled, stdout = TRUE) :
running command ''qsub' < '/tmp/RtmpdGJOxs/filee7f3011007b'' had status 1
```

This means that your job submission system has not been successfully
auto-detected and requires some configuration, _i.e._ you need to manually
specify what scheduler and template you want to use.

Be sure to know which scheduler your HPC provides, and then see the [manual
scheduler setup](https://mschubert.github.io/clustermq/articles/userguide.html#scheduler-templates)
on how to set it up.
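
As a rough sketch (the scheduler name and template path are placeholders for
your own setup), the manual configuration amounts to setting two options before
calling `Q()`:

```{r eval=FALSE}
# placeholders: pick your scheduler ("sge", "slurm", "lsf", "pbs", "torque", ...)
# and point to the template file you copied and edited
options(
    clustermq.scheduler = "sge",
    clustermq.template = "~/.clustermq/sge.tmpl"
)
```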

You may need to change some scheduler-specific options in the template file for
this to work, like the queue/partition name or how to request memory or
runtime. Often, the error message gives you a good hint on what to change: In
the example above, we are using a version of SGE where we need to use
`mem_free` instead of `m_mem_free` to request memory.

If it is not obvious what to change, consult the scheduler documentation or
your system admins for more information. For the latter, provide them with the
filled template from the error message (as this is an HPC submission template
issue rather than a `clustermq` package issue).

## Session gets stuck at "Running calculations" {#stuck}

Your R session may be stuck at something like the following:

```{r eval=FALSE}
> clustermq::Q(identity, x=42, n_jobs=1)
Submitting 1 worker jobs (ID: cmq8480) ...
Running 1 calculations (5 objs/19.4 Kb common; 1 calls/chunk) ...
```

You will see this every time your jobs are queued but not yet started.
Depending on how busy your HPC is, this may take a long time. You can check the
queueing status of your jobs in the terminal with _e.g._ `qstat` (SGE), `bjobs`
(LSF), or `squeue` (SLURM).

If your jobs are already finished, this likely means that the `clustermq`
workers cannot connect to the main session. You can confirm this by passing
[`log_worker=TRUE`](https://mschubert.github.io/clustermq/articles/userguide.html#debugging-workers)
to `Q` and inspecting the logs created in your current working directory.
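
For example (a sketch using the same toy call as above):

```{r eval=FALSE}
# creates one log file per worker in the current working directory
clustermq::Q(identity, x=42, n_jobs=1, log_worker=TRUE)
```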

If the logs state something like:

```{sh eval=FALSE}
> clustermq:::worker("tcp://my.headnode:9091")
2023-12-11 10:22:58.485529 | Master: tcp://my.headnode:9091
2023-12-11 10:22:58.488892 | connecting to: tcp://my.headnode:9091:
Error: Connection failed after 10016 ms
Execution halted
```

the submitted job is indeed unable to establish a network connection with the
head node. This can happen if your HPC does not allow incoming connections, but
more likely happens because there are multiple network interfaces, only some of
which have access to the head node.

You can list the available network interfaces using the `ifconfig` command in
the terminal. Find the interface that shares a subnetwork with the head node
and add the [R option](https://mschubert.github.io/clustermq/articles/userguide.html#options) `clustermq.host=<interface>`. If this is
unclear, contact your system administrators to see which interface to use.
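
For instance, if `ifconfig` shows that an interface named `ib0` can reach the
head node (the interface name here is only an example), you would set:

```{r eval=FALSE}
# example interface name: replace with the one that shares a subnet with the head node
options(clustermq.host = "ib0")
```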

## SSH not working {#ssh}

Before trying remote schedulers via SSH, make sure that the scheduler works
when you first connect to the cluster and run a job from there.

If the terminal is stuck at

```
Connecting <user@host> via SSH ...
```

make sure that each step of your SSH connection works by typing the following
commands in your **local** terminal, and check that you don't get errors or
warnings at any step:

```{sh eval=FALSE}
# test your ssh login that you set up in ~/.ssh/config
# if this fails you have not set up SSH correctly
ssh <user@host>

# test port forwarding from 54709 remote to 6687 local (ports are random)
# if this fails you will not be able to use clustermq via SSH
ssh -R 54709:localhost:6687 <user@host> R --vanilla
```

If you get a `Command not found: R` error, make sure your `$PATH` is set up
correctly in your `~/.bash_profile` and/or your `~/.bashrc` (depending on your
cluster config you might need either). You may also need to modify your [SSH
template](https://mschubert.github.io/clustermq/articles/userguide.html#ssh-template)
to load R as a module or conda environment.
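
A quick check is whether a non-interactive SSH command can find R (a sketch;
how R is made available, e.g. via a module, depends on your cluster):

```{sh eval=FALSE}
# should print the path to the R binary; if it does not, adjust your
# ~/.bashrc / ~/.bash_profile (e.g. extend $PATH or load the R module there)
ssh <user@host> 'which R'
```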

If you get an SSH warning or error, try again with `ssh -v` to enable verbose
output. If the forward itself works, run the following in your local R session
(ideally also in command-line R, [not only in
RStudio](https://github.com/mschubert/clustermq/issues/206)):

```{r eval=FALSE}
options(clustermq.scheduler = "ssh",
        clustermq.ssh.log = "~/ssh_proxy.log")
Q(identity, x=1, n_jobs=1)
```

This will create a log file *on the remote server* that will contain any errors
that might have occurred during `ssh_proxy` startup.

If the `ssh_proxy` startup fails on your local machine with the error

```
Remote R process did not respond after 5 seconds. Check your SSH server log.
```

but the server log does not show any errors, then you can try increasing the
timeout:

```{r eval=FALSE}
options(clustermq.ssh.timeout = 30) # in seconds
```

This can happen when your SSH startup template includes additional steps before
starting R, such as activating a module or conda environment, or having to
confirm the connection via two-factor authentication.

## Running the master inside containers {#master-in-container}

If your master process is inside a container, accessing the HPC scheduler is
more difficult. Containers, including Singularity and Docker, isolate the
processes inside the container from the host. The *R* process will not be able
to submit a job because the scheduler cannot be found.

Note that the HPC node running the master process must be allowed to submit
jobs. Not all HPC systems allow compute nodes to submit jobs. If that is the
case, you may need to run the master process on the login node, and discuss the
issue with your system administrator.

If your container is binary compatible with the host, you may be able to bind
the scheduler executable into the container.

For example, PBS might look something like:

```{sh eval=FALSE}
#PBS directives ...

module load singularity

SINGULARITYENV_APPEND_PATH=/opt/pbs/bin
singularity exec --bind /opt/pbs/bin r_image.sif Rscript master_script.R
```

A working example of binding SLURM into a CentOS 7 container image from a
CentOS 7 host is available at
https://groups.google.com/a/lbl.gov/d/msg/singularity/syLcsIWWzdo/NZvF2Ud2AAAJ

Alternatively, you can create a script that uses SSH to execute the scheduler
on the login node. For this, you will need an SSH client in the container,
[keys set up for password-less login](https://www.digitalocean.com/community/tutorials/how-to-configure-ssh-key-based-authentication-on-a-linux-server),
and a script that calls the scheduler on the login node via ssh (e.g.
`~/bin/qsub` for SGE/PBS/Torque, `bsub` for LSF and `sbatch` for Slurm):

```{sh eval=FALSE}
#!/bin/bash
ssh -i ~/.ssh/<your key file> ${PBS_O_HOST:-"no_host_not_in_a_pbs_job"} qsub "$@"
```
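
A hypothetical Slurm analog (saved as `~/bin/sbatch`; the fallback login node
name is a placeholder) could look like:

```{sh eval=FALSE}
#!/bin/bash
# SLURM_SUBMIT_HOST is only set inside Slurm jobs; otherwise fall back to a
# login node of your cluster (placeholder name below)
ssh -i ~/.ssh/<your key file> ${SLURM_SUBMIT_HOST:-login01} sbatch "$@"
```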

Make sure the script is executable, and bind/copy it into the container
somewhere on `$PATH`. Home directories are bound in by default in Singularity.

```{sh eval=FALSE}
chmod u+x ~/bin/qsub
SINGULARITYENV_APPEND_PATH=~/bin
```
