---
title: "Troubleshooting"
output:
  rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Troubleshooting}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{css echo=FALSE}
img {
    border: 0px !important;
    margin: 2em 2em 2em 2em !important;
}
code {
    border: 0px !important;
}
```

```{r echo=FALSE, results="hide"}
knitr::opts_chunk$set(
    cache = FALSE,
    echo = TRUE,
    collapse = TRUE,
    comment = "#>"
)
options(clustermq.scheduler = "local")
suppressPackageStartupMessages(library(clustermq))
```

## Installation errors {#install}

To compile this package, a fully C++11-compliant compiler is required. This is
[implicit for CRAN packages](https://www.tidyverse.org/blog/2023/03/cran-checks-compiled-code/)
since `R=3.6.2` and is hence not listed in _SystemRequirements_. If you
encounter an error saying that there is no matching function for a call to
`zmq::message_t::message_t(std::string&)`, your compiler does not (fully)
support C++11:

```{sh eval=FALSE}
In file included from CMQMaster.cpp:2:0:
CMQMaster.h: In member function ‘void CMQMaster::proxy_submit_cmd(SEXP, int)’:
CMQMaster.h:146:40: error: no matching function for call to ‘zmq::message_t::message_t(std::string&)’
     mp.push_back(zmq::message_t(cur));
```

This happens, for instance, with old versions of the `gcc` compiler (the
default on most Linux distributions). You can check your version in the
terminal using:

```{sh eval=FALSE}
# the minimum required gcc version is 5.5 for full C++11 support (3.3 for clang)
cc --version
```

In this case, it is _very_ likely that your HPC system already has a newer
compiler installed that you need to add to your `$PATH` or load as a module.
Once this is set, you can install the package from an R session *started in a
terminal where this module/path is active*.
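
If your cluster uses environment modules, a minimal sketch of this might look
as follows (the module name `gcc/11` is only an example, check `module avail`
for what your system actually provides):

```{sh eval=FALSE}
# hypothetical module name -- list the available compilers with `module avail`
module load gcc/11

# start R from this same shell so the newer compiler is on $PATH, then install
R -e 'install.packages("clustermq")'
```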

## Job submission fails with template error {#template}

If you test your setup with a simple `Q` call, but get an error like the
following:

```{r eval=FALSE}
> clustermq::Q(identity, x=1, n_jobs=1)
Submitting 1 worker jobs (ID: cmq6053) ...
Unable to run job: unknown resource "m_mem_free"
Exiting.

Your filled job submission template was:
"""
#$ -N cmq6053
#$ -j y
#$ -o /dev/null
#$ -cwd
#$ -V
#$ -t 1-1
#$ -pe smp 1
#$ -l m_mem_free=1073741824

ulimit -v $(( 1024 * 4096 ))
CMQ_AUTH=xxxx R --no-save --no-restore -e 'clustermq:::worker("tcp://10.0.0.100:6053")'
"""

see: https://mschubert.github.io/clustermq/articles/userguide.html#trouble-template

Error in initialize(...) : Job submission failed with error code 1
In addition: Warning message:
In system2("qsub", input = filled, stdout = TRUE) :
  running command ''qsub' < '/tmp/RtmpdGJOxs/filee7f3011007b'' had status 1
```

this means that your job submission system has not been successfully
auto-detected and requires some configuration, _i.e._ you need to manually
specify which scheduler and template you want to use.

Find out which scheduler your HPC provides, and then see the [manual scheduler
setup](https://mschubert.github.io/clustermq/articles/userguide.html#scheduler-templates)
for how to set it up.
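
For example, a minimal sketch of such a manual configuration, set before the
`Q` call (the scheduler name and template path below are placeholders,
substitute the ones that match your system):

```{r eval=FALSE}
# run this (or put it in your ~/.Rprofile) before the first Q() call;
# "sge" and the template path are examples only
options(
    clustermq.scheduler = "sge",                  # e.g. "slurm", "lsf", "sge", "pbs", "torque"
    clustermq.template  = "~/.clustermq.template" # your copied and edited template file
)
```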

You may need to change some scheduler-specific options in the template file
for this to work, like the queue/partition name or how to request memory or
runtime. Often, the error message gives you a good hint about what to change:
in the example above, we are using a version of SGE where we need to request
memory via `mem_free` instead of `m_mem_free`.
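
A hypothetical excerpt of the corresponding change in an SGE template (the
`{{ memory }}` placeholder is filled in by `clustermq` when submitting; your
template may phrase this line differently):

```{sh eval=FALSE}
# replace the "#$ -l m_mem_free=..." line in your template with, e.g.:
#$ -l mem_free={{ memory }}
```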

If it is not obvious what to change, consult the scheduler documentation or
your system admins for more information. For the latter, provide them with the
filled template from the error message (as this is an HPC submission template
issue rather than a `clustermq` package issue).

## Session gets stuck at "Running calculations" {#stuck}

Your R session may be stuck at something like the following:

```{r eval=FALSE}
> clustermq::Q(identity, x=42, n_jobs=1)
Submitting 1 worker jobs (ID: cmq8480) ...
Running 1 calculations (5 objs/19.4 Kb common; 1 calls/chunk) ...
```

You will see this every time your jobs are queued but not yet started.
Depending on how busy your HPC is, this may take a long time. You can check
the queueing status of your jobs in the terminal with _e.g._ `qstat` (SGE),
`bjobs` (LSF), or `squeue` (SLURM).
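
For example (assuming default scheduler configurations; `$USER` is your
cluster user name):

```{sh eval=FALSE}
qstat  -u $USER   # SGE
bjobs  -u $USER   # LSF
squeue -u $USER   # SLURM
```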

If your jobs are already finished, this likely means that the `clustermq`
workers cannot connect to the main session. You can confirm this by passing
[`log_worker=TRUE`](https://mschubert.github.io/clustermq/articles/userguide.html#debugging-workers)
to `Q` and inspecting the logs created in your current working directory.
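
Using the stuck call from above, this might look as follows (only the logging
argument is added):

```{r eval=FALSE}
# writes one log file per worker into the current working directory
clustermq::Q(identity, x=42, n_jobs=1, log_worker=TRUE)
```

If the logs state something like: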

```{sh eval=FALSE}
> clustermq:::worker("tcp://my.headnode:9091")
2023-12-11 10:22:58.485529 | Master: tcp://my.headnode:9091
2023-12-11 10:22:58.488892 | connecting to: tcp://my.headnode:9091:
Error: Connection failed after 10016 ms
Execution halted
```

the submitted job is indeed unable to establish a network connection with the
head node. This can happen if your HPC does not allow incoming connections,
but more likely happens because there are multiple network interfaces, only
some of which have access to the head node.

You can list the available network interfaces using the `ifconfig` command in
the terminal. Find the interface that shares a subnetwork with the head node
and add the [R option](https://mschubert.github.io/clustermq/articles/userguide.html#options)
`clustermq.host=<interface>`. If this is unclear, contact your system
administrators to see which interface to use.
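
For example (a minimal sketch; `ib0` is a placeholder, use the interface name,
or its IP address, that can reach the head node):

```{r eval=FALSE}
# set before calling Q(); the workers will then connect back to this interface
options(clustermq.host = "ib0")
```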

## SSH not working {#ssh}

Before trying remote schedulers via SSH, make sure that the scheduler works
when you first connect to the cluster and run a job from there.

If the terminal is stuck at

```
Connecting <user@host> via SSH ...
```

check that each step of your SSH connection works by typing the following
commands in your **local** terminal, and confirm that none of them produce
errors or warnings:

```{sh eval=FALSE}
# test your ssh login that you set up in ~/.ssh/config
# if this fails you have not set up SSH correctly
ssh <user@host>

# test port forwarding from 54709 remote to 6687 local (ports are random)
# if this fails you will not be able to use clustermq via SSH
ssh -R 54709:localhost:6687 <user@host> R --vanilla
```

If you get a `Command not found: R` error, make sure your `$PATH` is set up
correctly in your `~/.bash_profile` and/or your `~/.bashrc` (depending on your
cluster config you might need either). You may also need to modify your [SSH
template](https://mschubert.github.io/clustermq/articles/userguide.html#ssh-template)
to load R as a module or conda environment.
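
A quick way to check whether a non-interactive remote shell sees R (such shells
may read only `~/.bashrc`, not `~/.bash_profile`, depending on your setup):

```{sh eval=FALSE}
# run from your local terminal; prints the path of R on the remote host,
# or nothing if R is not on the non-interactive $PATH
ssh <user@host> 'which R'
```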

If you get an SSH warning or error, try again with `ssh -v` to enable verbose
output. If the forward itself works, run the following in your local R session
(ideally also in command-line R, [not only in
RStudio](https://github.com/mschubert/clustermq/issues/206)):

```{r eval=FALSE}
options(clustermq.scheduler = "ssh",
        clustermq.ssh.log = "~/ssh_proxy.log")
Q(identity, x=1, n_jobs=1)
```

This will create a log file *on the remote server* that will contain any errors
that might have occurred during `ssh_proxy` startup.
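
To inspect that log, a minimal sketch (the path matches the option set above):

```{sh eval=FALSE}
# run on the remote server (or via ssh from your local machine)
cat ~/ssh_proxy.log
```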

If the `ssh_proxy` startup fails on your local machine with the error

```
Remote R process did not respond after 5 seconds. Check your SSH server log.
```

but the server log does not show any errors, then you can try increasing the
timeout:

```{r eval=FALSE}
options(clustermq.ssh.timeout = 30) # in seconds
```

This can happen when your SSH startup template includes additional steps before
starting R, such as activating a module or conda environment, or having to
confirm the connection via two-factor authentication.

## Running the master inside containers {#master-in-container}

If your master process is inside a container, accessing the HPC scheduler is
more difficult. Containers, including Singularity and Docker, isolate the
processes inside the container from the host. The *R* process will not be able
to submit a job because the scheduler cannot be found.

Note that the HPC node running the master process must be allowed to submit
jobs. Not all HPC systems allow compute nodes to submit jobs. If that is the
case, you may need to run the master process on the login node, and discuss the
issue with your system administrator.

If your container is binary-compatible with the host, you may be able to bind
the scheduler executable into the container.

For example, with PBS this might look something like:

```{sh eval=FALSE}
#PBS directives ...

module load singularity

export SINGULARITYENV_APPEND_PATH=/opt/pbs/bin
singularity exec --bind /opt/pbs/bin r_image.sif Rscript master_script.R
```

A working example of binding SLURM into a CentOS 7 container image from a
CentOS 7 host is available at
https://groups.google.com/a/lbl.gov/d/msg/singularity/syLcsIWWzdo/NZvF2Ud2AAAJ

Alternatively, you can create a script that uses SSH to execute the scheduler
on the login node. For this, you will need an SSH client in the container,
[keys set up for password-less login](https://www.digitalocean.com/community/tutorials/how-to-configure-ssh-key-based-authentication-on-a-linux-server),
and a script that calls the scheduler on the login node via ssh (_e.g._
`~/bin/qsub` for SGE/PBS/Torque, `bsub` for LSF, and `sbatch` for SLURM):

```{sh eval=FALSE}
#!/bin/bash
ssh -i ~/.ssh/<your key file> ${PBS_O_HOST:-"no_host_not_in_a_pbs_job"} qsub "$@"
```

Make sure the script is executable, and bind/copy it into the container
somewhere on `$PATH`. Home directories are bound in by default in Singularity.

```{sh eval=FALSE}
chmod u+x ~/bin/qsub
export SINGULARITYENV_APPEND_PATH=~/bin
```