
Commit aef4f0c

split out troubleshooting guide
1 parent 2aa3e00 commit aef4f0c

File tree

6 files changed (+279, -249 lines)


R/pool.r

+2
@@ -5,6 +5,8 @@ loadModule("cmq_master", TRUE) # CMQMaster C++ class
 #' Provides the basic functions needed to communicate between machines
 #' This should abstract most functions of rZMQ so the scheduler
 #' implementations can rely on the higher level functionality
+#'
+#' @keywords internal
 Pool = R6::R6Class("Pool",
     public = list(
         initialize = function(addr=sample(host()), reuse=TRUE) {

README.md

+5 -6
@@ -53,8 +53,7 @@ remotes::install_github('mschubert/clustermq')
 
 > [!TIP]
 > For installation problems, see the
-> [Troubleshooting](https://mschubert.github.io/clustermq/articles/userguide.html#trouble-install)
-> section of the User Guide
+> [Troubleshooting guide](https://mschubert.github.io/clustermq/articles/troubleshooting.html#install)
 
 Schedulers
 ----------
@@ -77,8 +76,8 @@ schedulers](https://mschubert.github.io/clustermq/articles/userguide.html#config
 *needs* `options(clustermq.scheduler="ssh", clustermq.ssh.host=<yourhost>)`
 
 > [!TIP]
-> You may need to [adjust the default templates](https://mschubert.github.io/clustermq/articles/userguide.html#configuration)
-> for the scheduler interface to work properly
+> Follow the links above to configure your scheduler in case it is not working
+> out of the box
 
 Usage
 -----
@@ -180,8 +179,8 @@ sizes. These include, but are not limited to:
 
 > [!TIP]
 > For any questions and issues, please check the
-> [User Guide](https://mschubert.github.io/clustermq/articles/userguide.html) and in particular the
-> [Troubleshooting section](https://mschubert.github.io/clustermq/articles/userguide.html#troubleshooting) first
+> [User](https://mschubert.github.io/clustermq/articles/userguide.html) and the
+> [Troubleshooting guide](https://mschubert.github.io/clustermq/articles/troubleshooting.html) first
 
 Citation
 --------

_pkgdown.yml

+2 -1
@@ -13,6 +13,8 @@ navbar:
   href: articles/userguide.html
 - text: Technical Documentation
   href: articles/technicaldocs.html
+- text: Troubleshooting
+  href: articles/troubleshooting.html
 - text: Reference
   href: reference/index.html
 - text: Changelog
@@ -32,7 +34,6 @@ reference:
 - title: Manage worker pools
   contents:
     - workers
-    - Pool
 - title: "`foreach` support"
   contents:
     - register_dopar_cmq

man/Pool.Rd

+1
Some generated files are not rendered by default.

vignettes/troubleshooting.Rmd

+262
@@ -0,0 +1,262 @@
---
title: "Troubleshooting"
output:
  rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Troubleshooting}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{css echo=FALSE}
img {
    border: 0px !important;
    margin: 2em 2em 2em 2em !important;
}
code {
    border: 0px !important;
}
```

```{r echo=FALSE, results="hide"}
knitr::opts_chunk$set(
    cache = FALSE,
    echo = TRUE,
    collapse = TRUE,
    comment = "#>"
)
options(clustermq.scheduler = "local")
suppressPackageStartupMessages(library(clustermq))
```

## Installation errors {#install}

To compile this package a fully C++11 compliant compiler is required. This is
[implicit for CRAN packages](https://www.tidyverse.org/blog/2023/03/cran-checks-compiled-code/)
since `R=3.6.2` and is hence not listed in _SystemRequirements_. If you
encounter an error saying that no matching function call to
`zmq::message_t::message_t(std::string&)` exists, your compiler does not
(fully) support this.

```{sh eval=FALSE}
In file included from CMQMaster.cpp:2:0:
CMQMaster.h: In member function ‘void CMQMaster::proxy_submit_cmd(SEXP, int)’:
CMQMaster.h:146:40: error: no matching function for call to ‘zmq::message_t::message_t(std::string&)’
mp.push_back(zmq::message_t(cur));
```

This happens for instance for old versions of the `gcc` compiler (default on
most Linux distributions). You can check your version in the terminal using:

```{sh eval=FALSE}
# the minimum required gcc version is 5.5 for full C++11 support (3.3 for clang)
cc --version
```

In this case, it is _very_ likely that your HPC system already has a newer
compiler installed that you need to add to your `$PATH` or load as a module.
Once this is set, you can install the package from R *that was started in a
terminal that has this module/path active*.
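
For example, on many clusters a newer compiler is available through the module
system; a minimal sketch (the module name is a placeholder, check `module avail`
on your system):

```{sh eval=FALSE}
# placeholder module name: use whatever newer gcc your cluster provides
module load gcc/11.2.0
gcc --version   # should now report >= 5.5

# install from the same terminal so R picks up the newer compiler
R --vanilla -e 'install.packages("clustermq")'
```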

## Job submission fails with template error {#template}

If you test your setup with a simple `Q` call, but get an error like the
following:

```{r eval=FALSE}
> clustermq::Q(identity, x=1, n_jobs=1)
Submitting 1 worker jobs (ID: cmq6053) ...
Unable to run job: unknown resource "m_mem_free"
Exiting.

Your filled job submission template was:
"""
#$ -N cmq6053
#$ -j y
#$ -o /dev/null
#$ -cwd
#$ -V
#$ -t 1-1
#$ -pe smp 1
#$ -l m_mem_free=1073741824

ulimit -v $(( 1024 * 4096 ))
CMQ_AUTH=xxxx R --no-save --no-restore -e 'clustermq:::worker("tcp://10.0.0.100:6053")'
"""

see: https://mschubert.github.io/clustermq/articles/userguide.html#trouble-template

Error in initialize(...) : Job submission failed with error code 1
In addition: Warning message:
In system2("qsub", input = filled, stdout = TRUE) :
running command ''qsub' < '/tmp/RtmpdGJOxs/filee7f3011007b'' had status 1
```

This means that your job submission system has not been successfully
auto-detected and requires some configuration, _i.e._ you need to manually
specify what scheduler and template you want to use.

Be sure to know which scheduler your HPC provides, and then see the [manual
scheduler setup](https://mschubert.github.io/clustermq/articles/userguide.html#scheduler-templates)
on how to set it up.
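
As a rough sketch (the scheduler name and template path are placeholders for
your own setup), the manual configuration amounts to setting two options before
calling `Q()`:

```{r eval=FALSE}
# placeholders: pick your scheduler ("sge", "slurm", "lsf", "pbs", "torque", ...)
# and point to the template file you copied and edited
options(
    clustermq.scheduler = "sge",
    clustermq.template = "~/.clustermq/sge.tmpl"
)
```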

You may need to change some scheduler-specific options in the template file for
this to work, like the queue/partition name or how to request memory or
runtime. Often, the error message gives you a good hint on what to change: In
the example above, we are using a version of SGE where we need to use
`mem_free` instead of `m_mem_free` to request memory.

If it is not obvious what to change, consult the scheduler documentation or
your system admins for more information. For the latter, provide them with the
filled template from the error message (as this is an HPC submission template
issue rather than a `clustermq` package issue).

## Session gets stuck at "Running calculations" {#stuck}

Your R session may be stuck at something like the following:

```{r eval=FALSE}
> clustermq::Q(identity, x=42, n_jobs=1)
Submitting 1 worker jobs (ID: cmq8480) ...
Running 1 calculations (5 objs/19.4 Kb common; 1 calls/chunk) ...
```

You will see this every time your jobs are queued but not yet started.
Depending on how busy your HPC is, this may take a long time. You can check the
queueing status of your jobs in the terminal with _e.g._ `qstat` (SGE), `bjobs`
(LSF), or `squeue` (SLURM).

If your jobs are already finished, this likely means that the `clustermq`
workers cannot connect to the main session. You can confirm this by passing
[`log_worker=TRUE`](https://mschubert.github.io/clustermq/articles/userguide.html#debugging-workers)
to `Q` and inspecting the logs created in your current working directory.
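
For example (a sketch using the same toy call as above):

```{r eval=FALSE}
# creates one log file per worker in the current working directory
clustermq::Q(identity, x=42, n_jobs=1, log_worker=TRUE)
```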

If the logs state something like:

```{sh eval=FALSE}
> clustermq:::worker("tcp://my.headnode:9091")
2023-12-11 10:22:58.485529 | Master: tcp://my.headnode:9091
2023-12-11 10:22:58.488892 | connecting to: tcp://my.headnode:9091:
Error: Connection failed after 10016 ms
Execution halted
```

the submitted job is indeed unable to establish a network connection with the
head node. This can happen if your HPC does not allow incoming connections, but
more likely happens because there are multiple network interfaces, only some of
which have access to the head node.

You can list the available network interfaces using the `ifconfig` command in
the terminal. Find the interface that shares a subnetwork with the head node
and add the [R option](https://mschubert.github.io/clustermq/articles/userguide.html#options) `clustermq.host=<interface>`. If this is
unclear, contact your system administrators to see which interface to use.
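
For instance, if `ifconfig` shows that an interface named `ib0` can reach the
head node (the interface name here is only an example), you would set:

```{r eval=FALSE}
# example interface name: replace with the one that shares a subnet with the head node
options(clustermq.host = "ib0")
```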

## SSH not working {#ssh}

Before trying remote schedulers via SSH, make sure that the scheduler works
when you first connect to the cluster and run a job from there.

If the terminal is stuck at

```
Connecting <user@host> via SSH ...
```

make sure that each step of your SSH connection works by typing the following
commands in your **local** terminal, and check that you don't get errors or
warnings at any step:

```{sh eval=FALSE}
# test your ssh login that you set up in ~/.ssh/config
# if this fails you have not set up SSH correctly
ssh <user@host>

# test port forwarding from 54709 remote to 6687 local (ports are random)
# if this fails you will not be able to use clustermq via SSH
ssh -R 54709:localhost:6687 <user@host> R --vanilla
```

If you get a `Command not found: R` error, make sure your `$PATH` is set up
correctly in your `~/.bash_profile` and/or your `~/.bashrc` (depending on your
cluster config you might need either). You may also need to modify your [SSH
template](https://mschubert.github.io/clustermq/articles/userguide.html#ssh-template)
to load R as a module or conda environment.
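
A quick check is whether a non-interactive SSH command can find R (a sketch;
how R is made available, e.g. via a module, depends on your cluster):

```{sh eval=FALSE}
# should print the path to the R binary; if it does not, adjust your
# ~/.bashrc / ~/.bash_profile (e.g. extend $PATH or load the R module there)
ssh <user@host> 'which R'
```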

If you get an SSH warning or error, try again with `ssh -v` to enable verbose
output. If the forward itself works, run the following in your local R session
(ideally also in command-line R, [not only in
RStudio](https://github.com/mschubert/clustermq/issues/206)):

```{r eval=FALSE}
options(clustermq.scheduler = "ssh",
        clustermq.ssh.log = "~/ssh_proxy.log")
Q(identity, x=1, n_jobs=1)
```

This will create a log file *on the remote server* that will contain any errors
that might have occurred during `ssh_proxy` startup.

If the `ssh_proxy` startup fails on your local machine with the error

```
Remote R process did not respond after 5 seconds. Check your SSH server log.
```

but the server log does not show any errors, then you can try increasing the
timeout:

```{r eval=FALSE}
options(clustermq.ssh.timeout = 30) # in seconds
```

This can happen when your SSH startup template includes additional steps before
starting R, such as activating a module or conda environment, or having to
confirm the connection via two-factor authentication.

## Running the master inside containers {#master-in-container}

If your master process is inside a container, accessing the HPC scheduler is
more difficult. Containers, including Singularity and Docker, isolate the
processes inside the container from the host. The *R* process will not be able
to submit a job because the scheduler cannot be found.

Note that the HPC node running the master process must be allowed to submit
jobs. Not all HPC systems allow compute nodes to submit jobs. If that is the
case, you may need to run the master process on the login node, and discuss the
issue with your system administrator.

If your container is binary compatible with the host, you may be able to bind
the scheduler executable into the container.

For example, PBS might look something like:

```{sh eval=FALSE}
#PBS directives ...

module load singularity

SINGULARITYENV_APPEND_PATH=/opt/pbs/bin
singularity exec --bind /opt/pbs/bin r_image.sif Rscript master_script.R
```

A working example of binding SLURM into a CentOS 7 container image from a
CentOS 7 host is available at
https://groups.google.com/a/lbl.gov/d/msg/singularity/syLcsIWWzdo/NZvF2Ud2AAAJ

Alternatively, you can create a script that uses SSH to execute the scheduler
on the login node. For this, you will need an SSH client in the container,
[keys set up for password-less login](https://www.digitalocean.com/community/tutorials/how-to-configure-ssh-key-based-authentication-on-a-linux-server),
and a script that calls the scheduler on the login node via ssh (e.g.
`~/bin/qsub` for SGE/PBS/Torque, `bsub` for LSF and `sbatch` for Slurm):

```{sh eval=FALSE}
#!/bin/bash
ssh -i ~/.ssh/<your key file> ${PBS_O_HOST:-"no_host_not_in_a_pbs_job"} qsub "$@"
```
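
A hypothetical Slurm analog (saved as `~/bin/sbatch`; the fallback login node
name is a placeholder) could look like:

```{sh eval=FALSE}
#!/bin/bash
# SLURM_SUBMIT_HOST is only set inside Slurm jobs; otherwise fall back to a
# login node of your cluster (placeholder name below)
ssh -i ~/.ssh/<your key file> ${SLURM_SUBMIT_HOST:-login01} sbatch "$@"
```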

Make sure the script is executable, and bind/copy it into the container
somewhere on `$PATH`. Home directories are bound in by default in Singularity.

```{sh eval=FALSE}
chmod u+x ~/bin/qsub
SINGULARITYENV_APPEND_PATH=~/bin
```
