Running R on synergy

Many of us learn to use R interactively on our laptops, most often with RStudio. However, when a task is long-running or needs a lot of memory or many cores, you may find yourself wondering how to run your code on synergy. This guide shows you how to do just that.

No RStudio

The first thing to ingrain in your mind is that RStudio is not available on synergy, because the point of a compute cluster is to run your jobs in batch rather than interactively. But of course RStudio is not R (you know that, right?), so no problem.

Ok, so you can't use RStudio. Don't panic. Below is a recommended workflow that lets you do much of your R work in RStudio while pushing the heavy lifting to the cluster.

Recommended workflow

Step 1: Do as much as possible on your local machine.

Do everything you can with your normal workflow up to the point of your long-running or resource-hungry task. Once you get to this point, identify the R objects and data files that will be required for the big task. You can save R objects like this:

save(my_dataframe, another_list, file = "data_for_synergy.rda")

Then you'll be able to load these objects in a new R session with:

load("data_for_synergy.rda")

Alternatively, you may choose to write out your data frame as a tab-delimited or CSV file.

Step 2: Create a separate R script for the bit that needs to run on the cluster

Now put the bits you'll need for the big task into a separate R script (with a .R extension). Remember that this script will run in a separate session, so you'll need to include all the library() statements and load any data. Also pay attention to other parameters, such as the number of cores that you'll use on the cluster.

And be sure to save any output for further downstream analysis using the same save function as above.

STRONGLY RECOMMENDED: Debug your script locally with a small version of your data (if possible) to avoid debugging on the cluster, which is much more time-consuming and difficult and can lead to much 😡.

❗ Don't debug on the login node! If something goes wrong or you accidentally use a ton of memory then it's better this happens on your laptop where it won't affect a dozen other users. ❗

Alright, once you've got a working version of your script, you can copy it and any data it needs over to the cluster.

Step 3: Install R and your required packages on synergy with conda

All of the below can be done in your default conda environment but it's generally recommended to use a separate conda environment for each analysis (see the wiki).

Recommended way to install R and R packages

Install conda (of course, you should have done this long ago!). You can now install any R packages you'll need (Bioconductor and CRAN packages are available on conda).

For example:

conda install r-ggplot2
conda install bioconductor-dada2

The version of Bioconductor on bioconda can lag behind a bit (takes a while to rebuild all the packages) so recent and/or development packages may give you a bit of trouble.
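To make the "separate conda environment for each analysis" advice concrete, here is a minimal sketch that records the two example packages in an environment file. The environment name `r-analysis` is made up for illustration, and the channel list is an assumption (R packages from conda-forge, Bioconductor packages from bioconda); adjust both to your own setup.

```shell
# Describe the environment in a file so it can be rebuilt later.
# "r-analysis" is a hypothetical environment name.
cat > environment.yml <<'EOF'
name: r-analysis
channels:
  - conda-forge
  - bioconda
dependencies:
  - r-base
  - r-ggplot2
  - bioconductor-dada2
EOF

# Then, on synergy, build and activate it with:
#   conda env create -f environment.yml
#   conda activate r-analysis
echo "wrote environment.yml"
```

Keeping the file next to your analysis scripts means the same environment can be recreated later or on another machine.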

"Manual" way but not recommended (less reproducible)

If needed (ONLY if needed) you can still install R packages and Bioconductor packages the 'normal' way.

Make sure you have R installed: conda install r-base

After this is installed (or if you already have it installed) double check that your active version of R is the correct one: which R

Start R by running R at the command line. You can now use install.packages() or BiocManager::install() (note that the old biocLite is deprecated in the new version of Bioconductor, and yes, you should be using the latest version).

Step 4: Use `Rscript` in your LSF batch script to run your code

When you install R, it comes bundled with the command-line program Rscript, which is what you'll use to run your carefully crafted script. Just put the Rscript command in your LSF batch file, like so:

<normal batch script stuff goes here, #BSUB lines, etc.>

Rscript my_amazing_code.R
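As a rough sketch of what the "normal batch script stuff" might look like once filled in, the snippet below writes a small LSF job file to disk. The job name, file names, core count and walltime are all placeholder values chosen for illustration, not site recommendations for synergy.

```shell
# Write an example LSF job file. Every #BSUB value below is a placeholder.
cat > run_r_job.lsf <<'EOF'
#!/bin/bash
# Job name (hypothetical)
#BSUB -J my_r_job
# Number of cores to request
#BSUB -n 1
# Walltime limit as hours:minutes
#BSUB -W 2:00
# Standard output and error files; %J expands to the job ID
#BSUB -o my_r_job.%J.out
#BSUB -e my_r_job.%J.err

Rscript my_amazing_code.R
EOF

# Submit it from the login node with:
#   bsub < run_r_job.lsf
echo "wrote run_r_job.lsf"
```

If R lives in a dedicated conda environment, that environment also needs to be active when the job runs; how best to do that depends on how conda is set up in your account.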

Considerations about R and resource allocations

There are three reasons to use the cluster for running R scripts:

You want to increase speed by using more cores.
The task takes too long to run on your laptop.
The data is too large to fit into your laptop's memory.

You may have encountered one or more of these issues when doing your analysis. The important piece here is that you need to PAY ATTENTION to what resources you'll need so you can properly request them from the scheduler. For instance, if you need more cores (and the function you're using is multi-threaded), be sure to request the correct number of cores from the scheduler using the -n parameter and to use the matching value in your R script. Alternatively, your data might be too big for your puny laptop (such data. so big. wow.), so you'll need to request enough memory to run your script; e.g., -R "rusage[mem=64000]" gets you 64GB. And of course, if your job takes a while to run, be sure to set the walltime appropriately.
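One way to keep the core count in the scheduler request and the value used by R from drifting apart is to read it from the job environment instead of typing it twice. This is a minimal sketch, assuming LSF on synergy sets `LSB_DJOB_NUMPROC` (the number of slots granted to the job) in the usual way; `NCORES` is a variable name invented for this example.

```shell
# Inside an LSF job, LSB_DJOB_NUMPROC holds the number of slots granted by -n.
# Outside a job (for example a local test run) it is unset, so default to 1.
NCORES="${LSB_DJOB_NUMPROC:-1}"
export NCORES
echo "cores available to R: ${NCORES}"
```

Placed in the batch script before the Rscript line, this lets the R script pick up the value with as.integer(Sys.getenv("NCORES", "1")) rather than hard-coding it.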

Step 5: Pull your result files back down to your local machine and finish your analysis

So everything ran ok? Alright, now you can copy any output files back to your local machine and finish up your analysis. Load any output files and continue on.

Step 6: Marvel at your newly acquired computing skills

🎆 🎊 🎉 🍰 🍸 🖥

Conclusion

This is a recommended process to get you started using R on synergy. It's only a recommendation - find a workflow that works well for you. Just remember to be thoughtful and courteous about how you use the cluster - it's a shared resource.
