Running R on synergy
Many of us learn how to use R interactively on our laptops, most often along with RStudio. However, when your task is long-running or needs a lot of memory or many cores, you may find yourself wondering how you might run your code on synergy. The purpose of this guide is to show you how to do just that.
The first thing to ingrain into your mind is that RStudio is not available on synergy because the point of a compute cluster is to run your jobs in batch and not interactively. But of course RStudio is not R (you know that, right?) so no problems.
Ok, so you can't use RStudio. Don't panic: the workflow outlined below lets you do much of your R work in RStudio but push the heavy lifting to the cluster.
Do everything you can with your normal workflow up to the point of your long-running or resource-hungry task. Once you get to this point you'll need to identify the R objects and data files that will be required for the big task. You can save R objects like this:
save(my_dataframe, another_list, file = "data_for_synergy.rda")
Then you'll be able to load these objects in a new R session with:
load("data_for_synergy.rda")
Alternatively, you may choose to write out your data frame as a tab-delimited or CSV file.
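Here is a minimal sketch of both options as a round trip. The objects are tiny stand-ins for your real data, the file name my_dataframe.tsv is made up for illustration, and everything is written to a temporary directory so nothing in your project gets overwritten.

```r
# Hypothetical objects standing in for your real data
my_dataframe <- data.frame(sample = c("a", "b", "c"), count = c(10, 20, 30))
another_list <- list(threshold = 0.05, method = "pooled")

# Write to a temporary directory (assumption: you would use your project folder)
out_dir <- tempdir()
rda_file <- file.path(out_dir, "data_for_synergy.rda")
tsv_file <- file.path(out_dir, "my_dataframe.tsv")

# Option 1: save several R objects into a single file
save(my_dataframe, another_list, file = rda_file)

# Option 2: write one data frame as tab-delimited text
write.table(my_dataframe, file = tsv_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Mimic a fresh session: drop the objects, then restore them
rm(my_dataframe, another_list)
restored_names <- load(rda_file)  # load() returns the names of what it restored
from_tsv <- read.delim(tsv_file, stringsAsFactors = FALSE)

print(restored_names)
print(from_tsv)
```

The .rda route keeps R-specific structure (lists, factors, attributes) intact, while the text route is handy if something other than R also needs to read the file.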
Now put the bits you'll need for the big task into a separate R script (with .R extension). Remember this script will be running in a separate session so you'll need to include all the library() statements and you'll need to load any data. Also pay attention to other parameters such as the number of cores that you'll need to use on the cluster.
And be sure to save any output for further downstream analysis using the same save function as above.
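As a rough sketch, a script like this might be laid out as follows. The output file name results_from_synergy.rda, the n_cores variable, and the column-means "task" are all placeholders invented for illustration; the stand-in data is only there so the sketch also runs when the real input file is missing.

```r
# my_amazing_code.R: sketch of a script that runs in a fresh R session

# 1. Load every package the task needs (parallel ships with base R)
library(parallel)

# 2. Parameters: keep this in step with the cores you request on the cluster
n_cores <- 1

# 3. Load the inputs saved from your interactive session
input_file <- "data_for_synergy.rda"
if (file.exists(input_file)) {
  load(input_file)  # expected to restore my_dataframe
} else {
  # Stand-in data so the sketch runs without the real file
  my_dataframe <- data.frame(x = rnorm(100), y = runif(100))
}

# 4. The big task (placeholder: mean of every numeric column)
numeric_cols <- Filter(is.numeric, my_dataframe)
col_means <- mclapply(numeric_cols, mean, mc.cores = n_cores)
big_result <- unlist(col_means)

# 5. Save the output for downstream analysis back on your laptop
save(big_result, file = "results_from_synergy.rda")
```

Keeping the parameters and file names together at the top makes it easier to switch between the small local test data and the full run on the cluster.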
STRONGLY RECOMMENDED: Debug your script locally with a small version of your data (if possible) to avoid debugging on the cluster, which is much more time-consuming and difficult and can lead to much 😡.
❗ Don't debug on the login node! If something goes wrong or you accidentally use a ton of memory then it's better this happens on your laptop where it won't affect a dozen other users. ❗
Alright, once you've got a working version of your script you can copy it and any data needed over to the cluster.
All of the below can be done in your default conda environment but it's generally recommended to use a separate conda environment for each analysis (see the wiki).
Install conda (of course, you should have done this long ago!). You can now install any R packages you'll need (Bioconductor and CRAN packages are available on conda).
For example:
conda install r-ggplot2
conda install bioconductor-dada2
The version of Bioconductor on bioconda can lag behind a bit (takes a while to rebuild all the packages) so recent and/or development packages may give you a bit of trouble.
If needed (ONLY if needed) you can still install R packages and Bioconductor packages the 'normal' way.
Make sure you have R installed: conda install r-base
After this is installed (or if you already have it installed) double check that your active version of R is the correct one: which R
Open R (type R at the prompt) and now you can use install.packages() or BiocManager::install() (note that the old biocLite is deprecated in the new version of Bioconductor, and yes you should be using the latest version).
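Before submitting anything, it can save you a failed job to check that the packages your script loads are actually visible to the R you're running. This is only a sketch: the package list is an example, so swap in whatever your own script calls with library(). It reports what is missing and installs nothing.

```r
# Packages the script will load (example list, replace with your own)
needed <- c("parallel", "ggplot2", "dada2")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
print(installed)

missing_pkgs <- needed[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed in this R: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages found.")
}
```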
When you install R it comes bundled with the command line program Rscript, which is what you'll use to run your carefully crafted script. Just put the Rscript command in your LSF batch file, like so:
<normal batch script stuff goes here, #BSUB lines, etc.>
Rscript my_amazing_code.R
There are three reasons to use the cluster for running R scripts:
- Increase speed by using more cores
- The task takes too long to run on your laptop
- The data is too large to fit into your laptop's memory.
You may have encountered one or more of these issues when doing your analysis. The important piece here is that you need to PAY ATTENTION to what resources you'll need so you can properly request them from the scheduler. For instance, if you need more cores (and the function you're using is multi-threaded), be sure to request the correct number of cores from the scheduler using the -n parameter and to use the same value in your R script. Alternatively, your data might be too big for your puny laptop (such data. so big. wow.), so you'll need to request enough memory to run your script, e.g. -R "rusage[mem=64000]" gets you 64GB. And of course, if your job takes a while to run, be sure to set the walltime appropriately.
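One way to keep the core count in your R script in step with the -n request is to read it from the job environment instead of hard-coding it. The sketch below assumes LSF sets the LSB_DJOB_NUMPROC variable inside a running job (worth confirming on synergy with a quick test job) and falls back to a single core anywhere else, such as your laptop. The slow_square function is a made-up stand-in for real work.

```r
library(parallel)

# Assumption: LSF exports LSB_DJOB_NUMPROC inside a job; default to 1 elsewhere
n_cores <- suppressWarnings(as.integer(Sys.getenv("LSB_DJOB_NUMPROC", unset = "1")))
if (is.na(n_cores) || n_cores < 1) n_cores <- 1L

# mclapply() relies on forking, which is not available on Windows
if (.Platform$OS.type == "windows") n_cores <- 1L

# Hypothetical stand-in for an expensive per-item computation
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

results <- mclapply(1:20, slow_square, mc.cores = n_cores)
total <- sum(unlist(results))

message("Used ", n_cores, " core(s); total = ", total)
```

Written this way, the same script runs unchanged on your laptop for debugging and on the cluster with however many cores the job was given.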
So everything ran ok? Alright, now you can copy any output files back to your local machine and finish up your analysis. Load any output files and continue on.
🎆 🎊 🎉 🍰 🍸 🖥
This is a recommended process to get you started using R on synergy. It's only a recommendation - find a workflow that works well for you. Just remember to be thoughtful and courteous about how you use the cluster - it's a shared resource.