

Cayuga General Information

  • Cayuga is a private cluster with restricted access to members of cayuga_xxxx projects/groups.
  • Access is restricted to connections from the Weill or Ithaca VPNs.
  • For access to the Cayuga cluster, send an email to [email protected]. Please include "Cayuga" in the subject line.
  • Login node: cayuga-login1.cac.cornell.edu -- access via ssh using public/private keys (see the example after this list)
  • Running Rocky 8.5 and built with OpenHPC 2 and Slurm 20.11.9
  • Cluster networking: EDR Infiniband
  • New users might find the Getting Started on Cayuga information helpful
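
A minimal login example, assuming your private key file is named your_cayuga_key (replace [your_cwid] with your own CWID):

  ssh -i ~/.ssh/your_cayuga_key [your_cwid]@cayuga-login1.cac.cornell.edu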

Hardware

  • Qty 1: A100 GPU node (g0001), containing:
    CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=1024000
    GPU [0-3]: NVIDIA A100 80GB PCIe

  • Qty 2: A40 GPU nodes (g000[2-3]), each containing:
    CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=1024000
    GPU [0-3]: NVIDIA A40 48GB PCIe
  • Qty 11: CPU nodes (hyperthreading ON) (c00[01-11]), each containing:
    CPUs=112 Boards=1 SocketsPerBoard=2 CoresPerSocket=28 ThreadsPerCore=2 RealMemory=768000

Networking

EDR Infiniband

Storage

Home Directories

  • Path: ~ OR $HOME OR /home/fs01/<cwid>
  • Users' home directories are located on an NFS export from the Cayuga head node.
  • Most data should go in your project folder under /athena, but some smaller data sets make more sense to keep in your $HOME:
    • Scripts, code, profiles, user-installed software, and other files for which $HOME is the assumed location.
    • Small datasets or low I/O applications that don't benefit from a high-performance filesystem.
    • Data rarely or never accessed from compute nodes.
    • Applications where client-side caching is important: binaries, libraries, virtual/conda environments, Singularity containers (unless staging to /tmp on compute nodes is feasible).
  • Data in users' home directories are NOT backed up; users are responsible for backing up their own data.

Athena

  • Parallel file system (3.8 PB)
  • Each cayuga project has a scratch directory for each member: /athena/cayuga_####/scratch/[cwid] (see the example below)
  • There is also a symlink from the lab name to each cayuga project: /athena/[labname] --> /athena/cayuga_####
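
For example, a job script or interactive session might reference the scratch area like this (cayuga_0001 and [labname] are placeholders for your own project and lab; your CWID is assumed to match $USER):

  cd /athena/cayuga_0001/scratch/$USER      # via the project number
  cd /athena/[labname]/scratch/$USER        # equivalent, via the lab-name symlink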

Copying data to cayuga-login1

  • The recommended method of transferring files is Globus, using the cayuga endpoint: https://www.cac.cornell.edu/TechDocs/files/FileTransferGlobus/
  • If you use rsync to copy data to the cayuga cluster, you must use your key, just as you do for login. An example rsync command:
    • rsync -avhP -e "ssh -i ~/.ssh/your_cayuga_key" /Path_to_FromDir_Data [cwid]@cayuga-login1.cac.cornell.edu:/athena/[labname]/scratch/[cwid]/

Scheduler

Partitions (Queues)

There are currently 2 partitions on the cayuga cluster that everyone can submit to (an example batch script appears after this list):

  • scu-cpu: PartitionName=scu-cpu Nodes=c000[1-9] Default=YES MaxTime=7-0
  • scu-gpu: PartitionName=scu-gpu Nodes=g000[1-3] Default=NO MaxTime=7-0
  • Access to the above partitions is regulated through the Slurm fairshare system.
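
A minimal batch script sketch showing partition selection (the job name, resource requests, time limit, and my_program are placeholders):

  #!/bin/bash
  #SBATCH -J my_job
  #SBATCH -p scu-cpu          # or scu-gpu for the GPU nodes
  #SBATCH -n 1
  #SBATCH --mem=8G
  #SBATCH -t 1-0              # 1 day; both partitions have MaxTime=7-0

  ./my_program

Submit the script with: sbatch <script_name>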

Requesting GPUs

  • To request specific numbers of GPUs (either a40 or a100), you should add your request to your srun/sbatch:
   for the a40:   --gres=gpu:a40:<# of requested GPUs>
   for the a100:  --gres=gpu:a100:<# of requested GPUs>
Example: to have two A40 GPUs assigned to your bash session:
[cayuga-login1 ~]$ srun -p scu-gpu --gres=gpu:a40:2 --pty bash
bash-4.4$ hostname
g0002
bash-4.4$ nvidia-smi
Wed Aug 30 15:46:06 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                      On | 00000000:17:00.0 Off |                    0 |
|  0%   28C    P8               27W / 300W|      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                      On | 00000000:65:00.0 Off |                    0 |
|  0%   27C    P8               30W / 300W|      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Job Containment

Hardware Features

  • If you want your job to run on specific hardware types, you can specify constraints with -C.
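
For example (the feature name here is a placeholder; run scontrol show node <nodename> to see which Features are actually defined on Cayuga's nodes):

  sbatch -C <feature_name> my_job.sh
  srun -p scu-gpu -C <feature_name> --pty bash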

QOS

Priority

Resource Limits

User-Installed Software

A lot of software can be installed in a user's $HOME directory without root access, or is easy to build from source. Please check for such options, as well as the virtual environment and container solutions described below before requesting system-wide software installation (unless there are licensing issues).

Python Virtual Environments (venv)

Users can manage their own python environment (including installing needed modules) using virtual environments. Please see the documentation on virtual environments on python.org for details.

Anaconda (Miniconda)

NOTE: Consider starting with Miniconda if you do not need a large number of packages; it is smaller and faster to install and update.

Example (using /athena/cayuga_0001/scratch/jhs3001 as the install location):

  mkdir -p /athena/cayuga_0001/scratch/jhs3001/miniconda3
  wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /athena/cayuga_0001/scratch/jhs3001/miniconda3/miniconda.sh
  bash /athena/cayuga_0001/scratch/jhs3001/miniconda3/miniconda.sh -b -u -p /athena/cayuga_0001/scratch/jhs3001/miniconda3
  rm -rf /athena/cayuga_0001/scratch/jhs3001/miniconda3/miniconda.sh
  /athena/cayuga_0001/scratch/jhs3001/miniconda3/bin/conda init bash

Log out and back in, or type: source .bashrc
You may also need to run: conda update -n base -c defaults conda
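
Once conda is initialized, you can create and activate environments; a minimal sketch (the environment name, Python version, and package are arbitrary examples):

  conda create -n myenv python=3.10
  conda activate myenv
  conda install numpy
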
Create Virtual Environment

You can create as many virtual environments, each in their own directory, as needed.

  • python3.9: python3.9 -m venv <your virtual environment directory>
Activate Virtual Environment

You need to activate a virtual environment before using it:

source <your virtual environment directory>/bin/activate

Once such an environment is activated, both python and python3 point to the environment's python3.9.
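
You can confirm which interpreter is active with:

  which python
  python --version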

Install Python Modules Using pip

After activating your virtual environment, you can now install python modules for the activated environment:

  • It's always a good idea to update pip first:
pip install --upgrade pip
  • Install the module:
pip install <module name>
  • List installed python modules in the environment:
pip list
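
Putting it together, a minimal end-to-end sketch (the environment directory and package name are arbitrary examples):

  python3.9 -m venv ~/envs/myproject
  source ~/envs/myproject/bin/activate
  pip install --upgrade pip
  pip install numpy
  pip list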

Singularity Containers

Singularity is a container system similar to Docker, but suitable for running in HPC environments without root access. You might want to use Singularity if:

  • You're using software or dependencies designed for a different Linux distribution or version than the one on Cayuga.
  • Your software is easy to install using a Linux distribution's packaging system, which would require root access.
  • There's a Docker or Singularity image available from a registry like Nvidia NGC, Docker Hub, Singularity Hub, or from another cluster user with the software you need.
  • There's a Dockerfile or Singularity recipe that is close to what you need with a few modifications.
  • You want a reproducible software environment on different systems, or for publication.

Singularity is provided as an environment module; to use it, first run module load singularity.

Download an existing image with singularity pull, which doesn't require root access. If multiple people will use the same image, we can publish them in a shared location.

Build a new image with singularity build, which usually must be run on an outside machine where you have root access. Then you can upload it directly to the cluster to run, or transfer it through a container registry.

Run software in the container using singularity run, singularity exec, or singularity shell. Performance will likely be best if you copy the image to local disk on the compute node, or run it from the HOME filesystem (not the Athena parallel filesystem). Remember that the container potentially works like a different OS distribution and software stack, even though it can access some host filesystems by default (such as HOME), so be careful about interactions with your existing environment (shell startup files, lmod, Anaconda, venv, etc.). Consider using the -c option or maintaining an environment specific to each container you use.
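
A minimal workflow sketch (the Docker Hub image used here is just an example):

  module load singularity
  singularity pull docker://ubuntu:22.04                  # creates ubuntu_22.04.sif in the current directory
  singularity exec ubuntu_22.04.sif cat /etc/os-release   # run a command inside the container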

Software List

  Software              Module to set PATH         Notes
  Matlab 2023           module load matlab/2023
  Rstudio 2023 4.2.1    module load rstudio

Using rstudio on the cluster example

  • Add to (or create) the file ~/.ssh/config on your laptop, with entries for the login node and all hosts you may use with rstudio:
 Host cayuga-login1
   Hostname cayuga-login1.cac.cornell.edu
   IdentityFile ~/.ssh/your_cayuga_key
   User your_cwid
 Host c0001
   Hostname c0001
   IdentityFile ~/.ssh/your_cayuga_key
   User your_cwid
 Host c0002
   Hostname c0002
   IdentityFile ~/.ssh/your_cayuga_key
   User your_cwid
  • Once the above has been added to your ~/.ssh/config file, log in to cayuga-login1.cac.cornell.edu with your key
  • type: module load rstudio
  • type: rstudio_run
  • Follow the output.
  e.g., if you are put on c0002, open a new terminal window on your laptop and type:
     ssh -J your_cwid@cayuga-login1 -NL [port#]:localhost:[port#] your_cwid@c0002

B. If not connected to the VPN: ***Option B will only work if you previously had an account on the Greenberg cluster (aphrodite/pascal)***

  • Once the forwarding is setup (your above terminal window on your laptop will appear like it is hanging), bring up a browser on your local box with: http://localhost:[port#]
  • Log in with your CWID and the password that was provided when you ran rstudio_run
  • When done using RStudio, terminate the job by:
 * Exit the RStudio Session ("power" button in the top right corner of the RStudio window)
 * Issue the following command on the login node:
      scancel -f [your_job_id] 

Using jupyter notebook on the cluster example

  • Setup .ssh/config on your laptop:
    • Remove any 'Host' section you have that matches c000[1-9], and replace it with (or add to your .ssh/config):
 Host c000*
   IdentityFile ~/.ssh/[your_key_name]
   User [your_cwid]
   ProxyCommand ssh -i ~/.ssh/[your_key_name] -W %h:%p cayuga-login1
  • ssh -i .ssh/[your_key_name] [your_cwid]@cayuga-login1
  • from cayuga-login1:
  module load anaconda3
  srun --pty -n1 --mem=8G -p scu-cpu /bin/bash -i
  • type: hostname (to see what compute node you were put on)
  • Let's say you were put on c0001:
   type: jupyter notebook --no-browser --ip 0.0.0.0 --port=8962
  • Back in a terminal on your laptop (e.g., macOS):
  ssh -NL 127.0.0.1:8962:c0001:8962 c0001
  • In a browser on your laptop:
   paste the address from the jupyter output line that starts with http://127.0.0.1:8962/?token= into your browser. Mine looked like:
  http://127.0.0.1:8962/?token=dd21318d568114149b7b169fad09466fc8683b5b1773fd0e
   (copied from the output printed when jupyter notebook was started)

Environment Modules (Lmod)

Set up your working environment for each software package using the module command. The module command will activate dependent modules if there are any.

  • Show all available modules:
  module avail
  • Show currently loaded modules:
  module list
  • Load a module:
  module load [software_name/version] (as shown in the output of module avail)
  • Unload a module:
  module unload [software_name/version]

It is possible to create your own personal modulefiles to support the software and settings you use.
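
A minimal sketch of a personal modulefile setup (the directory layout, tool name, and install path are hypothetical; Lmod reads Lua modulefiles such as the one described below):

  mkdir -p ~/modulefiles/mytool
  # put the following (Lua) line in ~/modulefiles/mytool/1.0.lua :
  #   prepend_path("PATH", pathJoin(os.getenv("HOME"), "software/mytool/bin"))
  module use ~/modulefiles
  module avail                # mytool/1.0 should now be listed
  module load mytool/1.0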

Software Installation Requests

Software will generally only be installed system-wide if:

  • It's a simple package install from a standard repository (Rocky, EPEL, OpenHPC-2) with no likely conflicts.
    • Try dnf search as a quick check for package availability.
  • It's required for licensing reasons, subject to additional direct approval by the cluster owner (potentially only the license infrastructure will be installed).
  • It can't be installed by the mechanisms above, or is version-stable and widely used, with direct approval of the cluster owner.

Default Shell

The default shell for all users is bash. If you would like a different shell because some tools require it, you may request a change; you will *not* be able to use chsh to change it yourself. To change your default shell (on all CAC clusters), send a request to [email protected] asking for a login shell change on the cayuga cluster.

Rules, Tips, and Best Practices

  • Head node. Don't run jobs on the head node, as it can make things unresponsive for other users or in the worst case take down the whole cluster. It's ok to compile code, scan files, and do other minor administrative tasks on the head node though.
  • Threads. If you have a multithreaded job, limit the number of threads to roughly 1 or 2 per reserved core, or reserve one core for every one or two threads. The simplest way to do this is usually to use -c with a value that is double the number of cores you want (e.g. if you want 4 cores per task, use -c 8), since Slurm sees each core as 2 CPUs due to hyperthreading. However, your program might not use hyperthreading well: many multithreaded programs default to the number of CPUs they see on the system and are not aware of scheduled resources. We now force CPU affinity, so jobs with too many threads should no longer interfere with other jobs (but might hurt their own performance).
  • SIMD jobs / job arrays. If you are running many instances of the same job with different data or settings, please don't launch lots of separate jobs in a loop. Use a job array instead: they are easier to monitor, manage, and cancel. Be sure to set a slot limit (the % notation) to avoid flooding the queue and to allow others to use the resources too; as a rule of thumb, you should be using less than 25% of any in-demand resources such as GPUs. It's good for there to always be some idle resources in case someone needs to test something quickly using a small amount of resources. Also, please run such jobs at lower priority (using qos or nice) when possible, or otherwise communicate directly with other users in case there's a problem. See the ics-research/ics-cluster-script-examples repository on COECIS Github, and the sketch after this list, for examples.
  • Job priority / nice. Please follow the guidelines for QOS level, and use nice as applicable. Keep in mind that if everyone runs at the highest priority all the time, the priority levels will become useless. See above.
  • Storage performance.
  • Interactive use.
  • Interactive use with multiple windows.
  • Code development/testing/debugging.
  • Long-running jobs / checkpoint-and-resume.
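
A minimal job array sketch with a slot limit (the array size, the slot limit of 10, the resource requests, and process_one.sh are placeholders; size your usage to stay well under the 25% guideline above):

  #!/bin/bash
  #SBATCH -J my_array
  #SBATCH -p scu-cpu
  #SBATCH -n 1
  #SBATCH --mem=4G
  #SBATCH --array=1-100%10    # 100 tasks, at most 10 running at once (the "%" slot limit)
  #SBATCH --nice=100          # run at lower priority as a courtesy to other users

  ./process_one.sh ${SLURM_ARRAY_TASK_ID}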

Help