Skip to content

Latest commit

 

History

History
163 lines (114 loc) · 5.81 KB

infrastructure-aurora.markdown

File metadata and controls

163 lines (114 loc) · 5.81 KB
title layout categories
Using Apache Aurora
main
infrastructure

It's now possible to use Apache Aurora to schedule long running jobs, cron jobs, and ad hoc job on the ALICE Mesos infrastructure.

ALICE Aurora instance can be found at:

https://aliaurora.cern.ch

It is used for a number of jobs an in particular for the continuous builders of the pull request.

Access is allowed to ALICE members who are part of the alice-aurora-users egroup. You can subscribe to it by going to the usual egroups page.

User guide

Getting the client

{:get-the-client}

The GUI is only a read-only view on the state of the jobs running on the cluster. If you want to interact with Aurora itself you will need the aurora client.

You can get a binary distribution of the Aurora client at:

https://github.com/alisw/aurora/releases/tag/0.16.0-alice2

Alternatively you can download the sources and build it with:

./pants binary src/main/python/apache/aurora/kerberos:kaurora
cp dist/kaurora aurora

If you use homebrew, you can also do:

brew install ktf/system-deps/alice-aurora

Configuring your environment

{:configuring}

Access is allowed to ALICE members who are part of the alice-aurora-users egroup. You can subscribe to it by going to the usual egroup page.

The authentication mechanism uses kerberos, so you should make sure you have a valid kerberos ticket in the CERN.CH realm. You can verify that by doing:

$ klist

which should result in something like:

Credentials cache: API:5BD2DB44-B9A8-48BD-9CD1-47078B7D00A9
        Principal: <your-cern-user-name>@CERN.CH

  Issued                Expires               Principal
Aug 13 14:26:57 2019  Aug 14 00:26:57 2019  krbtgt/[email protected]

In order to reach the Aurora cluster, you need to configure how to access it. This is done by creating a file ~/.aurora/clusters.json:

[{
  "name": "build",
  "scheduler_uri": "https://aliaurora.cern.ch",
  "auth_mechanism": "KERBEROS",
  "slave_run_directory": "latest",
  "slave_root": "/build/mesos"
}]

If everything is setup as expected you should be able to get a list of jobs by doing:

$ aurora job list build

If you are an admin, you should also verify that aurora_admin also works.

$ aurora_admin get_cluster_config build
{"auth_mechanism": "KERBEROS", "name": "build", "scheduler_uri": "https://aliaurora.cern.ch"}%

Run a simple application

{: simple-app}

We keep Aurora configuration files in:

https://gitlab.cern.ch/ALICEDevOps/ali-marathon

in the aurora folder. You can for example look at the "Hello world" example:

ali-marathon/aurora/hello.aurora

You can start it with:

$ aurora job create build/mesostest/devel/hello hello.aurora
  INFO] Creating job hello
  INFO] Checking status of build/mesostest/devel/hello
 Job create succeeded: job url=https://aliaurora.cern.ch/scheduler/mesostest/devel/hello

This will start on the cluster a (somewhat) long running job. You can open the provided web page to look at the workarea. If you want to interact with the job in an ad-hoc manner, e.g. to debug what it is doing or force some action to it, you can SSH to the machine running the job by doing:

$ aurora task ssh build/mesostest/devel/hello/0

which will ssh for you in the sandbox for the job on the machine it is running. Notice that you might still have to docker exec yourself into the container to get the correct environment. Alternatively you can execute a one off job by doing:

$ aurora task run build/mesostest/devel/hello/0 "hostname > foo.txt"

You can find more information about the available commands in the official Aurora documentation.

Gotchas & issues

{: gotchas}

  • On Costin machines, for security reasons the log provider are not running, so you need to directly ssh inside them and look at the filesystem.

  • On some systems, the CERN CA is not available by default. You can overcome this by either:

    • Go to https://ca.cern.ch and install all the required CA certificates. In general this is what is needed on macOS.
    • Obtain it via scp lxplus.cern.ch:/etc/ssl/certs/ca-bundle.crt ca-bundle.crt and doing export REQUESTS_CA_BUNDLE=$PWD/ca-bundle.crt.
    • Installing the CERN-CA-certs package and doing export REQUESTS_CA_BUNDLE=$PWD/ca-bundle.crt.
  • On some systems, kerberos gives a token for the actual backend name, rather than aliaurora. You can check that by doing klist and you will see HTTP/[email protected]:

Credentials cache: API:B7FC3DD4-738F-417E-B2FA-92B2CCA9590C
        Principal: [email protected]

  Issued                Expires               Principal
Aug 14 15:50:42 2019  Aug 15 01:50:42 2019  krbtgt/[email protected]
Aug 14 15:50:46 2019  Aug 15 01:50:42 2019  HTTP/[email protected]

In order to fix this you will have to change your kerberos configuration, usually found in /etc/krb5.conf, and add rdns = false in the [libdefaults] stanza.

  • On mac the most reliable way to operate is:

    • First do kdestroy
    • Then do kinit
    • Finally aurora job list
    • The three steps above should guarantee that Firefox has the correct token.
  • SSH / running ad-hoc tasks on some of the machines requires extra work, most notably on Costin's alientest machines which do not use kerberos. In order to be able to login, you need to ask the admin of such machines to add your SSH key to the .authorised_keys of the user which has the same name as the role for the job (e.g. mesostest for build/mesostest/devel/hello).