This is an automatic job manager for running TPU jobs. It supports auto-resuming jobs on preempted or GRPC-erroring TPUs, and monitoring job status.
Here is a quick guide to common usage; you can find more details in the full docs below.
TL;DR usage in two sentences: use tpu add-user to add your username, then go to your working directory (where you have your scripts and code) and use tpu set-cur 1 username to set the working directory. Use tpu run <tpu> username (e.g. tpu run v2-32-p2 xibo) to run the job, and tpu monitor/check username to see the status of all your jobs. (For preemptible TPUs, the tpu run command auto-resumes the job when it is preempted or hits a GRPC error; you don't have to set anything.)
More usage in two sentences: use tpu tldr to see useful commands, and tpu clean username to clear finished/crashed jobs; use tpu -a alias_name full_name username (e.g. tpu -a lr config.training.learning_rate xibo) to add a new alias, then you can pass configs such as tpu run v2-32-6 xibo lr=0.01. Use tfind to search for TPUs in the spreadsheet, tpu describe <tpu> to check the environment of a TPU, and tpu solve <tpu> to fix the environment automatically.
REMEMBER TO UPDATE YOUR SCRIPTS!
To use our scripts, your repo should have the following structure:
- utils
- remote_run_config.yml: includes a key called "wandb_notes" representing the notes you want to display on the spreadsheet
- ...
- just_staging.sh
- staging.sh
- run_remote.sh
- ka.sh
The scripts can be found in ZHH's repo; contact [email protected] if you want them. The scripts use wandb logging to detect job completion, so make sure your code uses wandb for logging.
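For reference, a minimal remote_run_config.yml might look like the sketch below. Only the wandb_notes key is documented here; the note text is made up for illustration:

```yaml
# utils/remote_run_config.yml — minimal sketch; only wandb_notes is documented
wandb_notes: "baseline run, lr sweep"
```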
1. Setup (IMPORTANT)
You should update your scripts to the newest version supporting command-line arguments. The newest scripts can be pulled from zhh's repo. The current finishing check is based on wandb final output, so please make sure your scripts are using wandb to log the final output.
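As a mental model of the finishing check (this is an illustration, not the actual implementation), think of it as scanning the tail of the captured tmux output for wandb's final output; the marker strings below are assumptions:

```python
# Sketch of a wandb-based finishing check (hypothetical markers, not the real ones).
def classify_status(tail: str) -> str:
    """Classify a job from the last chunk of its captured tmux output."""
    if "wandb: Run summary:" in tail or "wandb: Synced" in tail:
        return "finished"   # wandb printed its final output
    if "Traceback (most recent call last)" in tail:
        return "error"      # Python crashed before wandb could finish
    return "running"        # no terminal marker seen yet

print(classify_status("... wandb: Synced 5 artifacts ..."))  # finished
```

This is why a run that never logs to wandb can never be marked as finished.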
Also, this script is not very robust against adversarial input, so please try not to do OOD things, for example setting your username to run, false, v2-32-2, or Chinese characters.
Use tpu add-user and follow the instructions to add your username. Please choose a username long enough that it is not a prefix of another user's, to avoid errors (e.g. tmux attach -t b may attach to bird by prefix matching).
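To see why short usernames are risky, here is a small sketch of the prefix collision that tmux-style prefix matching can cause (the helper name is made up; this is not part of the tool):

```python
# Hypothetical check: a new username must not be a prefix of an existing one
# (and vice versa), or tmux attach -t may match the wrong session.
def has_prefix_conflict(new_user: str, existing: list[str]) -> bool:
    return any(u.startswith(new_user) or new_user.startswith(u)
               for u in existing if u != new_user)

print(has_prefix_conflict("b", ["bird", "xibo"]))      # True: "b" matches "bird"
print(has_prefix_conflict("bruce", ["bird", "xibo"]))  # False: no prefix overlap
```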
2. Setting Working Directory & Running/Monitoring Jobs (IMPORTANT)
The working directory is the place where you have your scripts (staging.sh etc.) and code.
You can set multiple working directories and choose any of them when running code. The default working directory is 1.
You can set a working directory, see your working directories, and run a job by:
tpu set-cur num username # Set the working directory <num> to the current directory, default directory is 1
tpu ls username # List all the working directories
tpu run tpu_name username [dir=1] # Run the job in working directory <dir>
The tpu_name uses the pre-defined TPU aliases, like v2-32-6, v2-32-p1, or v4-32-py2. You can also pass the full name, such as kmh-tpuvm-v2-32-1.
We also support passing only the TPU type, like v2, v3, v23 (v2 or v3), v3+ (v3 or v4), or v3-32, and -n/-p for normal/preemptible TPUs. If you pass only the TPU type, it will show all the TPUs of that type for you to choose interactively. Alternatively, you can pass -auto to auto-select a free TPU of that type; if there are no free TPUs, it will show all the reserved TPUs for you to choose from.
To see all the aliases, use tpu -lta/-sta (list/show TPU aliases). You can also add aliases by tpu -ata alias FULL_TPU_NAME (add TPU alias). Please don't add aliases that may conflict with other tokens, for example a username, tag, config, or s.
Example:
trun/tpu run v2-32-6 xibo # Run the job in working directory 1 using tpu v2-32-6
trun/tpu run v2-32-p1 lyy 2/dir=2 # Run the job in working directory 2 using tpu v2-32-p1
trun/tpu run v2-32 v3-32 -p xibo -auto # Auto-select a free preemptible TPU of type v2-32 or v3-32
The tpu run command opens a monitor window to track all your jobs. Alternatively, you can use:
tm/tpu monitor username
which updates the monitor window every 10 seconds. For one-time checks, use:
tcks/tpu check-simp username
The run command will automatically resume preempted TPU jobs; see section 2B or 6 (More on Resuming/Rerunning) for details.
If no TPUs are available, you can use the queue command instead of run. queue takes a TPU type or a TPU name and starts the job when a valid TPU finishes. You can use tpu vq or tvq to see the queue, and tpu dequeue <id> <user> to delete a job from the queue.
2A. More Directory Operations (OPTIONAL)
```bash
tpu del-dir <num> username # Delete the working directory <num>
tpu swap-dir <num1> <num2> username # Swap the working directory <num1> and <num2>
```
2B. Advanced Running Settings (OPTIONAL)
The run command will ask whether to reapply when the TPU is preempted.
You can add the flag -apply to skip the prompt.
You can add the flag -q to skip the monitor window.
You can pass tag=your_tag to add a tag to the job, which will be shown in the monitor window.
You can add tags to existing jobs by:
tpu add-tag window_num tag_name username
You can change the default rules for resuming/rerunning by passing rule=<rule> to the tpu run command. (Default: for preemptible TPUs, auto-resume on GRPC errors and auto-reapply and resume when preempted; for other TPUs, do nothing (you can set rule=resume to make them resume). See the More on Resuming/Rerunning section.)
2C. Advanced Monitor Configs (OPTIONAL)
The monitor shows four things: the window number (w), the directory (d), the TPU (t), and the job status (s). You can choose which to show by adding flags. There is also an additional "verbose" (v) flag, which shows the (truncated) messages from the tmux windows even for running jobs with known status (should be used with s). For example, to show only the working directory, the job status, and the detailed output of xibo, use:
tpu monitor xibo -dsv
If you don't want tpu run to open the monitor window, you can use tpu set-settings monitor_after_run False username to disable it. You can also set whether the monitor shows the TPU/directory by default. See the Customizing User Settings section for more details.
2D. Spreadsheet Support (OPTIONAL, RECOMMENDED)
The tpu run command automatically sets the status in the spreadsheet to running by you. If you want to set the notes, add the -ssn flag (short for --set-spreadsheet-notes) to set them interactively, or pass ssn="your notes" to set them directly. (Note: please don't include = in the notes, as it may introduce parsing errors, e.g. ssn="ssn = test".) The notes set by ssn will be shown as a tag in the monitor window; add the -no-tag flag if you don't want that. If no notes are provided, it will try to extract the key wandb_notes from the config file.
You can also set the notes afterwards by tpu ssn/asn <tpu> <notes>, for example tpu ssn v2-32-6 "This is a test". ssn resets the notes to "This is a test", while asn appends to the current notes.
You can use tpu find <all_tpu_types> (or tfind for short) to look at the status of the TPUs in the spreadsheet. The format of tpu_types is like v2, v3, v234 (or v*), or v2-32. You can also pass -n for normal TPUs and -p for preemptible TPUs. For example, to show the status of all non-preemptible v3-32 and v4 TPUs, you can do:
tpu find v3-32 v4 -n
If no TPU type is passed, it will show all the TPUs.
You can release a TPU by tpu release/rel <tpu_name>, which sets the status and user to free ('闲的') in the spreadsheet. You can also use tpu release/rel <tpu_name> <username> to verify that the TPU is currently owned by you (recommended).
3. Killing Jobs/Windows & Cleaning up (USEFUL)
As you run more and more jobs, there will be a lot of tmux windows, which is messy.
You can use (recommended to do occasionally):
tpu clean username
to kill all the tmux windows whose jobs are finished/error/killed.
To kill a job, use:
tpu kill/kill-job/-k/-kj w=/-w=/window=/<windows_id> username
You can also just pass the window id; in this case the command treats the integer among the arguments as the window id. For example, tpu kill 101 xibo kills the job with window id 101, but passing w= is safer for future use.
Jobs whose children were rerun/resumed will be cleaned based on the status of their children. Use tpu clean username -re to clean the rerun/resumed jobs as well.
IMPORTANT: if a job has a rerun rule and has hit a GRPC error, remember to use clean after manually killing its window; otherwise it may be rerun.
3A. Other killing commands (OPTIONAL)
To kill a specific tmux window (NOT RECOMMENDED):
tpu -kw/kill-window window_number username
After killing windows, some jobs may become "zombies" (i.e., jobs without associated windows). You can use these helpers to clean zombies (supported, but NOT RECOMMENDED):
tpu -czw username # Clear all zombie windows
tpu -czj username # Clear all zombie jobs
tpu clear-finished username # Clear all finished jobs
tpu clear-error username # Clear all error jobs
tpu clear-all username # RECOMMENDED: Clear all finished/error jobs
The clean command integrates these actions, so using kill-job + clean is strongly recommended over manually killing windows with tmux kill-window or exiting the window yourself. (If you prefer to kill windows yourself, we recommend running tpu clean username occasionally to clear the job data associated with those windows; otherwise others may get an annoying warning that the TPU is occupied by your dead jobs.)
4. Environment Operations (OPTIONAL)
We support common operations, such as:
tpu apply/reapply tpu_name # Apply/reapply the TPU; reapply deletes and recreates the TPU
If you applied or want to apply for a new TPU that is not recorded (e.g. v4-32-pre-newname), please run
tpu register
to register the new TPU in the spreadsheet, so that you can use it in the tpu run command.
To delete the registration, you can use:
tpu del-registered/del-reg/del-info tpu_alias # Delete the TPU registration
Environment operations are also supported:
tpu test tpu_name # Test the TPU environment with commands interactively
tpu mount-disk tpu_name # Mount the disk and set up wandb for the TPU
tpu describe tpu_name # Describe the TPU environment
tpu check-status tpu_name # Check the TPU status (e.g., PREEMPTED, READY, CREATING, etc.)
An automatic environment solver is available to address TPU environment issues.
Currently it handles mounting issues, but contributions are welcome to turn it into a powerful one-line tool for solving the complex TPU environment problems you have encountered. Ideally, we then only need to manually fix each possible issue once!
tpu solve tpu_name # Integrated automatic environment solver
5. Passing Configs in Command Line (OPTIONAL)
We support passing configs in the command line by config aliases or full config name. You can also set your own config alias by:
tpu -a/-alias your_alias FULL_NAME username # add/change an alias
tpu -sa username # list all the aliases
tpu del-config-alias your_alias username # delete the alias
For example, you can do:
tpu -a lr config.training.learning_rate xibo
Then:
tpu run v2-32-6 xibo lr=0.01
tpu run v2-32-6 xibo config.training.learning_rate=0.01 # This is also supported
Some default aliases:
"lr": "config.training.learning_rate"
"bs": "config.training.batch_size"
"ep": "config.training.num_epochs"
"wd": "config.training.weight_decay"
"b1": "config.training.adam_b1"
"b2": "config.training.adam_b2"
"ckpt": "config.training.checkpoint_per_epoch"6. More on Resuming/Rerunning (OPTIONAL)
You can manually resume/rerun a job by:
tpu resume window=<windows_id> username # resume the job
tpu resume window=<windows_id> tpu=<tpu> username # resume the job in a new TPU
tpu rerun window=<windows_id> username # rerun the job
tpu rerun window=<windows_id> tpu=<tpu> username # rerun the job on a new TPU
The difference between resume and rerun is that resume loads the job from the last checkpoint, while rerun starts a new job from the beginning.
Our default rules for resuming/rerunning are as follows:
For preemptible TPUs, we reapply the TPU and resume the job when it is preempted, and resume the job when it encounters a GRPC error. For non-preemptible TPUs, we perform no operations by default.
You can pass the rule=<rule> to the tpu run command to set the rules. The available rules are:
- reapply: Reapply when a GRPC error occurs or when preempted.
- pass (default for non-preemptible TPUs): Do nothing.
- rerun: Rerun on a GRPC error, reapply when preempted.
- pre (default for preemptible TPUs): Reapply and resume when preempted, resume on a GRPC error.
- resume (recommended for non-preemptible TPUs, may become the default someday): Resume on a GRPC error, pass when preempted.
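The rules can be summarized as a mapping from rule name to the actions taken on a GRPC error and on preemption. This is a sketch based on the descriptions in this README, not the tool's source; "pre" combines reapply-and-resume on preemption, which is abbreviated to "reapply" here:

```python
# (grpc_action, preempted_action) per rule, per this README's descriptions.
RULES = {
    "reapply": ("reapply", "reapply"),
    "pass":    ("pass",    "pass"),     # default for non-preemptible TPUs
    "rerun":   ("rerun",   "reapply"),
    "pre":     ("resume",  "reapply"),  # default for preemptible TPUs
    "resume":  ("resume",  "pass"),
}

grpc_action, preempted_action = RULES["resume"]
print(grpc_action, preempted_action)  # resume pass
```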
For example, if you want a job running on a preemptible TPU to be rerun instead of resumed on a GRPC error, you can do:
tpu run v2-32-p2 xibo rule=rerun
If you want a job running on a non-preemptible TPU to be resumed on a GRPC error, you can do:
tpu run v2-32-2 xibo rule=resume
You can see all the rules using:
tpu check-rules
If you want to know inside your program whether the job is a resumed job (for example, to set a new wandb name/note), add the --log-stage flag to tpu run; it will then pass an additional argument config.stage indicating the number of resumes of this job. (For example, if the job has been resumed twice, i.e. there are 3 runs in total including the current one, the current run will receive an extra config.stage=2.)
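For instance, here is a hedged sketch of using config.stage to tag resumed runs in wandb (the helper name and the naming scheme are made up for illustration):

```python
# Hypothetical: derive a wandb run name from a base name and config.stage.
# stage=0 is the original run; stage=N means the job was resumed N times.
def wandb_run_name(base: str, stage: int = 0) -> str:
    return base if stage == 0 else f"{base}-resume{stage}"

print(wandb_run_name("vit-base"))           # vit-base
print(wandb_run_name("vit-base", stage=2))  # vit-base-resume2
```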
We have a MONITOR that periodically keeps track of all job statuses and decides whether to resume/rerun. The default checking frequency is about 30 minutes, i.e., a job waits at most 30 minutes to be resumed. If you run a job that immediately hits a GRPC error, you can ask the MONITOR to handle it right away by:
tpu ack
Then after no more than 3 minutes you should expect the job to be resumed (if not, contact the admin).
7. Customizing User Settings (OPTIONAL)
We support customizing settings for users, and you can set/get them by:
tpu set-settings key value username # set the settings
tpu get-settings username # get the settings
tpu reset-settings username # reset all the settings
The current default settings and their meanings are:
{
"monitor_after_run": True, # Whether to monitor the job after running
"monitor_upd_time": 5, # The update time for the monitor window
"monitor_length": 800, # The output capturing length for the monitor window to determine the job status
"monitor_dir": True, # Whether to show the working directory in the monitor window
"monitor_tpu": True, # Whether to show the TPU name in the monitor window
"monitor_verbose": False, # Whether to show the output in the monitor window when the status is known
"show_length": 200, # The output capturing length for the monitor window to show the job output
"time_zone": "us", # The user timezone, only support 'us'(UTC-4)/'cn'(UTC+8) for now.
"extra_settings": {} # The extra settings for future development
}
Also, to avoid concurrency issues in tmux window creation, we use a windows_offset to offset the window numbers for each user; the offset goes up by 1 for each new job. If you think the offset is too large, you can set it to a smaller number by:
tpu reset-window-num <num> <username> # reset the offset to <num>
Please be careful not to create conflicts with current jobs.
8. Documentation
tpu tldr
tpu -h command # details of the command
Some of the commands' help texts are not up to date; please refer to this README for the latest usage.
Code Structure
The user interface is implemented in tpu.py, and the specific function implementation is in utils/.
MONITOR.py does the check-and-resume work and runs all day; it checks the jobs and occasionally runs unit tests according to data["MONITOR_config"]. (You can see the full format of data.json below; it is the key metadata we maintain to manage all the jobs.)
We use MONITOR to refer to the global monitor process, to distinguish it from the local monitor window of each user.
For utils/:
- descriptions.py does all the documentation work
- operate.py does the TPU remote operations
- jobs.py does the job management
- directories.py deals with the user working dirs
- logger.py does most of the logging with metadata
- helpers.py does the helper functions
- error_handler.py does the error handling work
- unit_tests.py does the unit tests (sanity checks)
- sheet.py does the spreadsheet operations
- develop.py does the developer tools, to safely modify the metadata and avoid conflicts with current jobs (see more in the next paragraph)
Data Format
The key data is stored in data.json, and the program reads and writes it using the API in data_io.py, which implements locking (in lock.json).
The structure of data.json is as follows:
Full data.json structure
{
"users": {
"username": {
"id": 0,
"name": "username",
"tmux_name": "username",
"working_dir": {"1": "/path"},
"job_data": [],
"config_aliases": {"lr": "config.training.lr"},
"settings": {
"monitor_after_run": true,
"monitor_upd_time": 5,
"monitor_length": 800,
"monitor_verbose": false,
"monitor_dir": true,
"monitor_tpu": true,
"show_length": 300,
"time_zone": "us"
},
"windows_offset": 42,
"logs": []
}
},
"user_list": ["username"],
"id_list": [0],
"id_user_dict": {"0": "username"},
"user_id_dict": {"username": 0},
"tpu_aliases": {"v2-1": "kmh-tpuvm-v2-32-1"},
"all_tpus": {
"europe-west4-a": ["..."],
"us-central1-a": ["..."],
"us-central2-b": ["..."],
"preemptible": ["..."]
},
"monitor_config": {
"test_freq": 3600,
"checking_freq": 600
},
"wandb_api_key": "...",
"conda_env_name": "NNX",
"monitor_all_check_time": 20,
"MONITOR_logs": [],
"ack_MONITOR": false
}
Each job is described as:
Full job structure
{
"user": "username",
"windows_id": 1,
"job_dir_id": 1,
"job_dir": "/your/code/path",
"tpu": "kmh-tpuvm-v2-32-preemptible-1",
"job_tags": null,
"log_dir": "/your/log/path",
"staage_dir": "/your/staging/path",
"extra_configs": "--lr=0.01",
"status": "running",
"error": null,
"stage": 0,
"monitor": true,
"rules": {
"preempted": "reapply",
"grpc": "resume"
},
"extra_msgs": {},
"start_time": "20250420_011026",
"customized_settings": {}
}
TODO
- More testing/docs
- Support restarting TPU
- Customized monitor window
- Auto-choose the TPU to run a job
- More auto env solvers
- Logging for every user so that you can check what has happened since last time