Description
Problem Description
Building aiter with MAX_JOBS set results in the environment variable being silently ignored and overridden, sometimes causing the build to fail.
Operating System
linux
CPU
not relevant
GPU
None
ROCm Version
6.3.4
ROCm Component
No response
Steps to Reproduce
MAX_JOBS is currently overridden by a heuristic in multiple locations:
Lines 94 to 106 in 01330f6
    max_num_jobs_cores = max(1, os.cpu_count() * 0.8)
    if int(os.environ.get("MAX_JOBS", "1")) < max_num_jobs_cores:
        import psutil
        # calculate the maximum allowed NUM_JOBS based on free memory
        free_memory_gb = psutil.virtual_memory().available / (
            1024**3
        )  # free memory in GB
        max_num_jobs_memory = int(free_memory_gb / 0.5)  # assuming 0.5 GB per job
        # pick lower value of jobs based on cores vs memory metric to minimize oom and swap usage during compilation
        max_jobs = int(max(1, min(max_num_jobs_cores, max_num_jobs_memory)))
        os.environ["MAX_JOBS"] = str(max_jobs)
Lines 164 to 178 in 01330f6
    def check_and_set_ninja_worker():
        max_num_jobs_cores = int(max(1, os.cpu_count() * 0.8))
        if int(os.environ.get("MAX_JOBS", "1")) < max_num_jobs_cores:
            import psutil
            # calculate the maximum allowed NUM_JOBS based on free memory
            free_memory_gb = psutil.virtual_memory().available / (
                1024**3
            )  # free memory in GB
            max_num_jobs_memory = int(free_memory_gb / 0.5)  # assuming 0.5 GB per job
            # pick lower value of jobs based on cores vs memory metric to minimize oom and swap usage during compilation
            max_jobs = max(1, min(max_num_jobs_cores, max_num_jobs_memory))
            max_jobs = str(max_jobs)
            os.environ["MAX_JOBS"] = max_jobs
On machines with a large number of cores, any MAX_JOBS value below 80% of the core count is ignored and replaced with max(1, min(max_num_jobs_cores, max_num_jobs_memory)).
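To make the override concrete, here is a minimal sketch that mirrors the quoted check_and_set_ninja_worker logic as a pure function. The function name heuristic_max_jobs and its parameters are hypothetical; they stand in for the environment lookup, os.cpu_count(), and the psutil memory reading so the arithmetic can be followed without a large machine.

```python
def heuristic_max_jobs(env_max_jobs, cpu_count, free_memory_gb):
    """Mirror the quoted heuristic as a pure function (illustrative only).

    env_max_jobs   stands in for os.environ.get("MAX_JOBS", "1")
    cpu_count      stands in for os.cpu_count()
    free_memory_gb stands in for psutil.virtual_memory().available / 1024**3
    """
    max_num_jobs_cores = int(max(1, cpu_count * 0.8))
    if int(env_max_jobs) < max_num_jobs_cores:
        # The user's value is discarded on this branch.
        max_num_jobs_memory = int(free_memory_gb / 0.5)  # assuming 0.5 GB per job
        return max(1, min(max_num_jobs_cores, max_num_jobs_memory))
    return int(env_max_jobs)


# A user asks for 4 jobs on a host that reports 64 cores and 256 GB free.
print(heuristic_max_jobs("4", cpu_count=64, free_memory_gb=256.0))  # prints 51

# A request at or above 80% of the core count is left alone.
print(heuristic_max_jobs("64", cpu_count=64, free_memory_gb=256.0))  # prints 64
```

In the first call the requested 4 jobs become 51, which is the behavior reported here.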
This causes issues when building in a resource-limited environment such as a Kubernetes pod, where the values read by psutil.virtual_memory() and os.cpu_count() do not reflect the values set in the pod's resources.limits field.
This is also counter-intuitive: as a user, I'd expect MAX_JOBS to take precedence over any job number calculation heuristic.
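For readers who want to see the mismatch for themselves, the following sketch reads the limits a container is actually subject to. It assumes a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup (cgroup v1 uses different files), and the helper names are hypothetical. It is not part of aiter and is not proposed as the fix.

```python
from pathlib import Path


def parse_cpu_max(text):
    """Parse cgroup v2 cpu.max content: "<quota> <period>" or "max <period>".

    Returns the CPU limit as a float, or None when no limit is set.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def parse_memory_max(text):
    """Parse cgroup v2 memory.max content: a byte count or "max"."""
    value = text.strip()
    return None if value == "max" else int(value)


def read_cgroup_limits(root="/sys/fs/cgroup"):
    """Return (cpu_limit, memory_limit_bytes); None where unavailable."""

    def read(name, parser):
        try:
            return parser((Path(root) / name).read_text())
        except (OSError, ValueError):
            # File missing (no cgroup v2) or unexpected content.
            return None

    return read("cpu.max", parse_cpu_max), read("memory.max", parse_memory_max)
```

Comparing read_cgroup_limits() with os.cpu_count() and psutil.virtual_memory() inside a limited pod would typically show the host-level figures exceeding the pod's limits.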
Suggestion
Only use heuristics to calculate the number of jobs when MAX_JOBS is not set.
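A minimal sketch of what this could look like, assuming the same 80% core and 0.5 GB per job figures as the existing code. The names resolve_max_jobs and set_max_jobs_env are hypothetical and do not exist in aiter; this is one possible shape for the change, not a patch against the actual files.

```python
import os


def resolve_max_jobs(environ, cpu_count, free_memory_gb):
    """Return the job count, honoring MAX_JOBS whenever it is set."""
    user_value = environ.get("MAX_JOBS")
    if user_value:
        # An explicit setting always wins over the heuristic.
        # A non-integer value raises ValueError rather than being ignored.
        return max(1, int(user_value))
    # Fall back to the heuristic only when MAX_JOBS is unset or empty.
    max_num_jobs_cores = int(max(1, (cpu_count or 1) * 0.8))
    max_num_jobs_memory = int(free_memory_gb / 0.5)  # assuming 0.5 GB per job
    return max(1, min(max_num_jobs_cores, max_num_jobs_memory))


def set_max_jobs_env():
    """Write the resolved value to os.environ without touching a user setting."""
    if os.environ.get("MAX_JOBS"):
        return
    import psutil  # third-party; only needed on the heuristic path

    free_memory_gb = psutil.virtual_memory().available / 1024**3
    jobs = resolve_max_jobs(os.environ, os.cpu_count(), free_memory_gb)
    os.environ["MAX_JOBS"] = str(jobs)
```

With this shape, MAX_JOBS=4 stays at 4 regardless of what the host reports, and psutil is only imported when the heuristic is actually needed.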
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response