[Issue]: build: MAX_JOBS not respected #922

@dtrifiro

Problem Description

Building aiter with MAX_JOBS set results in the environment variable being silently overridden, which sometimes causes the build to fail.

Operating System

linux

CPU

not relevant

GPU

None

ROCm Version

6.3.4

ROCm Component

No response

Steps to Reproduce

MAX_JOBS is currently overridden by a heuristic in multiple locations:

aiter/setup.py

Lines 94 to 106 in 01330f6

max_num_jobs_cores = max(1, os.cpu_count() * 0.8)
if int(os.environ.get("MAX_JOBS", "1")) < max_num_jobs_cores:
    import psutil

    # calculate the maximum allowed NUM_JOBS based on free memory
    free_memory_gb = psutil.virtual_memory().available / (
        1024**3
    )  # free memory in GB
    max_num_jobs_memory = int(free_memory_gb / 0.5)  # assuming 0.5 GB per job
    # pick lower value of jobs based on cores vs memory metric to minimize oom and swap usage during compilation
    max_jobs = int(max(1, min(max_num_jobs_cores, max_num_jobs_memory)))
    os.environ["MAX_JOBS"] = str(max_jobs)

aiter/aiter/jit/core.py

Lines 164 to 178 in 01330f6

def check_and_set_ninja_worker():
    max_num_jobs_cores = int(max(1, os.cpu_count() * 0.8))
    if int(os.environ.get("MAX_JOBS", "1")) < max_num_jobs_cores:
        import psutil

        # calculate the maximum allowed NUM_JOBS based on free memory
        free_memory_gb = psutil.virtual_memory().available / (
            1024**3
        )  # free memory in GB
        max_num_jobs_memory = int(free_memory_gb / 0.5)  # assuming 0.5 GB per job
        # pick lower value of jobs based on cores vs memory metric to minimize oom and swap usage during compilation
        max_jobs = max(1, min(max_num_jobs_cores, max_num_jobs_memory))
        max_jobs = str(max_jobs)
        os.environ["MAX_JOBS"] = max_jobs

Whenever MAX_JOBS is lower than 80% of os.cpu_count(), which is the usual case on machines with a large number of cores, the user-provided value is discarded and MAX_JOBS is reset to min(max_num_jobs_cores, max_num_jobs_memory).
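To make the override concrete, here is a minimal sketch that mirrors the quoted core.py logic as a pure function. The function name and the idea of passing the core count and free memory in as arguments are my own additions for illustration (the real code reads them from os.cpu_count() and psutil); the arithmetic is copied from the snippet above.

```python
def heuristic_max_jobs(requested, cpu_count, free_memory_gb):
    """Hypothetical helper: reproduce the quoted heuristic with injected readings."""
    max_num_jobs_cores = int(max(1, cpu_count * 0.8))
    if requested < max_num_jobs_cores:
        # Same branch as the quoted code: the requested value is replaced.
        max_num_jobs_memory = int(free_memory_gb / 0.5)  # assuming 0.5 GB per job
        return max(1, min(max_num_jobs_cores, max_num_jobs_memory))
    # Only values at or above the core-based cap survive unchanged.
    return requested


# MAX_JOBS=4 on a host reporting 64 cores and 256 GB free memory
print(heuristic_max_jobs(4, 64, 256.0))   # 51
# MAX_JOBS=64 on the same host is left alone
print(heuristic_max_jobs(64, 64, 256.0))  # 64
```

With these example numbers, a request for 4 jobs turns into 51, which is the behavior described in this issue.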

This causes issues when building in a resource-limited environment such as a Kubernetes pod, where the values returned by psutil.virtual_memory() and os.cpu_count() describe the host and do not reflect the values set in the pod's resources.limits field.
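For anyone who wants to check the mismatch inside a pod, the sketch below reads the container limits directly. It assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup with cpu.max and memory.max files; that layout is an assumption about the runtime, not something aiter does, and the function names are hypothetical.

```python
import os
from pathlib import Path


def parse_cpu_max(text):
    """Parse cgroup v2 cpu.max ("<quota> <period>" or "max <period>") into a CPU count."""
    quota, period = text.split()
    if quota == "max":
        return None  # no CPU limit configured
    return int(quota) / int(period)


def parse_memory_max(text):
    """Parse cgroup v2 memory.max ("max" or a byte count)."""
    value = text.strip()
    return None if value == "max" else int(value)


def container_limits(root="/sys/fs/cgroup"):
    """Return (cpu_limit, memory_limit_bytes); None where no limit could be read."""
    base = Path(root)
    try:
        cpus = parse_cpu_max((base / "cpu.max").read_text())
    except (OSError, ValueError):
        cpus = None
    try:
        memory = parse_memory_max((base / "memory.max").read_text())
    except (OSError, ValueError):
        memory = None
    return cpus, memory


if __name__ == "__main__":
    print("os.cpu_count():", os.cpu_count())
    print("cgroup limits (cpus, bytes):", container_limits())
```

In a pod with a CPU limit of 2, I would expect the first line to still show the node's core count while the second shows 2.0, but I have only described the mechanism here, not measured it.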

This is also counter-intuitive: as a user, I'd expect MAX_JOBS to take precedence over any heuristic that calculates the number of jobs.

Suggestion

Only use heuristics to calculate the number of jobs to use when MAX_JOBS is not set.
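A rough sketch of what that could look like, not a tested patch against the repository. The function name and its parameters are hypothetical; the fallback reuses the 0.8 core factor and the 0.5 GB per job estimate from the quoted code, and the caller would still supply os.cpu_count() and the psutil reading.

```python
def resolve_max_jobs(environ, cpu_count, free_memory_gb):
    """Hypothetical helper: honor MAX_JOBS when set, otherwise apply the heuristic."""
    raw = environ.get("MAX_JOBS")
    if raw is not None:
        jobs = int(raw)
        if jobs < 1:
            raise ValueError(f"MAX_JOBS must be a positive integer, got {raw!r}")
        return jobs  # the user's choice always wins

    # MAX_JOBS is unset: fall back to the existing cores-vs-memory heuristic.
    # os.cpu_count() may return None, so treat that as a single core.
    by_cores = int(max(1, (cpu_count or 1) * 0.8))
    by_memory = int(free_memory_gb / 0.5)  # assuming 0.5 GB per job
    return max(1, min(by_cores, by_memory))


print(resolve_max_jobs({"MAX_JOBS": "4"}, 64, 256.0))  # 4
print(resolve_max_jobs({}, 64, 256.0))                 # 51
```

The same helper could be shared by setup.py and aiter/jit/core.py so the two copies of the heuristic cannot drift apart.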

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
