Job submission in the notebook doesn't work and no errors are given. #1969

Open
WilliamDoman opened this issue Aug 23, 2024 · 1 comment

WilliamDoman commented Aug 23, 2024

Question.

I'm trying to learn how to train a vision model using Azure Machine Learning workspace notebooks.

I am trying to create an environment where I can run both the Azure ML SDK v2 and PyTorch to train a vision model, with access to data assets from both the notebook and the remote compute.

When I run my environment, I can see that the package versions are all correct.

The problem is that the notebook won't submit the job when it uses my environment and kernel, and no errors are shown. If I switch to the built-in Python 3.10 - SDK V2 kernel, the job submits.
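Since the behaviour differs only by kernel, one way to compare the two kernels is to run the same small report in each and diff the output. This is a minimal sketch using only the standard library; `kernel_report` is a hypothetical helper name, and the package list is taken from the requirements.txt in this issue.

```python
import sys
from importlib import metadata


def kernel_report(packages):
    """Return the interpreter path and the installed version of each package."""
    report = {"python": sys.executable, "python_version": sys.version.split()[0]}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed in this kernel
    return report


# Run this cell once per kernel and compare the printed lines.
for key, value in kernel_report(
    ["azure-ai-ml", "azure-core", "azure-identity", "azure-storage-blob", "torch"]
).items():
    print(f"{key}: {value}")
```

If the interpreter path printed under the custom kernel is not the one where the packages were installed, the kernel may be importing a different copy of the SDK than expected.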

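# Imports needed by this cell (Azure ML SDK v2).
# dataset, environment, compute_cluster_name and ml_client are defined in earlier cells.
from azure.ai.ml import Input, Output, command
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# Define the command job
job = command(
    code="./",  # Folder containing the training script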
    command="python trainV2.py",  # Adjust to your script name
    inputs={
        "train_data": Input(type=AssetTypes.URI_FILE, path=f"{dataset.path}train_val_list_v2.txt"),
        "test_data": Input(type=AssetTypes.URI_FILE, path=f"{dataset.path}test_list_v2.txt"),
        "labels": Input(type=AssetTypes.URI_FILE, path=f"{dataset.path}Data_Entry_2017.csv"),
        "images": Input(type=AssetTypes.URI_FOLDER, path=f"{dataset.path}images")
    },
    outputs = {
        "outputFolder" : Output(type=AssetTypes.URI_FOLDER, mode=InputOutputModes.RW_MOUNT)
    },
    environment=environment,
    compute=compute_cluster_name,
    instance_count=1,
    display_name="exp",
    experiment_name="exp"
)

# Submit the job
results = ml_client.jobs.create_or_update(job)

The output I get in my environment:

Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Warning: the provided asset name 'ENV-Torch2_2-Cuda12_1_SDK2' will not be used for anonymous registration
Warning: the provided asset name 'ENV-Torch2_2-Cuda12_1_SDK2' will not be used for anonymous registration

But if I run the same code with the default Python 3.10 - SDK V2 kernel, I get the same output plus one additional line:

Uploading Exp (0.11 MBs): 100%|██████████| 107858/107858 [00:00<00:00, 970196.92it/s]
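Because the upload line never appears under the custom kernel, it may help to check whether `create_or_update` actually returned a job object or stopped quietly before uploading. The sketch below turns on debug logging and summarises whatever comes back. `describe_job` is a hypothetical helper; it reads `name`, `status` and `studio_url` with `getattr`, on the assumption that the returned job exposes those attributes, and reports `None` for any that are missing. It also assumes the SDK logs under the `azure` logger namespace.

```python
import logging

# Show DEBUG-level messages in the notebook output, replacing any handlers
# the kernel may already have configured.
logging.basicConfig(level=logging.DEBUG, force=True)
logging.getLogger("azure").setLevel(logging.DEBUG)


def describe_job(job):
    """Summarise a returned job object; missing attributes are reported as None."""
    return {field: getattr(job, field, None) for field in ("name", "status", "studio_url")}


# In the notebook, replace the submission cell with something like this
# (ml_client and job come from the cells above):
#
# try:
#     results = ml_client.jobs.create_or_update(job)
#     print(describe_job(results))
# except Exception:
#     logging.exception("Job submission failed")
#     raise
```

If the summary prints a name and a studio URL, the job was created and the missing line is only a difference in progress output. If nothing prints at all, the debug log is the next place to look.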

My environment is built from a standard base image, with additional packages installed from requirements.txt. I've tried hundreds of variations of this, but this is the latest version.

FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch22x:biweekly.202408.2

# Install pip dependencies
COPY requirements.txt .

#RUN pip install scikit-build==0.16.7 --no-cache-dir
RUN pip install -r requirements.txt --no-cache-dir

# Inference requirements
COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20230419.v1 /artifacts /var/
RUN /var/requirements/install_system_requirements.sh && \
    cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
    cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
    ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
    rm -f /etc/nginx/sites-enabled/default
ENV SVDIR=/var/runit
ENV WORKER_TIMEOUT=500
EXPOSE 5001 8883 8888

# support Deepspeed launcher requirement of passwordless ssh login
RUN apt-get update
RUN apt-get install -y openssh-server openssh-client

With this in requirements.txt:

# Azure ML SDK v2 packages
azure-ai-ml==1.16.1
azure-core==1.30.2
azure-identity==1.17.1
azure-storage-blob==12.22.0
azure-storage-file-datalake==12.16.0

# PyTorch and related packages
torch==2.2.2  # Match the internal version if necessary
torch-nebula==0.16.13  # If needed, otherwise omit
torch-ort==1.17.0  # If needed, otherwise omit
torchaudio==2.2.2+cu121
torchdata==0.7.1
torchmetrics==1.2.0
torch-tb-profiler==0.4.3
torchvision==0.17.2+cu121

# Core scientific packages
numpy>=1.23.0,<2.0    # ==1.23.0
pandas==1.5.0
#scikit-image>=0.21.0
#SimpleITK==2.1.0
matplotlib==3.5.0
pydicom==2.3.0
pybind11==2.13.4
regex==2024.7.24

# Data handling and serialization
pyarrow==14.0.2  # Match the version in the successful environment
fsspec  # Match the successful environment's version ==2024.10.0

# Additional dependencies
albumentations==1.4.14  # As per your original list
mltable==1.6.1
tqdm==4.66.5
urllib3==2.2.2
cryptography==43.0.0
aiohttp==3.10.1
py-spy==0.3.12
debugpy==1.6.7.post1
ipykernel==6.29.5
tensorboard==2.17.1
psutil==5.8.0
Pillow==10.4.0
plotly==5.23.0
dcmstack==0.9.0
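To confirm that the running kernel really has the pinned versions listed above, the pins can be compared against what is installed. This is a minimal sketch under stated assumptions: it only understands exact `name==version` pins, skips range specifiers, comments and unpinned lines, and the helper names `parse_pins` and `find_mismatches` are hypothetical.

```python
import re
from importlib import metadata

# Matches exact pins such as "torch==2.2.2" at the start of a line.
PIN = re.compile(r"^([A-Za-z0-9._-]+)==([^\s#]+)")


def parse_pins(text):
    """Extract name==version pins, ignoring comments and unpinned lines."""
    pins = {}
    for line in text.splitlines():
        match = PIN.match(line.strip())
        if match:
            pins[match.group(1)] = match.group(2)
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed)} where installed differs or is missing (None)."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# A few lines copied from the requirements.txt above.
sample = """
azure-ai-ml==1.16.1
torch==2.2.2  # Match the internal version if necessary
numpy>=1.23.0,<2.0    # ==1.23.0
#scikit-image>=0.21.0
fsspec  # Match the successful environment's version ==2024.10.0
"""
print(parse_pins(sample))
```

In the notebook, `find_mismatches(parse_pins(open("requirements.txt").read()))` would list any package whose installed version differs from its pin in the active kernel.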

nataliameira commented

In the environment, find the task you ran and select it, then go to its logs.
