Too many files can cause the fingerprinter to fail on Windows #79

Open
lyhyl opened this issue May 31, 2023 · 6 comments

Comments

@lyhyl
Contributor

lyhyl commented May 31, 2023

I followed the tutorial and fed in my own data. The network execution finished, but most of the jobs/classification tasks failed.

...
 [INFO]   noderun:0592 >> Creating job for node fastr:///networks/WORC_BCMS_SY/0.0/runs/WORC_BCMS_SY_2023-05-30T21-36-06/nodelist/classification sample id <SampleId ('all',)>, index <SampleIndex (0)>
 [INFO] networkrun:0654 >> Queueing job WORC_BCMS_SY___classification___all___0
 [INFO]   noderun:0592 >> Creating job for node fastr:///networks/WORC_BCMS_SY/0.0/runs/WORC_BCMS_SY_2023-05-30T21-36-06/nodelist/performance sample id <SampleId ('all',)>, index <SampleIndex (0)>
 [INFO] networkrun:0654 >> Queueing job WORC_BCMS_SY___performance___all___0
 [INFO] networkrun:0657 >> Waiting for execution to finish...
 [INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_im_train_MRI_0___P137996 with status JobState.finished
 [INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_im_train_MRI_0___P141824 with status JobState.finished
...
 [WARNING] executionplugin:0341 >> Job WORC_BCMS_SY___featureconverter_train_predict_CalcFeatures_1_0_MRI_0___P744962 is no longer under processing, cannot cancel!
 [WARNING] executionplugin:0341 >> Job WORC_BCMS_SY___featureconverter_train_predict_CalcFeatures_1_0_MRI_0___P745979 is no longer under processing, cannot cancel!
 [WARNING] executionplugin:0341 >> Job WORC_BCMS_SY___featureconverter_train_predict_CalcFeatures_1_0_MRI_0___P80292 is no longer under processing, cannot cancel!
 [WARNING] executionplugin:0341 >> Job WORC_BCMS_SY___featureconverter_train_predict_CalcFeatures_1_0_MRI_0___P84221 is no longer under processing, cannot cancel!
 [WARNING] executionplugin:0341 >> Job WORC_BCMS_SY___featureconverter_train_predict_CalcFeatures_1_0_MRI_0___P85705 is no longer under processing, cannot cancel!
 [WARNING] executionplugin:0341 >> Job WORC_BCMS_SY___featureconverter_train_predict_CalcFeatures_1_0_MRI_0___P8632 is no longer under processing, cannot cancel!
 [WARNING] executionplugin:0341 >> Job WORC_BCMS_SY___featureconverter_train_predict_CalcFeatures_1_0_MRI_0___P92423 is no longer under processing, cannot cancel!
 [INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___config_classification_sink___all___0 with status JobState.finished
 [INFO] networkrun:0686 >> Chunk execution finished!
 [INFO] executionplugin:0523 >> Callback processing thread for LinearExecution ended!
 [INFO] networkrun:0688 >> ####################################
 [INFO] networkrun:0689 >> #    network execution FINISHED    #
 [INFO] networkrun:0690 >> ####################################
 [INFO] simplereport:0026 >> ===== RESULTS =====
 [INFO] simplereport:0036 >> classification: 0 success / 0 missing / 1 failed
 [INFO] simplereport:0036 >> config_MRI_0_sink: 0 success / 0 missing / 1 failed
 [INFO] simplereport:0036 >> config_classification_sink: 1 success / 0 missing / 0 failed
 [INFO] simplereport:0036 >> features_train_MRI_0_predict: 0 success / 0 missing / 328 failed
 [INFO] simplereport:0036 >> performance: 0 success / 0 missing / 1 failed
 [INFO] simplereport:0036 >> segmentations_out_segmentix_train_MRI_0: 0 success / 0 missing / 328 failed
 [INFO] simplereport:0037 >> ===================
 [WARNING] simplereport:0049 >> There were failed samples in the run, to start debugging you can run:

    fastr trace E:/WORC/Tmp\__sink_data__.json --sinks

see the debug section in the manual at https://fastr.readthedocs.io/en/develop/static/user_manual.html#debugging for more information.

I ran fastr trace E:/WORC/Tmp\__sink_data__.json --sinks and got:

 [WARNING]  __init__:0084 >> Not running in a production installation (branch "unknown" from installed package)
classification -- 1 failed -- 0 succeeded
config_MRI_0_sink -- 1 failed -- 0 succeeded
config_classification_sink -- 0 failed -- 1 succeeded
features_train_MRI_0_predict -- 328 failed -- 0 succeeded
performance -- 1 failed -- 0 succeeded
segmentations_out_segmentix_train_MRI_0 -- 328 failed -- 0 succeeded

Running on Windows, Python 3.7, installed via pip. WORC_config.py:

import os
import fastr
import pkg_resources
import site
import sys

# Get directory in which packages are installed
working_set = pkg_resources.working_set
requirement_spec = pkg_resources.Requirement.parse('WORC')
egg_info = working_set.find(requirement_spec)
if egg_info is None:  # Backwards compatibility with WORC2
    try:
        packagedir = site.getsitepackages()[0]
    except AttributeError:
        # Inside a virtual environment, so getsitepackages doesn't work.
        paths = sys.path
        for p in paths:
            if os.path.isdir(p) and os.path.basename(p) == 'site-packages':
                packagedir = p
else:
    packagedir = egg_info.location

# Add the WORC FASTR tools and type paths
tools_path = [os.path.join(packagedir, 'WORC', 'resources', 'fastr_tools')] + tools_path
types_path = [os.path.join(packagedir, 'WORC', 'resources', 'fastr_types')] + types_path

# Mounts accessible to fastr virtual file system
mounts['worc_example_data'] = os.path.join(packagedir, 'WORC', 'exampledata')
mounts['apps'] = os.path.expanduser(os.path.join('~', 'apps'))
# mounts['output'] = os.path.expanduser(os.path.join('~', 'WORC', 'output'))
mounts['output'] = "E:\\WORC\\output"
mounts['home'] = "E:\\WORC"
mounts['test'] = os.path.join(packagedir, 'WORC', 'resources', 'fastr_tests')

# The ITKFile type requires a preferred type when no specification is given.
# We will set it to Nifti, but you may change this.
preferred_types += ["NiftiImageFileCompressed"]

How can I debug this or find out whether anything is misconfigured?
BTW, does WORC have any pause-and-resume mechanism?

lyhyl changed the title from "Most of jobs/classification tasks failed, how to debug?" to "Most of jobs/classification tasks failed. How to locate the source of the error?" on May 31, 2023
@lyhyl
Contributor Author

lyhyl commented May 31, 2023

Update:

When I debug in VS Code, I get an error inside fastr's internals.

Exception has occurred: SystemExit
1
  File "C:\Users\user\anaconda3\envs\worc\Lib\site-packages\fastr\execution\executionscript.py", line 138, in execute_job
    sys.exit(1)  # Signal that the job failed
  File "C:\Users\user\anaconda3\envs\worc\Lib\site-packages\fastr\execution\executionscript.py", line 182, in main
    execute_job(joblist)
  File "C:\Users\user\anaconda3\envs\worc\Lib\site-packages\fastr\execution\executionscript.py", line 187, in <module>
    main()
SystemExit: 1

Callstack: [image]

Debug console print(job):

<Job
  id=WORC_BCMS_SY___fingerprinter_MRI_0___all
  tool=worc/Fingerprinter:1.0 1.0
  tmpdir=vfs://home/tmp/fingerprinter_MRI_0/all/>

Console output stops at:

...
 [INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P744961 with status JobState.finished
 [INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P744962 with status JobState.finished
 [INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P745979 with status JobState.finished
 [INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P80292 with status JobState.finished
 [INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P84221 with status JobState.finished
 [INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P85705 with status JobState.finished
 [INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P8632 with status JobState.finished
 [INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P92423 with status JobState.finished
 [INFO] networkrun:0668 >> Waiting for 1976 jobs:
 [INFO] networkrun:0676 >> WORC_BCMS_SY___fingerprinter_classification___all: JobState.running
 [INFO] networkrun:0676 >> WORC_BCMS_SY___fingerprinter_MRI_0___all: JobState.queued
 [INFO] networkrun:0676 >> WORC_BCMS_SY___config_classification_sink___all___0: JobState.hold
 [INFO] networkrun:0676 >> WORC_BCMS_SY___config_MRI_0_sink___all___0: JobState.hold
 [INFO] networkrun:0676 >> WORC_BCMS_SY___preprocessing_train_MRI_0___P108851: JobState.hold
 [INFO] networkrun:0677 >> ---- 1966 JOBS HIDDEN ----
 [INFO] networkrun:0679 >> WORC_BCMS_SY___features_train_MRI_0_predict___P8632___0: JobState.hold
 [INFO] networkrun:0679 >> WORC_BCMS_SY___features_train_MRI_0_predict___P92423___0: JobState.hold
 [INFO] networkrun:0679 >> WORC_BCMS_SY___plot_Estimator___all: JobState.hold
 [INFO] networkrun:0679 >> WORC_BCMS_SY___classification___all___0: JobState.hold
 [INFO] networkrun:0679 >> WORC_BCMS_SY___performance___all___0: JobState.hold
 [INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___fingerprinter_classification___all with status JobState.finished

No more info about what's going on.

@lyhyl
Contributor Author

lyhyl commented May 31, 2023

Update 2:
Finished job WORC_SY___fingerprinter_MRI_0___all with status JobState.failed

E:\WORC\Tmp\fingerprinter_MRI_0\all_fastr_stderr_.txt:

Traceback (most recent call last):
  File "C:\Users\user\anaconda3\envs\worc\lib\site-packages\fastr\core\target.py", line 191, in call_subprocess
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
  File "C:\Users\user\anaconda3\envs\worc\lib\subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "C:\Users\user\anaconda3\envs\worc\lib\subprocess.py", line 1207, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 206] The filename or extension is too long

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\user\anaconda3\envs\worc\lib\site-packages\fastr\execution\executionscript.py", line 89, in execute_job
    job.execute()
  File "C:\Users\user\anaconda3\envs\worc\lib\site-packages\fastr\execution\job.py", line 798, in execute
    result = tool.execute(payload)
  File "C:\Users\user\anaconda3\envs\worc\lib\site-packages\fastr\core\tool.py", line 398, in execute
    result = self.interface.execute(target, payload)
  File "C:\Users\user\anaconda3\envs\worc\lib\site-packages\fastr\resources\plugins\interfaceplugins\fastrinterface.py", line 471, in execute
    target_result = target.run_command(command)
  File "C:\Users\user\anaconda3\envs\worc\lib\site-packages\fastr\resources\plugins\targetplugins\localbinarytarget.py", line 278, in run_command
    return self.call_subprocess(command)
  File "C:\Users\user\anaconda3\envs\worc\lib\site-packages\fastr\core\target.py", line 194, in call_subprocess
    raise exceptions.FastrExecutableNotFoundError(command[0])
fastr.exceptions.FastrExecutableNotFoundError: Could not find executable "python" on PATH: 

My data path is flat and shallow.

E:/Data/000001.nrrd
E:/Data/seg-000001.nrrd
E:/Data/000002.nrrd
E:/Data/seg-000002.nrrd
E:/Data/000003.nrrd
...

@lyhyl
Contributor Author

lyhyl commented Jun 1, 2023

Solved the case:

When the fingerprinter job is processed, too many input files make WORC crash on Windows:
[image]

len(command)

668

len(" ".join(command))

38655

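On Windows, subprocess.Popen uses the CreateProcess() function (ref).
CreateProcess(lpApplicationName, lpCommandLine, ...) limits lpCommandLine to a maximum length of 32,767 characters (ref). The fingerprinter command here is 38,655 characters long, so process creation fails with WinError 206, which fastr then reports as FastrExecutableNotFoundError for "python".

Thus, file names should not be passed as command-line arguments. It is better to save them to a file and then pass that list file as input.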
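
To check in advance whether a command would hit this limit, something like the following sketch could be used. It relies on subprocess.list2cmdline, the helper Python uses to join an argument list into one string on Windows. The tool name and file paths are made up for illustration, and the "+ 1" assumes that the 32,767-character limit includes the terminating null character.

```python
import subprocess

# Upper bound for CreateProcess's lpCommandLine, as quoted above.
WINDOWS_CMDLINE_LIMIT = 32767


def command_line_length(command):
    """Length of the single string that Popen builds from an argument list on Windows."""
    return len(subprocess.list2cmdline(command))


def exceeds_windows_limit(command, limit=WINDOWS_CMDLINE_LIMIT):
    """True if the joined command line would not fit within the limit.

    The + 1 accounts for the terminating null character (an assumption
    about how the limit is counted).
    """
    return command_line_length(command) + 1 > limit


# Hypothetical command: the tool name and the paths are illustrative only.
command = ["python", "fingerprinter_tool.py", "--images"] + [
    f"E:/WORC/Tmp/example_job_{i:06d}/image.nii.gz" for i in range(1000)
]

print(command_line_length(command), exceeds_windows_limit(command))
```

The same check works on Linux too, since list2cmdline is plain Python, so a long command can be spotted before moving an experiment to Windows.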

lyhyl changed the title from "Most of jobs/classification tasks failed. How to locate the source of the error?" to "Too many files can cause the fingerprinter to fail on Windows" on Jun 1, 2023
@MStarmans91
Owner

Glad you found the issue. If you run into issues again, always use fastr trace to trace the error back to a specific sample in a specific sink, see https://fastr.readthedocs.io/en/stable/static/user_manual.html#debugging-a-network-run-with-errors.

Regarding the error, it's difficult to change the command-line execution so that such arguments are passed as lists, as WORC uses the fastr package for this and does not do it itself. I will ask the fastr developers whether they can change this. I would suggest either manually executing the command now that you have found it or, perhaps easier, reducing the number of images used for fingerprinting. The default number of images for fingerprinting is 100, see config['Fingerprinting']['max_num_image'] in the config (https://worc.readthedocs.io/en/latest/static/configuration.html#fingerprinting). I set that pretty high just to be sure it's enough, but 10 - 20 should also be enough. If you're using SimpleWORC or BasicWORC, just change this using the add_config_overrides function of those objects.
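
As a rough sketch of what such an override could look like: the section and key names come from the paragraph above, while passing the value as a string and the merge behavior of add_config_overrides are assumptions. The ExperimentStandIn class is a hypothetical placeholder so the snippet runs without WORC installed. With WORC, the same dictionary would be passed to the add_config_overrides function of a SimpleWORC or BasicWORC object.

```python
# Section and key names are taken from the paragraph above; "20" is the upper
# end of the suggested 10 - 20 range. Passing it as a string is an assumption.
overrides = {"Fingerprinting": {"max_num_image": "20"}}


class ExperimentStandIn:
    """Hypothetical stand-in for a SimpleWORC/BasicWORC object (not WORC code)."""

    def __init__(self):
        self.config_overrides = {}

    def add_config_overrides(self, config):
        # Assumed behavior: merge the given sections into the stored overrides.
        for section, values in config.items():
            self.config_overrides.setdefault(section, {}).update(values)


experiment = ExperimentStandIn()
experiment.add_config_overrides(overrides)

# A numeric string can be converted with int() when the value is read back.
max_num_images = int(experiment.config_overrides["Fingerprinting"]["max_num_image"])
print(max_num_images)
```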

Hope that helps.

@lyhyl
Contributor Author

lyhyl commented Jun 5, 2023

Thank you for your framework and reply. But I'm afraid this option won't help with this issue: looking back at the previous error, len(command) equals 668, far greater than the default value of config['Fingerprinting']['max_num_image'], i.e., 100.
The option config['Fingerprinting']['max_num_image'] is used internally in:

max_num_images = int(config['Fingerprinting']['max_num_image'])
if len(self.images) > max_num_images:
    self.images = self.images[0:max_num_images]
    # FIXME
    if self.segmentations is not None:
        print('FIXME: segmentations is None')
        self.segmentations = self.segmentations[0:max_num_images]

However, the failure mentioned above occurs when fastr creates the subprocess (i.e., when it starts a queued job). At that point, the fingerprinting process does not exist yet, so the option has not been applied. It is more of a limitation of fastr on the Windows platform. It seems necessary to refactor the input/output format of Fingerprinting and related parts. Perhaps I can help.
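
One possible shape for such a refactor, sketched with the standard library only: argparse can expand an "@file" argument into the arguments listed in that file, one per line, via fromfile_prefix_chars. The parser below is a hypothetical stand-in, not the actual Fingerprinter tool, and the option names and file names are illustrative. Whether fastr can be made to build the command this way is a separate question.

```python
import argparse
import os
import tempfile


def build_parser():
    # fromfile_prefix_chars makes argparse replace "@path" with the arguments
    # read from that file (one argument per line).
    parser = argparse.ArgumentParser(
        prog="fingerprinter_sketch", fromfile_prefix_chars="@"
    )
    parser.add_argument("--images", nargs="+", default=[])
    parser.add_argument("--segmentations", nargs="+", default=[])
    return parser


def write_args_file(path, images, segmentations):
    """Write the long argument list to a text file, one argument per line."""
    lines = ["--images", *images, "--segmentations", *segmentations]
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")


# Illustrative inputs following the flat layout described in this thread.
images = [f"E:/Data/{i:06d}.nrrd" for i in range(1, 329)]
segmentations = [f"E:/Data/seg-{i:06d}.nrrd" for i in range(1, 329)]

with tempfile.TemporaryDirectory() as tmpdir:
    args_file = os.path.join(tmpdir, "fingerprinter_args.txt")
    write_args_file(args_file, images, segmentations)

    # The command stays three items long no matter how many files are listed.
    command = ["python", "fingerprinter_sketch.py", "@" + args_file]
    args = build_parser().parse_args(command[2:])

print(len(command), len(args.images), len(args.segmentations))
```

With this pattern the command-line length depends only on the path of the list file, not on the number of images.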

On the other hand, I tested my code on Linux. It seems to run very well, except for a minor issue:
The calcFeat jobs fail with pyradiomics=3.1.0, which was released 3 weeks ago. A workaround is to install pyradiomics=3.0.1 first, then install WORC. I have submitted an issue: AIM-Harvard/pyradiomics#831.

BTW, does WORC have any pause-and-resume mechanism? If not, could such functionality be added? Debugging with one's own data is indeed time-consuming: catch an error, run again from scratch, catch an error, run again from scratch, ... 😢

@MStarmans91
Owner

Thanks for the detailed reply! I was hoping you wouldn't hit the limit this way, but you're right, you still do, and this is a general limit on Windows. I've raised an issue at the fastr package, which, as I mentioned, performs the execution and is thus responsible for this limitation, see https://gitlab.com/radiology/infrastructure/fastr/-/issues/1. For small experiments like the tutorial everything works fine, but if you perform larger experiments with more data, this issue persists.

In the meantime, glad everything ran smoothly on Linux; hope pyradiomics fixes the bug soon.

WORC does have a pause-and-resume mechanism built in. Again, this falls back on fastr, which performs the execution, see also https://fastr.readthedocs.io/en/stable/static/user_manual.html#continuing-a-network. In summary, fastr saves all temporary output in a folder named after the experiment, and if you run an experiment with the same name, it checks which jobs previously completed successfully and reruns only the others. Hence, as long as you keep the experiment name the same in WORC, e.g., https://github.com/MStarmans91/WORCTutorial/blob/master/WORCTutorialBasic.py#L86 of the WORCTutorialBasic, WORC will automatically resume from where it ended previously. Note that it will look like all jobs are still running and nothing is skipped, but these jobs just check whether the previous instances ran successfully and the output is valid, so this should be very quick.
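
To illustrate the idea only (this is not fastr's implementation, and the directory layout, file name, and function names are hypothetical), a resume mechanism keyed on the experiment name could be sketched like this: each job first looks for valid output from a previous run under the same experiment folder and recomputes only when none is found.

```python
import json
import os
import tempfile


def run_job(tmp_root, experiment_name, job_id, compute):
    """Run a job, reusing valid output from an earlier run with the same experiment name.

    Returns (result, reused). The "result.json" layout is a made-up convention.
    """
    job_dir = os.path.join(tmp_root, experiment_name, job_id)
    result_file = os.path.join(job_dir, "result.json")

    if os.path.isfile(result_file):
        try:
            with open(result_file) as handle:
                return json.load(handle), True  # previous output is valid: reuse it
        except ValueError:
            pass  # output exists but is corrupt: fall through and rerun

    os.makedirs(job_dir, exist_ok=True)
    result = compute()
    with open(result_file, "w") as handle:
        json.dump(result, handle)
    return result, False


calls = []


def expensive_step():
    calls.append(1)  # count how often the real work is done
    return {"n_features": 3}


with tempfile.TemporaryDirectory() as tmp_root:
    first, reused_first = run_job(tmp_root, "my_experiment", "features_P001", expensive_step)
    second, reused_second = run_job(tmp_root, "my_experiment", "features_P001", expensive_step)
    # A different experiment name points at a different folder, so nothing is reused.
    third, reused_third = run_job(tmp_root, "other_experiment", "features_P001", expensive_step)

print(reused_first, reused_second, reused_third, len(calls))
```

In this sketch every job is still "started" on the second run, which mirrors the note above that all jobs appear to run even though the check returns quickly.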

I'll leave this issue open until there is a fix in fastr.
