Intermittent failures on Processing job that depends on ProcessingOutput in Continuous mode #2382
Replies: 4 comments
-
Sorry for the delayed response here. We've passed this along to the SageMaker Processing team to see if they have any insight. Thanks for your patience!
-
Hi @ram-nadella - I am from the SageMaker Processing team. Apologies that you ran into this error.
Could you please clarify what you mean by "try to use the S3 file path in the same job"? It would be helpful for us to debug the issue if you could provide the ARN of the SageMaker Processing Job that experienced this error.
Thanks for your feedback regarding the unclear documentation. We've noted this and are working towards improving it.
-
Hi @jigsaw004. The logic in our processing job looks something like this:
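Roughly, in illustrative form (the local path, bucket, table name and connection details below are placeholders rather than our real values):

```python
import sqlalchemy

# Step 1: write the generated output to the ProcessingOutput local source path.
# SageMaker uploads this directory to the configured S3 destination
# (in our case with the output mode set to Continuous).
local_path = "/opt/ml/processing/output/results.csv"
with open(local_path, "w") as f:
    f.write("id,value\n")
    f.write("1,42\n")

# Step 2: load the uploaded file from the ProcessingOutput S3 destination
# into Redshift with a COPY, issued through sqlalchemy.
engine = sqlalchemy.create_engine(
    "postgresql+psycopg2://user:password@example-redshift-host:5439/example_db"
)
with engine.begin() as conn:
    conn.execute(sqlalchemy.text("""
        COPY example_table
        FROM 's3://example-bucket/processing-output/results.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
        FORMAT AS CSV
        IGNOREHEADER 1
    """))
```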
The last step is where we are seeing intermittent failures. My original issue description has more details about both the specific steps and the actual error message. The relevant parts are copied below:
Thanks for the clarification regarding Continuous mode. I think it's the near real-time aspect of Continuous mode that is affecting us here: we want to generate output to S3 and also load the contents of the generated file into a database (Redshift).
-
Thanks for the comment, @ram-nadella, much appreciated! :) I saw your initial post and the clarification above, and I now have a better sense of what the problem could be here.
SageMaker tries to upload the contents written to disk in near real time. Note that the upload to S3 may experience intermittent delays and/or retries, which govern the overall time between the write to the local disk and the file becoming available in S3.

To work around this behaviour, could I ask that you add retry logic in your Processing container that waits for the file to be available in S3, i.e. for the S3 prefix to become valid? Since performance is a critical criterion here, it would make sense to make the retry logic as aggressive as possible. If you still see intermittent failures after adding the retry logic (or if you had already implemented it before filing this issue), could you please provide Job ARNs that we can investigate?
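For example, something along these lines, assuming the output lands at a known key (the bucket and key below are placeholders); boto3's built-in `object_exists` waiter polls until the uploaded file is visible before the Redshift COPY is issued:

```python
import boto3

# Placeholders: substitute the real ProcessingOutput destination bucket/key.
bucket = "example-bucket"
key = "processing-output/results.csv"

s3 = boto3.client("s3")

# Poll aggressively (every second, up to ~2 minutes) until the object
# uploaded by the Continuous-mode ProcessingOutput is visible in S3.
waiter = s3.get_waiter("object_exists")
waiter.wait(
    Bucket=bucket,
    Key=key,
    WaiterConfig={"Delay": 1, "MaxAttempts": 120},
)

# At this point it should be safe to run the Redshift COPY against the S3 path.
```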
-
Describe the bug
Processing job fails intermittently when using ProcessingOutput in Continuous mode and the destination S3 path is used within the same job to load data into Redshift. We are using sqlalchemy to run the COPY command on Redshift. Logs below.

We have had successful runs of this processing job, but it sometimes fails with an error about an S3 prefix not being present. The same container image is used for all runs.
To reproduce
Create a processing job that uses ProcessingOutput with the output mode set to Continuous (instead of EndOfJob) and try to use the S3 file path within the same job. In our case, we are loading data from the file into Redshift.
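A minimal sketch of such a setup with the SageMaker Python SDK (the role, image URI, script name and S3 paths below are placeholders); `s3_upload_mode` is the parameter that selects Continuous rather than EndOfJob uploads:

```python
from sagemaker.processing import ProcessingOutput, ScriptProcessor

processor = ScriptProcessor(
    role="arn:aws:iam::123456789012:role/example-sagemaker-role",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-processing-image:latest",
    command=["python3"],
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

processor.run(
    code="process_and_load.py",  # writes the output locally, then runs the Redshift COPY
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://example-bucket/processing-output/",
            # Continuous uploads files as they are written, instead of at end of job.
            s3_upload_mode="Continuous",
        )
    ],
)
```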
Expected behavior
In Continuous mode, we expect the S3 path to be accessible and ready as soon as the file write to the local path completes.
Also, the exact behaviour of Continuous mode in ProcessingOutput appears to be undocumented.
Screenshots or logs
Truncated log lines (full logs below):
System information
SageMaker Python SDK version: 1.50.10
Additional context