sagemaker job failing in transformation step - Factorization Machines #2354
Replies: 6 comments
-
Hi a-torrano-m, could you please share the training job hyperparameters as well?
-
Hi yatasho,
-
Could these hyperparameters be tested? Is the error reproducible? Thanks.
-
Has a cause been found for the issue? Thanks.
-
Hi a-torrano-m, the error is reproducible. We will work on a fix. Thanks for reporting the issue.
-
Thanks yatasho! Have you opened a Jira ticket or an issue we could read to follow its progress? Otherwise, we will wait for news in this thread.
-
Reference: SMAlgo-314
System Information
Describe the problem
We are trying to produce recommendations using SageMaker with Factorization Machines. We feed the model a sparse matrix of 45000 rows and 15000 columns. Training completes successfully. The batch transform stage then fails during the call to wait(); the exception directs us to the logs, where the message is: “Unable to get response from algorithm.”
Minimal repro / logs
EXCEPTION OUTPUT:
ValueError Traceback (most recent call last)
in ()
13 print(datetime.datetime.now().time())
14
---> 15 fmTr.wait()
16 print(datetime.datetime.now().time())
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self)
205 def wait(self):
206 self._ensure_last_transform_job()
--> 207 self.latest_transform_job.wait()
208
209 def _ensure_last_transform_job(self):
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self)
304
305 def wait(self):
--> 306 self.sagemaker_session.wait_for_transform_job(self.job_name)
307
308 @staticmethod
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in wait_for_transform_job(self, job, poll)
1004 """
1005 desc = _wait_until(lambda: _transform_job_status(self.sagemaker_client, job), poll)
-> 1006 self._check_job_status(job, desc, "TransformJobStatus")
1007 return desc
1008
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
1026 reason = desc.get("FailureReason", "(No reason provided)")
1027 job_type = status_key_name.replace("JobStatus", " job")
-> 1028 raise ValueError("Error for {} {}: {} Reason: {}".format(job_type, job, status, reason))
1029
1030 def wait_for_endpoint(self, endpoint, poll=5):
ValueError: Error for Transform job factorization-machines-2019-08-01-09-40-45-581: Failed Reason: InternalServerError: We encountered an internal error. Please try again.
LOG MESSAGE :
2019-08-01T09:44:02.787:[sagemaker logs]: MaxConcurrentTransforms=4, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
2019-08-01T09:45:48.275:[sagemaker logs]: (...bucket and key...)/BATCH_jobName.csv000.json: Unable to get response from algorithm
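For readers following the traceback: the ValueError does not come from the algorithm itself. It is raised by the SDK's status check, which copies the job's FailureReason into the message. The sketch below is a simplified stand-in for that check, rebuilt only from the three lines visible in the traceback. The function name `check_job_status`, the `status != "Completed"` condition, and the `desc` dictionary are assumptions made for illustration, not the SDK's actual code.

```python
def check_job_status(job, desc, status_key_name):
    """Simplified stand-in for the status check shown in the traceback.

    Assumption: any status other than "Completed" is treated as a failure.
    """
    status = desc[status_key_name]
    if status != "Completed":
        # These three statements mirror lines 1026-1028 of the traceback.
        reason = desc.get("FailureReason", "(No reason provided)")
        job_type = status_key_name.replace("JobStatus", " job")
        raise ValueError(
            "Error for {} {}: {} Reason: {}".format(job_type, job, status, reason)
        )


# Hypothetical describe-style response, filled in with the values reported above.
desc = {
    "TransformJobStatus": "Failed",
    "FailureReason": (
        "InternalServerError: We encountered an internal error. Please try again."
    ),
}

try:
    check_job_status(
        "factorization-machines-2019-08-01-09-40-45-581", desc, "TransformJobStatus"
    )
except ValueError as err:
    message = str(err)
    print(message)
```

Because the message only repeats FailureReason, the CloudWatch log line ("Unable to get response from algorithm") is the more specific clue of the two.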