TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead. #51

Open
angelcymak opened this issue Jul 10, 2024 · 1 comment


Encountered an MPS error (see title) when running this demo script. Please advise.

## train ST
st.train_and_fit(
    callbacks=[cb_early_stopping],
    logger=[log_tb],
)

Error

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name | Type | Params
------------------------------
------------------------------
0         Trainable params
0         Non-trainable params
0         Total params
0.000     Total estimated model params size (MB)
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:436: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py:293: The number of training batches (27) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
Epoch 0: 0%
0/27 [00:00<?, ?it/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [10], in <cell line: 2>()
      1 ## train ST
----> 2 st.train_and_fit(
      3     callbacks=[cb_early_stopping],
      4     logger=[log_tb],
      5 )

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/starling/starling.py:353, in ST.train_and_fit(self, accelerator, strategy, devices, num_nodes, precision, logger, callbacks, fast_dev_run, max_epochs, min_epochs, max_steps, min_steps, max_time, limit_train_batches, limit_val_batches, limit_test_batches, limit_predict_batches, overfit_batches, val_check_interval, check_val_every_n_epoch, num_sanity_val_steps, log_every_n_steps, enable_checkpointing, enable_progress_bar, enable_model_summary, accumulate_grad_batches, gradient_clip_val, gradient_clip_algorithm, deterministic, benchmark, inference_mode, use_distributed_sampler, profiler, detect_anomaly, barebones, plugins, sync_batchnorm, reload_dataloaders_every_n_epochs, default_root_dir)
    349 _locals.pop("self")
    351 trainer = pl.Trainer(**_locals)
--> 353 trainer.fit(self)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:545, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    543 self.state.status = TrainerStatus.RUNNING
    544 self.training = True
--> 545 call._call_and_handle_interrupt(
    546     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    547 )

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:44, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     42     if trainer.strategy.launcher is not None:
     43         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 44     return trainer_fn(*args, **kwargs)
     46 except _TunerExitException:
     47     _call_teardown_hook(trainer)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:581, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    574 assert self.state.fn is not None
    575 ckpt_path = self._checkpoint_connector._select_ckpt_path(
    576     self.state.fn,
    577     ckpt_path,
    578     model_provided=True,
    579     model_connected=self.lightning_module is not None,
    580 )
--> 581 self._run(model, ckpt_path=ckpt_path)
    583 assert self.state.stopped
    584 self.training = False

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:990, in Trainer._run(self, model, ckpt_path)
    985 self._signal_connector.register_signal_handlers()
    987 # ----------------------------
    988 # RUN THE TRAINER
    989 # ----------------------------
--> 990 results = self._run_stage()
    992 # ----------------------------
    993 # POST-Training CLEAN UP
    994 # ----------------------------
    995 log.debug(f"{self.__class__.__name__}: trainer tearing down")

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:1036, in Trainer._run_stage(self)
   1034         self._run_sanity_check()
   1035     with torch.autograd.set_detect_anomaly(self._detect_anomaly):
-> 1036         self.fit_loop.run()
   1037     return None
   1038 raise RuntimeError(f"Unexpected state {self.state}")

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py:202, in _FitLoop.run(self)
    200 try:
    201     self.on_advance_start()
--> 202     self.advance()
    203     self.on_advance_end()
    204     self._restarting = False

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py:359, in _FitLoop.advance(self)
    357 with self.trainer.profiler.profile("run_training_epoch"):
    358     assert self._data_fetcher is not None
--> 359     self.epoch_loop.run(self._data_fetcher)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py:136, in _TrainingEpochLoop.run(self, data_fetcher)
    134 while not self.done:
    135     try:
--> 136         self.advance(data_fetcher)
    137         self.on_advance_end(data_fetcher)
    138         self._restarting = False

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py:213, in _TrainingEpochLoop.advance(self, data_fetcher)
    211     batch = trainer.precision_plugin.convert_input(batch)
    212     batch = trainer.lightning_module._on_before_batch_transfer(batch, dataloader_idx=0)
--> 213     batch = call._call_strategy_hook(trainer, "batch_to_device", batch, dataloader_idx=0)
    215 self.batch_progress.increment_ready()
    216 trainer._logger_connector.on_batch_start(batch)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:309, in _call_strategy_hook(trainer, hook_name, *args, **kwargs)
    306     return None
    308 with trainer.profiler.profile(f"[Strategy]{trainer.strategy.__class__.__name__}.{hook_name}"):
--> 309     output = fn(*args, **kwargs)
    311 # restore current_fx when nested context
    312 pl_module._current_fx_name = prev_fx_name

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py:269, in Strategy.batch_to_device(self, batch, device, dataloader_idx)
    267 device = device or self.root_device
    268 if model is not None:
--> 269     return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)
    270 return move_data_to_device(batch, device)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/core/module.py:333, in LightningModule._apply_batch_transfer_handler(self, batch, device, dataloader_idx)
    329 def _apply_batch_transfer_handler(
    330     self, batch: Any, device: Optional[torch.device] = None, dataloader_idx: int = 0
    331 ) -> Any:
    332     device = device or self.device
--> 333     batch = self._call_batch_hook("transfer_batch_to_device", batch, device, dataloader_idx)
    334     batch = self._call_batch_hook("on_after_batch_transfer", batch, dataloader_idx)
    335     return batch

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/core/module.py:322, in LightningModule._call_batch_hook(self, hook_name, *args)
    319     else:
    320         trainer_method = call._call_lightning_datamodule_hook
--> 322     return trainer_method(trainer, hook_name, *args)
    323 hook = getattr(self, hook_name)
    324 return hook(*args)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:157, in _call_lightning_module_hook(trainer, hook_name, pl_module, *args, **kwargs)
    154 pl_module._current_fx_name = hook_name
    156 with trainer.profiler.profile(f"[LightningModule]{pl_module.__class__.__name__}.{hook_name}"):
--> 157     output = fn(*args, **kwargs)
    159 # restore current_fx when nested context
    160 pl_module._current_fx_name = prev_fx_name

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/core/hooks.py:583, in DataHooks.transfer_batch_to_device(self, batch, device, dataloader_idx)
    532 def transfer_batch_to_device(self, batch: Any, device: torch.device, dataloader_idx: int) -> Any:
    533     """Override this hook if your :class:`~torch.utils.data.DataLoader` returns tensors wrapped in a custom data
    534     structure.
    535 
   (...)
    581 
    582     """
--> 583     return move_data_to_device(batch, device)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/lightning_fabric/utilities/apply_func.py:102, in move_data_to_device(batch, device)
     99     # user wrongly implemented the `_TransferableDataType` and forgot to return `self`.
    100     return data
--> 102 return apply_to_collection(batch, dtype=_TransferableDataType, function=batch_to)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py:66, in apply_to_collection(data, dtype, function, wrong_dtype, include_none, allow_frozen, *args, **kwargs)
     64     return function(data, *args, **kwargs)
     65 if data.__class__ is list and all(isinstance(x, dtype) for x in data):  # 1d homogeneous list
---> 66     return [function(x, *args, **kwargs) for x in data]
     67 if data.__class__ is tuple and all(isinstance(x, dtype) for x in data):  # 1d homogeneous tuple
     68     return tuple(function(x, *args, **kwargs) for x in data)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py:66, in <listcomp>(.0)
     64     return function(data, *args, **kwargs)
     65 if data.__class__ is list and all(isinstance(x, dtype) for x in data):  # 1d homogeneous list
---> 66     return [function(x, *args, **kwargs) for x in data]
     67 if data.__class__ is tuple and all(isinstance(x, dtype) for x in data):  # 1d homogeneous tuple
     68     return tuple(function(x, *args, **kwargs) for x in data)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/lightning_fabric/utilities/apply_func.py:96, in move_data_to_device.<locals>.batch_to(data)
     94 if isinstance(data, Tensor) and isinstance(device, torch.device) and device.type not in _BLOCKING_DEVICE_TYPES:
     95     kwargs["non_blocking"] = True
---> 96 data_output = data.to(device, **kwargs)
     97 if data_output is not None:
     98     return data_output

TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
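
Two possible workarounds until this is fixed upstream, both untested sketches. The `accelerator` argument is genuinely part of `train_and_fit`'s signature (visible in the starling.py frame above); the float32 cast assumes the offending float64 batches come from the numpy matrix of the AnnData used to construct `st`, and `adata` is a placeholder name for that object.

import numpy as np

## Workaround 1: keep training on the CPU, which supports float64.
## `accelerator` is an explicit parameter of train_and_fit (see the
## signature in the starling.py frame of the traceback above).
st.train_and_fit(
    callbacks=[cb_early_stopping],
    logger=[log_tb],
    accelerator="cpu",
)

## Workaround 2 (assumption): if the float64 batches originate from the
## matrix of the AnnData used to build the ST object, downcast it to
## float32 *before* constructing st so the tensors are MPS-compatible.
## `adata` is a hypothetical name for that AnnData.
adata.X = adata.X.astype(np.float32)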

joaolsf commented Jan 21, 2025

I encountered the exact same error. Is there any update on a fix?
