@gakkiri gakkiri commented Sep 11, 2025

Description

Fix serialization failures in the _set_wandb_writer function, which occur when the argument configuration passed to wandb.init() contains values that cannot be JSON-serialized.

Problem

The current implementation directly passes the args namespace to wandb as configuration, which can fail when the args contain non-serializable objects such as:

  • bytes objects
  • torch.Tensor instances
  • Function/callable objects
  • Type objects
  • Other objects that cannot be JSON-serialized

This raises a serialization error during wandb initialization and prevents logging from working. In the traceback below, for example, a dtype value in the config triggers the failure:

  File "/root/Megatron-LM/megatron/training/global_vars.py", line 211, in _set_wandb_writer
    wandb.init(**wandb_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1620, in init
    wandb._sentry.reraise(e)
  File "/usr/local/lib/python3.10/dist-packages/wandb/analytics/sentry.py", line 157, in reraise
    raise exc.with_traceback(sys.exc_info()[2])
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1606, in init
    return wi.init(run_settings, run_config, run_printer)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 981, in init
    run_init_handle = backend.interface.deliver_run(run)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 909, in deliver_run
    run_record = self._make_run(run)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 182, in _make_run
    self._make_config(data=config_dict, obj=proto_run.config)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 125, in _make_config
    update.value_json = json_dumps_safer(json_friendly(v)[0])
  File "/usr/local/lib/python3.10/dist-packages/wandb/util.py", line 806, in json_dumps_safer
    return dumps(obj, cls=WandBJSONEncoder, **kwargs)
  File "/usr/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib/python3.10/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.10/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/local/lib/python3.10/dist-packages/wandb/util.py", line 755, in default
    return json.JSONEncoder.default(self, obj)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type dtype is not JSON serializable
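
The same class of failure can be reproduced without wandb or torch. The following is a minimal sketch under stated assumptions: `FakeDtype`, `try_dump`, and the `params_dtype` field name are hypothetical and used only for illustration, and the standard library's `json.dumps` stands in for wandb's encoder, which per the traceback ends up in `json.JSONEncoder.default`.

```python
import argparse
import json


class FakeDtype:
    """Hypothetical stand-in for a value such as a dtype (not a real torch type)."""


def try_dump(config):
    """Return None if config is JSON-serializable, else the TypeError message."""
    try:
        json.dumps(config)
    except TypeError as exc:
        return str(exc)
    return None


# A namespace resembling parsed training args; field names are illustrative only.
args = argparse.Namespace(lr=1e-4, params_dtype=FakeDtype())
error = try_dump(vars(args))
print(error)
```

Removing the `params_dtype` entry, or converting it to a string first, makes `try_dump` return `None`, which is the effect the sanitization step aims for.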

Solution

Added a comprehensive sanitization function _clean() that:

  1. Filters out non-serializable types: Removes type and callable objects from the configuration
  2. Converts tensors to serializable format: Automatically converts torch.Tensor and numpy arrays to lists using .tolist()
  3. Handles bytes gracefully: Converts bytes to UTF-8 strings with error handling
  4. Validates JSON compatibility: Performs a final JSON serialization check to ensure all values are safe
  5. Provides fallback handling: Uses repr() as a last resort for any remaining problematic objects

Changes Made

  • Enhanced _set_wandb_writer() function in megatron/training/global_vars.py
  • Added _clean() helper function for recursive sanitization
  • Maintained full backward compatibility
  • Preserved all existing functionality while improving robustness

Testing

  • Preserves existing wandb logging behavior
  • Handles common non-serializable objects (bytes, tensors, callables)
  • Maintains backward compatibility with existing configurations
  • No breaking changes to the API

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Impact

This fix ensures that wandb logging works reliably across different training configurations, especially when using complex argument setups that may include tensors, custom types, or other non-serializable objects. The change is minimal and focused, reducing the risk of introducing new issues while solving a real-world problem that can prevent proper experiment tracking.

Related Issues

Fixes potential TypeError and ValueError exceptions during wandb initialization when args contain non-serializable objects.

- Add sanitization for non-serializable objects (bytes, types, callables)
- Handle torch.Tensor and numpy arrays by converting to lists
- Add JSON safety check as final validation
- Preserve existing functionality while ensuring wandb config is serializable

copy-pr-bot bot commented Sep 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@sbhavani sbhavani added bug Something isn't working module: debugging labels Sep 14, 2025