Skip to content

Default to mig-strategy none if not specified #590

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 15, 2024

Conversation

elezar
Copy link
Member

@elezar elezar commented Mar 13, 2024

This change ensures that a mig-strategy=none is used if no mig strategy is explicitly specified when building the map of required devices.

This also removes the need to explicitly check for nil values when querying this value.

The MPS control daemon segfaults from the operator when dereferencing the nil mig strategy value in the config:

I0313 12:02:08.001435     146 main.go:72] Starting NVIDIA MPS Control Daemon 40ff9bf4
I0313 12:02:08.001626     146 main.go:55] "Starting NVIDIA MPS Control Daemon" version="40ff9bf4"
I0313 12:02:08.001645     146 main.go:101] Starting OS watcher.
I0313 12:02:08.001867     146 main.go:115] Starting Daemons.
I0313 12:02:08.001901     146 main.go:158] Loading configuration.
I0313 12:02:08.002392     146 main.go:166] Updating config with default resource matching patterns.
I0313 12:02:08.002445     146 main.go:177]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": null,
    "failOnInitError": null,
    "gdsEnabled": null,
    "mofedEnabled": null,
    "plugin": {
      "passDeviceSpecs": null,
      "deviceListStrategy": null,
      "deviceIDStrategy": null,
      "cdiAnnotationPrefix": null,
      "nvidiaCTKPath": null,
      "containerDriverRoot": null
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {},
    "mps": {
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 10
        }
      ]
    }
  }
}
I0313 12:02:08.002455     146 main.go:181] Retrieving MPS daemons.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xaff1ba]

goroutine 1 [running]:
github.com/NVIDIA/k8s-device-plugin/internal/rm.(*deviceMapBuilder).buildDeviceMapFromConfigResources(0xc000128d20)
	/build/internal/rm/device_map.go:69 +0x9a
github.com/NVIDIA/k8s-device-plugin/internal/rm.(*deviceMapBuilder).build(0xc000128d20)
	/build/internal/rm/device_map.go:51 +0x25
github.com/NVIDIA/k8s-device-plugin/internal/rm.NewDeviceMap({0xdbf200, 0xc000118d90}, 0xc0001200b0)
	/build/internal/rm/device_map.go:46 +0x12f
github.com/NVIDIA/k8s-device-plugin/internal/rm.NewNVMLResourceManagers({0xdbf200, 0xc000118d90}, 0xc0001200b0)
	/build/internal/rm/nvml_manager.go:49 +0xee
github.com/NVIDIA/k8s-device-plugin/cmd/mps-control-daemon/mps.(*manager).Daemons(0xc00018f988?)
	/build/cmd/mps-control-daemon/mps/manager.go:71 +0x3c
github.com/NVIDIA/k8s-device-plugin/cmd/mps-control-daemon/mps.NewDaemons({0xc00018f988?, 0x1?, 0x1?})
	/build/cmd/mps-control-daemon/mps/manager.go:46 +0x79
main.startDaemons(0xc0000b2960?, 0xc00018fb00?)
	/build/cmd/mps-control-daemon/main.go:182 +0x4ff
main.start(0xc000252240?, 0x2?)
	/build/cmd/mps-control-daemon/main.go:116 +0x1fe
main.main.func1(0xc0000b1a40)
	/build/cmd/mps-control-daemon/main.go:56 +0x125
github.com/urfave/cli/v2.(*Command).Run(0xc0001c78c0, 0xc0000b1a40, {0xc000036250, 0x1, 0x1})
	/build/vendor/github.com/urfave/cli/v2/command.go:279 +0xa3c
github.com/urfave/cli/v2.(*App).RunContext(0xc00023e200, {0xdbb6e8?, 0xc00003c070}, {0xc000036250, 0x1, 0x1})
	/build/vendor/github.com/urfave/cli/v2/app.go:337 +0x63a
github.com/urfave/cli/v2.(*App).Run(...)
	/build/vendor/github.com/urfave/cli/v2/app.go:311
main.main()
	/build/cmd/mps-control-daemon/main.go:73 +0x436

@elezar elezar self-assigned this Mar 13, 2024
@klueska
Copy link
Contributor

klueska commented Mar 13, 2024

Why mig-strategy none, instead of mig-strategy single? The none strategy was only added originally as a way to continue running the device plugin as it "always had", avoiding the code paths introduced to handle the single and mixed strategies. I wasn't comfortable defaulting to the single strategy back then, but I probably would be now. In fact, I would almost argue we should deprecate the none strategy, and consider removing it at some stage.

@elezar
Copy link
Member Author

elezar commented Mar 13, 2024

Why mig-strategy none, instead of mig-strategy single? The none strategy was only added originally as a way to continue running the device plugin as it "always had", avoiding the code paths introduced to handle the single and mixed strategies. I wasn't comfortable defaulting to the single strategy back then, but I probably would be now. In fact, I would almost argue we should deprecate the none strategy, and consider removing it at some stage.

Sure, that's a fair point, but doesn't match what the current behaviour is. See

Value: spec.MigStrategyNone,
and
Value: spec.MigStrategyNone,

I don't mind changing the default behaviour, but don't want to do it in this PR since may have other implications.

Given the links above, maybe the correct way to do this though is to also add a mig-strategy command line argument to the mps-control-daemon exectuable and set it to None there.

@elezar elezar force-pushed the fix-mig-strategy-reference branch from f64d879 to 6713641 Compare March 13, 2024 13:18
@klueska
Copy link
Contributor

klueska commented Mar 13, 2024

OK, if the default throughout is still none, then let's stick with that and think through changing it at some later date.

Copy link
Contributor

@tariq1890 tariq1890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM assuming defaulting to migStrategy: node is okay

This change makes the handling of the MIG strategy consistent with
GFD and the device plugin.

This also removes the need to explicitly check for nil values when querying
this value.

Signed-off-by: Evan Lezar <[email protected]>
@elezar elezar force-pushed the fix-mig-strategy-reference branch from 6713641 to 93f161c Compare March 13, 2024 19:07
@elezar elezar requested a review from klueska March 13, 2024 19:09
@elezar
Copy link
Member Author

elezar commented Mar 15, 2024

Confirmed that the value in the config overrides the CLI setting.

@elezar elezar merged commit 32bb029 into NVIDIA:main Mar 15, 2024
6 checks passed
@elezar elezar deleted the fix-mig-strategy-reference branch March 15, 2024 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants