Add gpu stats endpoint for internal monitoring #109

Redm4x · 2024-02-22T20:37:15Z

Added two API endpoints for internal monitoring (#106):

`https://api.cloudmos.io/internal/gpu`

Returns a summary of the GPUs on the network.

Example Response

{
  "gpus": {
    "total": {
      "allocatable": 2,
      "allocated": 0
    },
    "details": {
      "nvidia": [
        {
          "model": "t4",
          "ram": "16Gi",
          "interface": "PCIe",
          "allocatable": 1,
          "allocated": 0
        },
        {
          "model": "rtx3060ti",
          "ram": "8Gi",
          "interface": "PCIe",
          "allocatable": 1,
          "allocated": 0
        }
      ]
    }
  }
}
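
In the example response above, the `total` counts equal the sums over the `details` entries. A minimal sketch of that cross-check, assuming this relationship always holds (the example suggests it, but it is not stated here); `sum_details` is a hypothetical helper and not part of this PR:

```python
import json

# Example response copied from the description above.
sample = json.loads("""
{
  "gpus": {
    "total": {"allocatable": 2, "allocated": 0},
    "details": {
      "nvidia": [
        {"model": "t4", "ram": "16Gi", "interface": "PCIe",
         "allocatable": 1, "allocated": 0},
        {"model": "rtx3060ti", "ram": "8Gi", "interface": "PCIe",
         "allocatable": 1, "allocated": 0}
      ]
    }
  }
}
""")

def sum_details(response):
    """Sum allocatable/allocated over every vendor and model in `details`."""
    allocatable = allocated = 0
    for models in response["gpus"]["details"].values():
        for gpu in models:
            allocatable += gpu["allocatable"]
            allocated += gpu["allocated"]
    return {"allocatable": allocatable, "allocated": allocated}

totals = sum_details(sample)
# Assumption: `total` is the sum of the per-model entries.
print(totals == sample["gpus"]["total"])
```

A monitoring script could run a check like this to flag a response whose totals and details disagree.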

Query parameters for filtering

| Param | Description |
| --- | --- |
| provider | Either a provider address (ex: `akash175llqyjvxfle9qwt740vm46772dzaznpzgm576`) or a host URI (ex: `https://provider.akashprovid.com:8443`) |
| vendor | Ex: `nvidia` |
| model | Ex: `t4` |
| memory_size | Ex: `16Gi` |

All query parameters can be combined, ex:
https://api.cloudmos.io/internal/gpu?provider=akash175llqyjvxfle9qwt740vm46772dzaznpzgm576&vendor=nvidia&model=rtx3060ti&memory_size=8Gi
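
The combined URL above can also be assembled in code rather than by hand. A minimal sketch, assuming the endpoint accepts standard URL-encoded query strings; `build_gpu_url` is a hypothetical helper and not part of this PR:

```python
from urllib.parse import urlencode

BASE_URL = "https://api.cloudmos.io/internal/gpu"

def build_gpu_url(provider=None, vendor=None, model=None, memory_size=None):
    """Return the GPU stats URL with only the filters that were supplied."""
    filters = {
        "provider": provider,
        "vendor": vendor,
        "model": model,
        "memory_size": memory_size,
    }
    # Drop unset filters so they do not show up as empty query parameters.
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{BASE_URL}?{query}" if query else BASE_URL

url = build_gpu_url(
    provider="akash175llqyjvxfle9qwt740vm46772dzaznpzgm576",
    vendor="nvidia",
    model="rtx3060ti",
    memory_size="8Gi",
)
print(url)
```

`urlencode` percent-encodes reserved characters, so a host URI passed as `provider` has its `:` and `/` escaped in the resulting query string.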

`https://api.cloudmos.io/internal/provider-versions`

Returns a list of versions and the providers that are currently on each version. The `<UNKNOWN>` version corresponds to providers whose version could not be determined. The `/version` endpoint was broken for a long time, but is now fixed in v0.5.0-rc11.

Example Response

{
  "0.5.0-rc16": {
    "version": "0.5.0-rc16",
    "count": 4,
    "ratio": 0.05,
    "providers": [
      "https://provider.moonbys.cloud:8443",
      "https://provider.akashprovid.com:8443",
      "https://provider.akashtesting.xyz:8443"
    ]
  },
  "0.5.0-rc15": {
    "version": "0.5.0-rc15",
    "count": 4,
    "ratio": 0.05,
    "providers": [
      "https://provider.akash.pro:8443"
    ]
  },
  "<UNKNOWN>": {
    "version": "<UNKNOWN>",
    "count": 80,
    "ratio": 0.95,
    "providers": [
      "https://provider.macptrading.com:8443",
      "https://provider.digitaler-friedhof.com:8443",
      "https://provider.qioi.io:8443",
      "https://provider.bluepeer.io:8443",
      ...
    ]
  }
}
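
For monitoring, the response can be reduced to a ranked list of versions plus the providers that still report `<UNKNOWN>`. A minimal sketch, assuming the response shape shown above; the sample is abridged from that example, and `versions_by_count` and `undetermined_providers` are hypothetical helpers, not part of this PR:

```python
# Abridged from the example response above (provider lists shortened).
sample = {
    "0.5.0-rc16": {
        "version": "0.5.0-rc16",
        "count": 4,
        "ratio": 0.05,
        "providers": [
            "https://provider.moonbys.cloud:8443",
            "https://provider.akashprovid.com:8443",
            "https://provider.akashtesting.xyz:8443",
        ],
    },
    "<UNKNOWN>": {
        "version": "<UNKNOWN>",
        "count": 80,
        "ratio": 0.95,
        "providers": [
            "https://provider.macptrading.com:8443",
            "https://provider.digitaler-friedhof.com:8443",
        ],
    },
}

def versions_by_count(response):
    """Return (version, count) pairs, most common version first."""
    pairs = [(entry["version"], entry["count"]) for entry in response.values()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

def undetermined_providers(response):
    """Return the providers whose version could not be determined."""
    return response.get("<UNKNOWN>", {}).get("providers", [])

print(versions_by_count(sample))
print(undetermined_providers(sample))
```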

Redm4x added 4 commits February 21, 2024 15:58

Add internal router

f6e7444

Basic implementation based on spec

8c3de01

Add query params for filtering

97b6dcd

Fix query

c437f60

Redm4x mentioned this pull request Feb 22, 2024

Implement GPU endpoints with provider discovery #106

Closed

Redm4x added 2 commits February 22, 2024 16:02

Add documentation

a907f74

.

8872ef7

baktun14 approved these changes Feb 22, 2024


Fix typo in filename

d1c1839

Redm4x marked this pull request as ready for review February 24, 2024 17:56

Redm4x merged commit db887c0 into main Feb 24, 2024
5 checks passed

Redm4x deleted the features/add-gpu-stats-endpoint branch February 24, 2024 17:58
