Merge pull request #2301 from netdata/ingest
Ingest New Documentation
Ancairon authored Dec 24, 2024
2 parents 7863f13 + 3f24055 commit 860caa2
Showing 2 changed files with 93 additions and 26 deletions.
33 changes: 32 additions & 1 deletion docs/collecting-metrics/Message Brokers/NATS.mdx
@@ -67,7 +67,11 @@ The scope defines the instance that the metric belongs to. An instance is unique

These metrics refer to NATS servers.

This scope has no labels.
Labels:

| Label | Description |
|:-----------|:----------------|
| server_id | A unique identifier for a server within the NATS cluster. |

Metrics:

@@ -90,6 +94,7 @@ Labels:

| Label | Description |
|:-----------|:----------------|
| server_id | A unique identifier for a server within the NATS cluster. |
| http_endpoint | HTTP endpoint path. |

Metrics:
@@ -106,6 +111,7 @@ Labels:

| Label | Description |
|:-----------|:----------------|
| server_id | A unique identifier for a server within the NATS cluster. |
| account | Account name. |

Metrics:
@@ -128,6 +134,7 @@ Labels:

| Label | Description |
|:-----------|:----------------|
| server_id | A unique identifier for a server within the NATS cluster. |
| route_id | A unique identifier for a route within the NATS cluster. |
| remote_id | The unique identifier of the remote server connected via the route. |

@@ -147,6 +154,7 @@ Labels:

| Label | Description |
|:-----------|:----------------|
| server_id | A unique identifier for a server within the NATS cluster. |
| gateway | The name of the local gateway. |
| remote_gateway | The name of the remote gateway. |
| cid | A unique identifier for the connection. |
@@ -168,6 +176,7 @@ Labels:

| Label | Description |
|:-----------|:----------------|
| server_id | A unique identifier for a server within the NATS cluster. |
| gateway | The name of the local gateway. |
| remote_gateway | The name of the remote gateway. |
| cid | A unique identifier for the connection. |
@@ -181,6 +190,28 @@ Metrics:
| nats.outbound_gateway_conn_subscriptions | active | subscriptions |
| nats.outbound_gateway_conn_uptime | uptime | seconds |

### Per leaf node connection

These metrics refer to [Leaf Node Connections](https://docs.nats.io/running-a-nats-service/nats_admin/monitoring#leaf-node-information).

Labels:

| Label | Description |
|:-----------|:----------------|
| remote_name | Unique identifier of the remote leaf node server, either its configured name or automatically assigned ID. |
| account | Name of the associated account. |
| ip | IP address of the remote server. |
| port | Port used for the connection to the remote server. |

Metrics:

| Metric | Dimensions | Unit |
|:------|:----------|:----|
| nats.leaf_node_conn_traffic | in, out | bytes/s |
| nats.leaf_node_conn_messages | in, out | messages/s |
| nats.leaf_node_conn_subscriptions | active | subscriptions |
| nats.leaf_node_conn_rtt | rtt | microseconds |
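
If you want to cross-check these values against what the NATS server itself reports, the leaf node data comes from the server's monitoring endpoint. Below is a minimal sketch, assuming the default monitoring port `8222` on a local server and that `jq` is available for pretty-printing:

```shell
# Fetch leaf node connection details from the NATS monitoring endpoint
# (default port 8222 assumed; adjust the host and port to your deployment).
curl -s http://localhost:8222/leafz | jq .
```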



## Alerts
86 changes: 61 additions & 25 deletions docs/netdata-cloud/versions/on-prem/troubleshooting.mdx
@@ -20,59 +20,95 @@ These components should be monitored and managed according to your organization'

## Common Issues

### Installation cannot finish
### Timeout During Installation

If you are getting error like:
If your installation fails with this error:

```
Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart...
[...]
Error: client rate limiter Wait returned an error: Context deadline exceeded.
```

There are probably not enough resources available. Fortunately, it is very easy to verify with the `kubectl` utility. In the case of a full installation, switch the context to the cluster where On-Prem is being installed. For the Light PoC installation, SSH into the Ubuntu VM where `kubectl` is already installed and configured.
This error typically indicates **insufficient cluster resources**. Here's how to diagnose and resolve the issue.

To verify check if there are any `Pending` pods:
#### Diagnosis Steps

```shell
kubectl get pods -n netdata-cloud | grep -v Running
```
> **Important**
>
> - For full installation: Ensure you're in the correct cluster context.
> - For Light PoC: SSH into the Ubuntu VM with `kubectl` pre-configured.
> - For Light PoC, always perform a complete uninstallation before attempting a new installation.
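
If you're unsure which cluster your `kubectl` is pointed at, a quick check before diagnosing anything else (the context name below is a placeholder):

```shell
# Show the context kubectl is currently using
kubectl config current-context

# Switch to the cluster where On-Prem is being installed, if needed
kubectl config use-context <your-onprem-cluster>
```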
To check which resource is a limiting factor pick one of the `Pending` pods and issue command:
1. Check for pods stuck in Pending state:

```shell
kubectl describe pod <POD_NAME> -n netdata-cloud
```
```shell
kubectl get pods -n netdata-cloud | grep -v Running
```

At the end in an `Events` section information about insufficient `CPU` or `Memory` on available nodes should appear.
Please check the minimum requirements for your on-prem installation type or contact our support - `[email protected]`.
2. If you find Pending pods, examine the resource constraints:

> **Warning**
>
> In case of the Light PoC installations always uninstall before the next attempt.
```shell
kubectl describe pod <POD_NAME> -n netdata-cloud
```

Review the Events section at the bottom of the output. Look for messages about:
- Insufficient CPU
- Insufficient Memory
- Node capacity issues

3. View overall cluster resources:

```shell
# Check resource allocation across nodes
kubectl top nodes

# View detailed node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
```
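
Scheduling failures caused by resource shortages are also recorded as cluster events. A hedged way to scan for them, assuming the default `netdata-cloud` namespace:

```shell
# List scheduling failures; "Insufficient cpu" or "Insufficient memory"
# in the message column identifies the limiting resource.
kubectl get events -n netdata-cloud --field-selector reason=FailedScheduling
```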

### Installation finished but login does not work
#### Solution

It depends on the installation and login type, but the underlying problem is usually located in the `values.yaml` file. In the case of Light PoC installations, this is also true, but the installation script fills in the data for the user. We can split the problem into two variants:
1. Compare your available resources against the [minimum requirements](/docs/netdata-cloud/versions/on-prem/installation#system-requirements).
2. Take one of these actions:
- Add more resources to your cluster.
- Free up existing resources.

1. SSO is not working - you need to check your tokens and callback URLs for a given provider. Equally important is the certificate - it needs to be trusted, and also hostname(s) under `global.public` section - make sure that FQDN is correct.
2. Mail login is not working:
1. If you are using a Light PoC installation with MailCatcher, the problem usually appears if the wrong hostname was used during the installation. It needs to be a FQDN that matches the provided certificate. The usual error in such a case points to a invalid token.
2. If the magic link is not arriving for MailCatcher, it's likely because the default values were changed. In the case of using your own mail server, check the `values.yaml` file in the `global.mail.smtp` section and your network settings.
### Login Issues After Installation

If you are getting the error `Something went wrong - invalid token` and you are sure that it is not related to the hostname or the mail configuration as described above, it might be related to a dirty state of Netdata secrets. During the installation, a secret called `netdata-cloud-common` is created. By default, this secret should not be deleted by Helm and is created only if it does not exist. It stores a few strings that are mandatory for Netdata Cloud On-Prem's provisioning and continuous operation. Because they are used to hash the data in the PostgreSQL database, a mismatch will cause data corruption where the old data is not readable and the new data is hashed with the wrong string. Either a new installation is needed, or contact to our support to individually analyze the complexity of the problem.
Installation may complete successfully, but login can still fail due to configuration mismatches. The table below provides a quick reference for troubleshooting the most common causes.

| Issue | Symptoms | Cause | Solution |
|-------------------------------|---------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|
| SSO Login Failure | Unable to authenticate via SSO providers | - Invalid callback URLs<br/>- Expired/invalid SSO tokens<br/>- Untrusted certificates<br/>- Incorrect FQDN in `global.public` | - Update SSO configuration in `values.yaml`<br/>- Verify certificates are valid and trusted<br/>- Ensure FQDN matches certificate |
| MailCatcher Login (Light PoC) | - Magic links not arriving<br/>- "Invalid token" errors | - Incorrect hostname during installation<br/>- Modified default MailCatcher values | - Reinstall with correct FQDN<br/>- Restore default MailCatcher settings<br/>- Ensure hostname matches certificate |
| Custom Mail Server Login | Magic links not arriving | - Incorrect SMTP configuration<br/>- Network connectivity issues | - Update SMTP settings in `values.yaml`<br/>- Verify network allows SMTP traffic<br/>- Check mail server logs |
| Invalid Token Error | "Something went wrong - invalid token" message | - Mismatched `netdata-cloud-common` secret<br/>- Database hash mismatch<br/>- Namespace change without secret migration | - Migrate secret before namespace change<br/>- Perform fresh installation<br/>- Contact support for data recovery |
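
Several of the fixes above boil down to what was actually applied from `values.yaml`. Rather than guessing, you can ask Helm what the deployed release is using; this is only a sketch, and both the release name (`netdata-cloud-onprem`) and namespace are assumptions to adjust for your install:

```shell
# Print the values the release was installed with and focus on the
# hostname and SMTP sections referenced in the table above.
helm get values netdata-cloud-onprem -n netdata-cloud | grep -i -A 10 -E "public|smtp"
```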

> **Warning**
>
> If you are changing the installation namespace secret netdata-cloud-common will be created again. Make sure to transfer it beforehand or wipe postgres before new installation.
> If you're modifying the installation namespace, the `netdata-cloud-common` secret will be recreated.
>
> **Before proceeding**: Back up the existing `netdata-cloud-common` secret. Alternatively, wipe the PostgreSQL database to prevent data conflicts.
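
A minimal sketch of carrying that secret over before a namespace change, assuming the current install lives in `netdata-cloud` (the new namespace is a placeholder):

```shell
# Export the existing secret from the current namespace
kubectl get secret netdata-cloud-common -n netdata-cloud -o yaml > netdata-cloud-common.yaml

# Edit metadata.namespace in the saved manifest to point at the new
# namespace, then recreate the secret there BEFORE installing again,
# so the existing hash strings are reused instead of regenerated.
kubectl apply -f netdata-cloud-common.yaml
```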
### Slow Chart Loading or Chart Errors

When charts take a long time to load or fail with errors, the issue typically stems from data collection challenges. The `charts` service must gather data from multiple Agents within a Room, requiring successful responses from all queried Agents.

| Issue | Symptoms | Cause | Solution |
| -------------------- | --------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|----------------------|-----------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Agent Connectivity | - Queries stall or timeout<br/>- Inconsistent chart loading | Slow Agents or unreliable network connections prevent timely data collection | Deploy additional [Parent](/docs/observability-centralization-points) nodes to provide reliable backends. The system will automatically prefer these for queries when available |
| Kubernetes Resources | - Service throttling<br/>- Slow data processing<br/>- Delayed dashboard updates | Resource saturation at the node level or restrictive container limits | Review and adjust container resource limits and node capacity as needed |
| Database Performance | - Slow query responses<br/>- Increased latency across services | PostgreSQL performance bottlenecks | Monitor and optimize database resource utilization:<br/>- CPU usage<br/>- Memory allocation<br/>- Disk I/O performance |
| Message Broker | - Delayed node status updates (online/offline/stale)<br/>- Slow alert transitions<br/>- Dashboard update delays | Message accumulation in Pulsar due to processing bottlenecks | - Review Pulsar configuration<br/>- Adjust microservice resource allocation<br/>- Monitor message processing rates |
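
As a first pass on the Kubernetes Resources and Database Performance rows, it helps to see which pods are actually consuming the most. A hedged check, assuming the default `netdata-cloud` namespace and a working metrics-server:

```shell
# Spot pods dominating CPU or memory in the installation namespace
kubectl top pods -n netdata-cloud

# Compare a suspect pod's usage against its configured limits
kubectl describe pod <POD_NAME> -n netdata-cloud | grep -A 5 "Limits"
```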

## Need Help?

If issues persist:

1. Gather the following information:

- Installation logs
- Your cluster specifications

2. Contact support at `[email protected]`.
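
A hedged starting point for gathering that information, assuming the default `netdata-cloud` namespace (output file names are placeholders):

```shell
# Capture pod status and node capacity for the support request
kubectl get pods -n netdata-cloud -o wide > pod-status.txt
kubectl describe nodes > node-specs.txt

# Dump logs from every pod in the installation namespace
for pod in $(kubectl get pods -n netdata-cloud -o name); do
  kubectl logs "$pod" -n netdata-cloud --all-containers > "logs-${pod#pod/}.txt"
done
```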
