Merge pull request #2301 from netdata/ingest
Ingest New Documentation
Ancairon authored Dec 24, 2024
2 parents 7863f13 + 3f24055 commit 860caa2
Showing 2 changed files with 93 additions and 26 deletions.
33 changes: 32 additions & 1 deletion docs/collecting-metrics/Message Brokers/NATS.mdx
@@ -67,7 +67,11 @@ The scope defines the instance that the metric belongs to. An instance is unique

These metrics refer to NATS servers.

This scope has no labels.
Labels:

| Label | Description |
|:-----------|:----------------|
| server_id | A unique identifier for a server within the NATS cluster. |

Metrics:

@@ -90,6 +94,7 @@ Labels:

| Label | Description |
|:-----------|:----------------|
| server_id | A unique identifier for a server within the NATS cluster. |
| http_endpoint | HTTP endpoint path. |

Metrics:
@@ -106,6 +111,7 @@ Labels:

| Label | Description |
|:-----------|:----------------|
| server_id | A unique identifier for a server within the NATS cluster. |
| account | Account name. |

Metrics:
@@ -128,6 +134,7 @@ Labels:

| Label | Description |
|:-----------|:----------------|
| server_id | A unique identifier for a server within the NATS cluster. |
| route_id | A unique identifier for a route within the NATS cluster. |
| remote_id | The unique identifier of the remote server connected via the route. |

@@ -147,6 +154,7 @@ Labels:

| Label | Description |
|:-----------|:----------------|
| server_id | A unique identifier for a server within the NATS cluster. |
| gateway | The name of the local gateway. |
| remote_gateway | The name of the remote gateway. |
| cid | A unique identifier for the connection. |
@@ -168,6 +176,7 @@ Labels:

| Label | Description |
|:-----------|:----------------|
| server_id | A unique identifier for a server within the NATS cluster. |
| gateway | The name of the local gateway. |
| remote_gateway | The name of the remote gateway. |
| cid | A unique identifier for the connection. |
@@ -181,6 +190,28 @@ Metrics:
| nats.outbound_gateway_conn_subscriptions | active | subscriptions |
| nats.outbound_gateway_conn_uptime | uptime | seconds |

### Per leaf node connection

These metrics refer to [Leaf Node Connections](https://docs.nats.io/running-a-nats-service/nats_admin/monitoring#leaf-node-information).

Labels:

| Label | Description |
|:-----------|:----------------|
| remote_name | Unique identifier of the remote leaf node server, either its configured name or automatically assigned ID. |
| account | Name of the associated account. |
| ip | IP address of the remote server. |
| port | Port used for the connection to the remote server. |

Metrics:

| Metric | Dimensions | Unit |
|:------|:----------|:----|
| nats.leaf_node_conn_traffic | in, out | bytes/s |
| nats.leaf_node_conn_messages | in, out | messages/s |
| nats.leaf_node_conn_subscriptions | active | subscriptions |
| nats.leaf_node_conn_rtt | rtt | microseconds |
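
If you want to cross-check these values against what the NATS server itself reports, the leaf node data comes from the server's monitoring endpoint. Below is a minimal sketch, assuming the default monitoring port `8222` on a local server and that `jq` is available for pretty-printing:

```shell
# Fetch leaf node connection details from the NATS monitoring endpoint
# (default port 8222 assumed; adjust the host and port to your deployment).
curl -s http://localhost:8222/leafz | jq .
```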



## Alerts
86 changes: 61 additions & 25 deletions docs/netdata-cloud/versions/on-prem/troubleshooting.mdx
@@ -20,59 +20,95 @@ These components should be monitored and managed according to your organization'

## Common Issues

### Installation cannot finish
### Timeout During Installation

If you are getting error like:
If your installation fails with this error:

```
Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart...
[...]
Error: client rate limiter Wait returned an error: Context deadline exceeded.
```

There are probably not enough resources available. Fortunately, it is very easy to verify with the `kubectl` utility. In the case of a full installation, switch the context to the cluster where On-Prem is being installed. For the Light PoC installation, SSH into the Ubuntu VM where `kubectl` is already installed and configured.
This error typically indicates **insufficient cluster resources**. Here's how to diagnose and resolve the issue.

To verify check if there are any `Pending` pods:
#### Diagnosis Steps

```shell
kubectl get pods -n netdata-cloud | grep -v Running
```
> **Important**
>
> - For full installation: Ensure you're in the correct cluster context.
> - For Light PoC: SSH into the Ubuntu VM with `kubectl` pre-configured.
> - For Light PoC, always perform a complete uninstallation before attempting a new installation.
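
If you're unsure which cluster your `kubectl` is pointed at, a quick check before diagnosing anything else (the context name below is a placeholder):

```shell
# Show the context kubectl is currently using
kubectl config current-context

# Switch to the cluster where On-Prem is being installed, if needed
kubectl config use-context <your-onprem-cluster>
```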
To check which resource is a limiting factor pick one of the `Pending` pods and issue command:
1. Check for pods stuck in Pending state:

```shell
kubectl describe pod <POD_NAME> -n netdata-cloud
```
```shell
kubectl get pods -n netdata-cloud | grep -v Running
```

At the end in an `Events` section information about insufficient `CPU` or `Memory` on available nodes should appear.
Please check the minimum requirements for your on-prem installation type or contact our support - `[email protected]`.
2. If you find Pending pods, examine the resource constraints:

> **Warning**
>
> In case of the Light PoC installations always uninstall before the next attempt.
```shell
kubectl describe pod <POD_NAME> -n netdata-cloud
```

Review the Events section at the bottom of the output. Look for messages about:
- Insufficient CPU
- Insufficient Memory
- Node capacity issues

3. View overall cluster resources:

```shell
# Check resource allocation across nodes
kubectl top nodes

# View detailed node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
```
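
Scheduling failures caused by resource shortages are also recorded as cluster events. A hedged way to scan for them, assuming the default `netdata-cloud` namespace:

```shell
# List scheduling failures; "Insufficient cpu" or "Insufficient memory"
# in the message column identifies the limiting resource.
kubectl get events -n netdata-cloud --field-selector reason=FailedScheduling
```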

### Installation finished but login does not work
#### Solution

It depends on the installation and login type, but the underlying problem is usually located in the `values.yaml` file. In the case of Light PoC installations, this is also true, but the installation script fills in the data for the user. We can split the problem into two variants:
1. Compare your available resources against the [minimum requirements](/docs/netdata-cloud/versions/on-prem/installation#system-requirements).
2. Take one of these actions:
- Add more resources to your cluster.
- Free up existing resources.

1. SSO is not working - you need to check your tokens and callback URLs for a given provider. Equally important is the certificate - it needs to be trusted, and also hostname(s) under `global.public` section - make sure that FQDN is correct.
2. Mail login is not working:
1. If you are using a Light PoC installation with MailCatcher, the problem usually appears if the wrong hostname was used during the installation. It needs to be a FQDN that matches the provided certificate. The usual error in such a case points to a invalid token.
2. If the magic link is not arriving for MailCatcher, it's likely because the default values were changed. In the case of using your own mail server, check the `values.yaml` file in the `global.mail.smtp` section and your network settings.
### Login Issues After Installation

If you are getting the error `Something went wrong - invalid token` and you are sure that it is not related to the hostname or the mail configuration as described above, it might be related to a dirty state of Netdata secrets. During the installation, a secret called `netdata-cloud-common` is created. By default, this secret should not be deleted by Helm and is created only if it does not exist. It stores a few strings that are mandatory for Netdata Cloud On-Prem's provisioning and continuous operation. Because they are used to hash the data in the PostgreSQL database, a mismatch will cause data corruption where the old data is not readable and the new data is hashed with the wrong string. Either a new installation is needed, or contact to our support to individually analyze the complexity of the problem.
Installation may complete successfully, but login can still fail due to configuration mismatches. The table below provides a quick reference for troubleshooting the most common causes.

| Issue | Symptoms | Cause | Solution |
|-------------------------------|---------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|
| SSO Login Failure | Unable to authenticate via SSO providers | - Invalid callback URLs<br/>- Expired/invalid SSO tokens<br/>- Untrusted certificates<br/>- Incorrect FQDN in `global.public` | - Update SSO configuration in `values.yaml`<br/>- Verify certificates are valid and trusted<br/>- Ensure FQDN matches certificate |
| MailCatcher Login (Light PoC) | - Magic links not arriving<br/>- "Invalid token" errors | - Incorrect hostname during installation<br/>- Modified default MailCatcher values | - Reinstall with correct FQDN<br/>- Restore default MailCatcher settings<br/>- Ensure hostname matches certificate |
| Custom Mail Server Login | Magic links not arriving | - Incorrect SMTP configuration<br/>- Network connectivity issues | - Update SMTP settings in `values.yaml`<br/>- Verify network allows SMTP traffic<br/>- Check mail server logs |
| Invalid Token Error | "Something went wrong - invalid token" message | - Mismatched `netdata-cloud-common` secret<br/>- Database hash mismatch<br/>- Namespace change without secret migration | - Migrate secret before namespace change<br/>- Perform fresh installation<br/>- Contact support for data recovery |
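
Several of the fixes above boil down to what was actually applied from `values.yaml`. Rather than guessing, you can ask Helm what the deployed release is using; this is only a sketch, and both the release name (`netdata-cloud-onprem`) and namespace are assumptions to adjust for your install:

```shell
# Print the values the release was installed with and focus on the
# hostname and SMTP sections referenced in the table above.
helm get values netdata-cloud-onprem -n netdata-cloud | grep -i -A 10 -E "public|smtp"
```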

> **Warning**
>
> If you are changing the installation namespace secret netdata-cloud-common will be created again. Make sure to transfer it beforehand or wipe postgres before new installation.
> If you're modifying the installation namespace, the `netdata-cloud-common` secret will be recreated.
>
> **Before proceeding**: Back up the existing `netdata-cloud-common` secret. Alternatively, wipe the PostgreSQL database to prevent data conflicts.
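
A minimal sketch of carrying that secret over before a namespace change, assuming the current install lives in `netdata-cloud` (the new namespace is a placeholder):

```shell
# Export the existing secret from the current namespace
kubectl get secret netdata-cloud-common -n netdata-cloud -o yaml > netdata-cloud-common.yaml

# Edit metadata.namespace in the saved manifest to point at the new
# namespace, then recreate the secret there BEFORE installing again,
# so the existing hash strings are reused instead of regenerated.
kubectl apply -f netdata-cloud-common.yaml
```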
### Slow Chart Loading or Chart Errors

When charts take a long time to load or fail with errors, the issue typically stems from data collection challenges. The `charts` service must gather data from multiple Agents within a Room, requiring successful responses from all queried Agents.

| Issue | Symptoms | Cause | Solution |
| -------------------- | --------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|----------------------|-----------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Agent Connectivity | - Queries stall or timeout<br/>- Inconsistent chart loading | Slow Agents or unreliable network connections prevent timely data collection | Deploy additional [Parent](/docs/observability-centralization-points) nodes to provide reliable backends. The system will automatically prefer these for queries when available |
| Kubernetes Resources | - Service throttling<br/>- Slow data processing<br/>- Delayed dashboard updates | Resource saturation at the node level or restrictive container limits | Review and adjust container resource limits and node capacity as needed |
| Database Performance | - Slow query responses<br/>- Increased latency across services | PostgreSQL performance bottlenecks | Monitor and optimize database resource utilization:<br/>- CPU usage<br/>- Memory allocation<br/>- Disk I/O performance |
| Message Broker | - Delayed node status updates (online/offline/stale)<br/>- Slow alert transitions<br/>- Dashboard update delays | Message accumulation in Pulsar due to processing bottlenecks | - Review Pulsar configuration<br/>- Adjust microservice resource allocation<br/>- Monitor message processing rates |
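
As a first pass on the Kubernetes Resources and Database Performance rows, it helps to see which pods are actually consuming the most. A hedged check, assuming the default `netdata-cloud` namespace and a working metrics-server:

```shell
# Spot pods dominating CPU or memory in the installation namespace
kubectl top pods -n netdata-cloud

# Compare a suspect pod's usage against its configured limits
kubectl describe pod <POD_NAME> -n netdata-cloud | grep -A 5 "Limits"
```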

## Need Help?

If issues persist:

1. Gather the following information:

- Installation logs
- Your cluster specifications

2. Contact support at `[email protected]`.
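
A hedged starting point for gathering that information, assuming the default `netdata-cloud` namespace (output file names are placeholders):

```shell
# Capture pod status and node capacity for the support request
kubectl get pods -n netdata-cloud -o wide > pod-status.txt
kubectl describe nodes > node-specs.txt

# Dump logs from every pod in the installation namespace
for pod in $(kubectl get pods -n netdata-cloud -o name); do
  kubectl logs "$pod" -n netdata-cloud --all-containers > "logs-${pod#pod/}.txt"
done
```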
