From 3f240559ac27196ac435ab2449b77d8faffaa887 Mon Sep 17 00:00:00 2001 From: netdatabot <43409846+netdatabot@users.noreply.github.com> Date: Tue, 24 Dec 2024 11:33:45 +0000 Subject: [PATCH] Ingest new documentation --- .../Message Brokers/NATS.mdx | 33 ++++++- .../versions/on-prem/troubleshooting.mdx | 86 +++++++++++++------ 2 files changed, 93 insertions(+), 26 deletions(-) diff --git a/docs/collecting-metrics/Message Brokers/NATS.mdx b/docs/collecting-metrics/Message Brokers/NATS.mdx index fe4daa336..57d2c05ee 100644 --- a/docs/collecting-metrics/Message Brokers/NATS.mdx +++ b/docs/collecting-metrics/Message Brokers/NATS.mdx @@ -67,7 +67,11 @@ The scope defines the instance that the metric belongs to. An instance is unique These metrics refer to NATS servers. -This scope has no labels. +Labels: + +| Label | Description | +|:-----------|:----------------| +| server_id | A unique identifier for a server within the NATS cluster. | Metrics: @@ -90,6 +94,7 @@ Labels: | Label | Description | |:-----------|:----------------| +| server_id | A unique identifier for a server within the NATS cluster. | | http_endpoint | HTTP endpoint path. | Metrics: @@ -106,6 +111,7 @@ Labels: | Label | Description | |:-----------|:----------------| +| server_id | A unique identifier for a server within the NATS cluster. | | account | Account name. | Metrics: @@ -128,6 +134,7 @@ Labels: | Label | Description | |:-----------|:----------------| +| server_id | A unique identifier for a server within the NATS cluster. | | route_id | A unique identifier for a route within the NATS cluster. | | remote_id | he unique identifier of the remote server connected via the route. | @@ -147,6 +154,7 @@ Labels: | Label | Description | |:-----------|:----------------| +| server_id | A unique identifier for a server within the NATS cluster. | | gateway | The name of the local gateway. | | remote_gateway | The name of the remote gateway. | | cid | A unique identifier for the connection. | @@ -168,6 +176,7 @@ Labels: | Label | Description | |:-----------|:----------------| +| server_id | A unique identifier for a server within the NATS cluster. | | gateway | The name of the local gateway. | | remote_gateway | The name of the remote gateway. | | cid | A unique identifier for the connection. | @@ -181,6 +190,28 @@ Metrics: | nats.outbound_gateway_conn_subscriptions | active | subscriptions | | nats.outbound_gateway_conn_uptime | uptime | seconds | +### Per leaf node connection + +These metrics refer to [Leaf Node Connections](https://docs.nats.io/running-a-nats-service/nats_admin/monitoring#leaf-node-information). + +Labels: + +| Label | Description | +|:-----------|:----------------| +| remote_name | Unique identifier of the remote leaf node server, either its configured name or automatically assigned ID. | +| account | Name of the associated account. | +| ip | IP address of the remote server. | +| port | Port used for the connection to the remote server. 
| + +Metrics: + +| Metric | Dimensions | Unit | +|:------|:----------|:----| +| nats.leaf_node_conn_traffic | in, out | bytes/s | +| nats.leaf_node_conn_messages | in, out | messages/s | +| nats.leaf_node_conn_subscriptions | active | subscriptions | +| nats.leaf_node_conn_rtt | rtt | microseconds | + ## Alerts diff --git a/docs/netdata-cloud/versions/on-prem/troubleshooting.mdx b/docs/netdata-cloud/versions/on-prem/troubleshooting.mdx index fc641a110..6a10d5045 100644 --- a/docs/netdata-cloud/versions/on-prem/troubleshooting.mdx +++ b/docs/netdata-cloud/versions/on-prem/troubleshooting.mdx @@ -20,9 +20,9 @@ These components should be monitored and managed according to your organization' ## Common Issues -### Installation cannot finish +### Timeout During Installation -If you are getting error like: +If your installation fails with this error: ``` Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart... @@ -30,49 +30,85 @@ Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart... Error: client rate limiter Wait returned an error: Context deadline exceeded. ``` -There are probably not enough resources available. Fortunately, it is very easy to verify with the `kubectl` utility. In the case of a full installation, switch the context to the cluster where On-Prem is being installed. For the Light PoC installation, SSH into the Ubuntu VM where `kubectl` is already installed and configured. +This error typically indicates **insufficient cluster resources**. Here's how to diagnose and resolve the issue. -To verify check if there are any `Pending` pods: +#### Diagnosis Steps -```shell -kubectl get pods -n netdata-cloud | grep -v Running -``` +> **Important** +> +> - For full installation: Ensure you're in the correct cluster context. +> - For Light PoC: SSH into the Ubuntu VM with `kubectl` pre-configured. +> - For Light PoC, always perform a complete uninstallation before attempting a new installation. -To check which resource is a limiting factor pick one of the `Pending` pods and issue command: +1. Check for pods stuck in Pending state: -```shell -kubectl describe pod -n netdata-cloud -``` + ```shell + kubectl get pods -n netdata-cloud | grep -v Running + ``` -At the end in an `Events` section information about insufficient `CPU` or `Memory` on available nodes should appear. -Please check the minimum requirements for your on-prem installation type or contact our support - `support@netdata.cloud`. +2. If you find Pending pods, examine the resource constraints: -> **Warning** -> -> In case of the Light PoC installations always uninstall before the next attempt. + ```shell + kubectl describe pod -n netdata-cloud + ``` + + Review the Events section at the bottom of the output. Look for messages about: + - Insufficient CPU + - Insufficient Memory + - Node capacity issues + +3. View overall cluster resources: + + ```shell + # Check resource allocation across nodes + kubectl top nodes + + # View detailed node capacity + kubectl describe nodes | grep -A 5 "Allocated resources" + ``` -### Installation finished but login does not work +#### Solution -It depends on the installation and login type, but the underlying problem is usually located in the `values.yaml` file. In the case of Light PoC installations, this is also true, but the installation script fills in the data for the user. We can split the problem into two variants: +1. 
Compare your available resources against the [minimum requirements](/docs/netdata-cloud/versions/on-prem/installation#system-requirements). +2. Take one of these actions: + - Add more resources to your cluster. + - Free up existing resources. -1. SSO is not working - you need to check your tokens and callback URLs for a given provider. Equally important is the certificate - it needs to be trusted, and also hostname(s) under `global.public` section - make sure that FQDN is correct. -2. Mail login is not working: - 1. If you are using a Light PoC installation with MailCatcher, the problem usually appears if the wrong hostname was used during the installation. It needs to be a FQDN that matches the provided certificate. The usual error in such a case points to a invalid token. - 2. If the magic link is not arriving for MailCatcher, it's likely because the default values were changed. In the case of using your own mail server, check the `values.yaml` file in the `global.mail.smtp` section and your network settings. +### Login Issues After Installation -If you are getting the error `Something went wrong - invalid token` and you are sure that it is not related to the hostname or the mail configuration as described above, it might be related to a dirty state of Netdata secrets. During the installation, a secret called `netdata-cloud-common` is created. By default, this secret should not be deleted by Helm and is created only if it does not exist. It stores a few strings that are mandatory for Netdata Cloud On-Prem's provisioning and continuous operation. Because they are used to hash the data in the PostgreSQL database, a mismatch will cause data corruption where the old data is not readable and the new data is hashed with the wrong string. Either a new installation is needed, or contact to our support to individually analyze the complexity of the problem. +Installation may complete successfully, but login issues can occur due to configuration mismatches. This table provides a quick reference for troubleshooting common login issues after installation. + +| Issue | Symptoms | Cause | Solution | +|-------------------------------|---------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------| +| SSO Login Failure | Unable to authenticate via SSO providers | - Invalid callback URLs
- Expired/invalid SSO tokens
- Untrusted certificates
- Incorrect FQDN in `global.public` | - Update SSO configuration in `values.yaml`
- Verify certificates are valid and trusted
- Ensure FQDN matches certificate | +| MailCatcher Login (Light PoC) | - Magic links not arriving
- "Invalid token" errors | - Incorrect hostname during installation
- Modified default MailCatcher values | - Reinstall with correct FQDN
- Restore default MailCatcher settings
- Ensure hostname matches certificate | +| Custom Mail Server Login | Magic links not arriving | - Incorrect SMTP configuration
- Network connectivity issues | - Update SMTP settings in `values.yaml`
- Verify network allows SMTP traffic
- Check mail server logs | +| Invalid Token Error | "Something went wrong - invalid token" message | - Mismatched `netdata-cloud-common` secret
- Database hash mismatch
- Namespace change without secret migration | - Migrate secret before namespace change
- Perform fresh installation
- Contact support for data recovery | > **Warning** > -> If you are changing the installation namespace secret netdata-cloud-common will be created again. Make sure to transfer it beforehand or wipe postgres before new installation. +> If you're modifying the installation namespace, the `netdata-cloud-common` secret will be recreated. +> +> **Before proceeding**: Back up the existing `netdata-cloud-common` secret. Alternatively, wipe the PostgreSQL database to prevent data conflicts. ### Slow Chart Loading or Chart Errors When charts take a long time to load or fail with errors, the issue typically stems from data collection challenges. The `charts` service must gather data from multiple Agents within a Room, requiring successful responses from all queried Agents. | Issue | Symptoms | Cause | Solution | -| -------------------- | --------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|----------------------|-----------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | Agent Connectivity | - Queries stall or timeout
- Inconsistent chart loading | Slow Agents or unreliable network connections prevent timely data collection | Deploy additional [Parent](/docs/observability-centralization-points) nodes to provide reliable backends. The system will automatically prefer these for queries when available | | Kubernetes Resources | - Service throttling
- Slow data processing
- Delayed dashboard updates | Resource saturation at the node level or restrictive container limits | Review and adjust container resource limits and node capacity as needed | | Database Performance | - Slow query responses
- Increased latency across services | PostgreSQL performance bottlenecks | Monitor and optimize database resource utilization:
- CPU usage
- Memory allocation
- Disk I/O performance | | Message Broker | - Delayed node status updates (online/offline/stale)
- Slow alert transitions
- Dashboard update delays | Message accumulation in Pulsar due to processing bottlenecks | - Review Pulsar configuration
- Adjust microservice resource allocation
- Monitor message processing rates | + +## Need Help? + +If issues persist: + +1. Gather the following information: + + - Installation logs + - Your cluster specifications + +2. Contact support at `support@netdata.cloud`.
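
For reference, here is a minimal sketch of how that information could be gathered with `kubectl` (assuming the default `netdata-cloud` namespace used elsewhere in this guide; the output file names are only illustrative):

```shell
# Pod status and recent events from the installation namespace
kubectl get pods -n netdata-cloud -o wide > pods.txt
kubectl get events -n netdata-cloud --sort-by=.lastTimestamp > events.txt

# Logs from every pod in the namespace
for pod in $(kubectl get pods -n netdata-cloud -o name); do
  kubectl logs -n netdata-cloud "$pod" --all-containers > "logs-${pod##*/}.txt"
done

# Node capacity, for comparison against the minimum requirements
kubectl describe nodes > nodes.txt
```

Attaching these files to your support request usually speeds up the investigation.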