Horizon Process Monitoring

Dave Wong edited this page Dec 16, 2020 · 2 revisions
  • SSH into the worker you want to monitor
  • Create the configuration file /etc/datadog-agent/conf.d/process.d/conf.yaml with the contents below
    • You need sudo to create this file
## All options defined here are available to all instances.
#
init_config:

    ## @param pid_cache_duration - integer - optional - default: 120
    ## Changes the check refresh rate of the matching pid list every X seconds except if it
    ## detects a change before. You might want to set it low if you want to
    ## alert on process service checks.
    #
    # pid_cache_duration: 120

    ## @param access_denied_cache_duration - integer - optional - default: 120
    ## The check maintains a list of PIDs for which it got access denied. It won't try to look at them again for the
    ## duration in seconds specified by access_denied_cache_duration.
    #
    # access_denied_cache_duration: 120

    ## @param shared_process_list_cache_duration - integer - optional - default: 120
    ## The check maintains a list of running processes shared among all instances, that is used to generate the
    ## matching pid list on each instance. It won't try to look at them again for the duration in seconds
    ## specified by shared_process_list_cache_duration.
    #
    # shared_process_list_cache_duration: 120

    ## @param procfs_path - string - optional
    ## Used to override the default procfs path, e.g. for docker containers with the outside fs mounted at /host/proc
    ## DEPRECATED: please specify `procfs_path` globally in `datadog.conf` instead
    #
    # procfs_path: /proc

    ## @param service - string - optional
    ## Attach the tag `service:<SERVICE>` to every metric, event, and service check emitted by this integration.
    ##
    ## Additionally, this sets the default `service` for every log source.
    #
    # service: <SERVICE>

## Every instance is scheduled independent of the others.
#
instances:

    ## @param name - string - required
    ## Used to uniquely identify your metrics as they are tagged with this name in Datadog.
    #
  - name: Horizon

    ## @param search_string - list of strings - optional
    ## If one of the elements in the list matches, it returns the count of
    ## all the processes that match the string exactly by default. Change this behavior with the
    ## parameter `exact_match: false`.
    ##
    ## Note: One and only one of search_string, pid or pid_file must be specified per instance.
    #
    search_string:
       - horizon

    ## @param exact_match - boolean - optional - default: true
    ## Matches your search_string on proc.name().
    ## If you want to match on a substring within proc.cmdline(), set this to false
    ## Regex is also supported when this flag is set to `false`.
    ##
    ## Note: agent v6.11+ on Windows runs as an unprivileged `ddagentuser` that does not have access to the full
    ## command line of processes running under a different user. This option cannot be used in such cases.
    ## https://docs.datadoghq.com/integrations/process/#configuration
    #
    exact_match: false

    ## @param thresholds - mapping - optional
    ## The threshold parameter is composed of two ranges: critical and warning
    ##   * warning: (optional) List of two values: If the number of processes found is below the first value or
    ##              above the second one, the process check returns WARNING. To make a semi-unbounded interval,
    ##              use `.inf` for the upper bound.
    ##   * critical: (optional) List of two values: If the number of processes found is below the first value or
    ##               above the second one, the process check returns CRITICAL. To make a semi-unbounded interval,
    ##               use `.inf` for the upper bound.
    #
    # thresholds:
    #   warning:
    #   - <BELOW_VALUE>
    #   - <TOP_VALUE>
    #   critical:
    #   - <BELOW_VALUE>
    #   - <TOP_VALUE>

    ## @param collect_children - boolean - optional - default: false
    ## If true, the check also collects metrics from all child processes of a matched process.
    ## Please be aware that the collection is recursive, and might take some time depending on the use case.
    #
    # collect_children: false

    ## @param tags - list of strings - optional
    ## A list of tags to attach to every metric and service check emitted by this instance.
    ##
    ## Learn more about tagging at https://docs.datadoghq.com/tagging
    #
    # tags:
    #   - <KEY_1>:<VALUE_1>
    #   - <KEY_2>:<VALUE_2>

    ## @param service - string - optional
    ## Attach the tag `service:<SERVICE>` to every metric, event, and service check emitted by this integration.
    ##
    ## Overrides any `service` defined in the `init_config` section.
    #
    # service: <SERVICE>

    ## @param min_collection_interval - number - optional - default: 15
    ## This changes the collection interval of the check. For more information, see:
    ## https://docs.datadoghq.com/developers/write_agent_check/#collection-interval
    #
    min_collection_interval: 60
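
Most of the file above is commented-out defaults. As a rough sketch, the settings it actually enables reduce to the few lines below. The snippet writes them to a scratch file for review; the scratch path and the `out` variable are assumptions for illustration, not part of the original setup.

```shell
# Write only the active (uncommented) settings to a scratch file for review.
# The scratch path is an assumption; the real file lives at
# /etc/datadog-agent/conf.d/process.d/conf.yaml and needs sudo to create.
out="${TMPDIR:-/tmp}/process-conf.yaml"

cat > "$out" <<'EOF'
init_config:

instances:
  - name: Horizon
    search_string:
      - horizon
    exact_match: false
    min_collection_interval: 60
EOF

# Print the result so it can be compared with the full file above.
cat "$out"
```

Once it looks right, the scratch file can be copied into place with sudo, for example `sudo cp "$out" /etc/datadog-agent/conf.d/process.d/conf.yaml`.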
  • Restart the Datadog agent: sudo service datadog-agent restart
  • In Datadog, go to the Soapbox API - DO dashboard
  • Copy and paste one of the soapbox-worker-1XX Horizon widgets
  • Edit the new widget (click the pencil icon)
  • In the "reported by" dropdown, look for the new worker
    • Sometimes it takes a few minutes for the new monitor to register on Datadog
    • If you still don't see it after a few minutes, check the agent log on the server for errors: tail -f /var/log/datadog/agent.log
    • To force the agent to rerun all checks, restart it and then tail the log
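
If the worker still does not show up, it can help to check the host directly before waiting on the dashboard. The sketch below assumes a standard Linux install of Agent v6 or v7 running as the `dd-agent` user; the `pattern` variable is a hypothetical helper that mirrors the `search_string` value from the config. It first looks for a matching process, then runs the process check once and prints the agent status.

```shell
# Same value as search_string in conf.yaml (assumed to be "horizon").
pattern="horizon"

# 1. Confirm that something on the host would match the search string.
#    With exact_match: false the check matches on the command line,
#    which is roughly what pgrep -f does.
if command -v pgrep >/dev/null 2>&1; then
    pgrep -fa -- "$pattern" || echo "no running process matches '$pattern'"
fi

# 2. Run the process check once and show the agent's status.
#    Skipped when the agent binary is not on PATH.
if command -v datadog-agent >/dev/null 2>&1; then
    sudo -u dd-agent -- datadog-agent check process
    sudo datadog-agent status
else
    echo "datadog-agent not found on PATH"
fi
```

If the first step finds nothing, the check has no process to report, so the problem is on the Horizon side and not in the agent configuration.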