Grafana log monitoring and error alerting example #1463

Merged

21 commits merged on Feb 15, 2024
Changes from 18 commits

Commits (21):
9e7a417
add log monitoring to grafana dfp example
efajardo-nv Jan 10, 2024
ef4bc12
update copyright years
efajardo-nv Jan 10, 2024
51e630f
add dfp logs dashboard screenshot
efajardo-nv Jan 11, 2024
433d1f2
update grafana and loki versions
efajardo-nv Jan 11, 2024
9ac1f80
add error alerting section to readme
efajardo-nv Jan 11, 2024
bd8b018
add loki logging handler snippet to readme
efajardo-nv Jan 12, 2024
fe2e2b5
revert grafana version
efajardo-nv Jan 16, 2024
f2d594c
Merge branch 'branch-24.03' of https://github.com/nv-morpheus/Morpheu…
efajardo-nv Jan 24, 2024
aee5a23
Merge branch 'branch-24.03' of https://github.com/nv-morpheus/Morpheu…
efajardo-nv Feb 8, 2024
b758d89
Merge branch 'branch-24.03' of https://github.com/nv-morpheus/Morpheu…
efajardo-nv Feb 9, 2024
7625ac9
update logger to accept additional handlers
efajardo-nv Feb 9, 2024
d17b3a4
readme update
efajardo-nv Feb 9, 2024
1acfe2a
Merge branch 'branch-24.03' of https://github.com/nv-morpheus/Morpheu…
efajardo-nv Feb 9, 2024
1fd04a6
Merge branch 'branch-24.03' of https://github.com/nv-morpheus/Morpheu…
efajardo-nv Feb 13, 2024
305c05d
update deprecated_stage_warning to accept reason
efajardo-nv Feb 14, 2024
b2569e9
add logger tests
efajardo-nv Feb 14, 2024
3b509dd
assert update and style fixes
efajardo-nv Feb 14, 2024
edde7a9
add logger info to docs
efajardo-nv Feb 14, 2024
67661f8
Merge branch 'branch-24.03' of https://github.com/nv-morpheus/Morpheu…
efajardo-nv Feb 15, 2024
f10df97
remove invalid link from updated doc
efajardo-nv Feb 15, 2024
903901c
fix test
efajardo-nv Feb 15, 2024
13 changes: 13 additions & 0 deletions docs/source/developer_guide/guides/1_simple_python_stage.md
@@ -230,6 +230,19 @@

Before constructing the pipeline, we need to do a bit of environment configuration:
```python
configure_logging(log_level=logging.DEBUG)
```
We use the default configuration with the `DEBUG` logging level. The logger outputs to both the console and a file. The logging handlers are non-blocking: they use a queue to hand log messages off to a separate thread.
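As a rough sketch of that non-blocking pattern, using only the Python standard library (`configure_logging` itself may differ in its details, and the file name here is hypothetical):
```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener

log_queue = queue.Queue()

# Callers only enqueue records, so logging never blocks on console or file I/O.
root = logging.getLogger()
root.setLevel(logging.DEBUG)
root.addHandler(QueueHandler(log_queue))

# A background thread drains the queue into the real handlers.
listener = QueueListener(
    log_queue,
    logging.StreamHandler(),              # console output
    logging.FileHandler("morpheus.log"),  # hypothetical log file name
)
listener.start()

logging.getLogger(__name__).debug("this call returns without waiting on I/O")
```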

We can also use `configure_logging` to add one or more logging handlers to the default configuration. The added handlers will also be non-blocking. The following is adapted from the
[Grafana example](../../../../examples/digital_fingerprinting/production/grafana/README.md), where a [Loki](https://grafana.com/oss/loki/) logging handler is added to also publish Morpheus logs to a Loki log aggregation server (imports and placeholder values are filled in here for completeness):
```python
import logging

import logging_loki

from morpheus.utils.logger import configure_logging

# Placeholder values for illustration; the example defines these itself
loki_url = "http://localhost:3100"
log_level = logging.DEBUG

loki_handler = logging_loki.LokiHandler(
    url=f"{loki_url}/loki/api/v1/push",
    tags={"app": "morpheus"},
    version="1",
)

configure_logging(loki_handler, log_level=log_level)
```

Next, we will build a Morpheus `Config` object. We will cover setting some common configuration parameters in the next guide. For now, it is important to know that we will always need to build a `Config` object:
4 changes: 4 additions & 0 deletions examples/digital_fingerprinting/production/conda_env.yml
@@ -32,3 +32,7 @@ dependencies:
  - nvtabular=23.06
  - papermill
  - s3fs>=2023.6

  ##### Pip Dependencies (keep sorted!) #######
  - pip:
      - python-logging-loki
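If you are not rebuilding the conda environment, the same dependency can also be installed directly with pip:
```bash
pip install python-logging-loki
```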
19 changes: 18 additions & 1 deletion examples/digital_fingerprinting/production/docker-compose.yml
@@ -145,9 +145,26 @@ services:
      - ./grafana/config/dashboards.yaml:/etc/grafana/provisioning/dashboards/dashboards.yaml
      - ./grafana/dashboards/:/var/lib/grafana/dashboards/
      - ./grafana/datasources/:/etc/grafana/provisioning/datasources/
-      - ./morpheus:/workspace
+      - ./grafana:/workspace
    ports:
      - "3000:3000"
    networks:
      - frontend
      - backend
    depends_on:
      - loki

  loki:
    image: grafana/loki:2.9.3
    volumes:
      - ./grafana/config/loki-config.yml:/etc/loki/loki-config.yml
    ports:
      - "3100:3100"
    networks:
      - frontend
      - backend
    restart: unless-stopped
    command: -config.file=/etc/loki/loki-config.yml

networks:
  frontend:
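Once the services are up, one quick way to confirm that the Loki container is accepting traffic is its standard readiness endpoint, exposed on the host through the `3100:3100` port mapping above:
```bash
# Prints "ready" once Loki has finished starting up
curl http://localhost:3100/ready
```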
128 changes: 98 additions & 30 deletions examples/digital_fingerprinting/production/grafana/README.md
@@ -14,24 +14,31 @@
# limitations under the License.
-->

# Using Grafana with Morpheus DFP Pipeline

This example builds on the [Azure DFP pipeline example](../production/README.md) to demonstrate how [Grafana](https://grafana.com/grafana/) can be used for log monitoring, error alerting, and inference results visualization.

## Grafana Configuration

The data sources and dashboards in this example are managed using config files. [Grafana's provisioning system](https://grafana.com/docs/grafana/latest/administration/provisioning/) then uses these files to add the data sources and dashboards to Grafana upon startup.

### Data Sources

Grafana includes built-in support for many data sources. There are also several data sources available that can be installed as plugins. More information about how to manage Grafana data sources can be found [here](https://grafana.com/docs/grafana/latest/datasources/).

The following data sources for this example are configured in [datasources.yaml](./datasources/datasources.yaml):

#### Loki data source

[Loki](https://grafana.com/docs/loki/latest/) is Grafana's log aggregation system. The Loki service is started automatically when the Grafana service starts up. The [Python script for running the DFP pipeline](./run.py) has been updated to configure a logging handler that sends the Morpheus logs to the Loki service.
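For reference, a provisioned Loki data source in a Grafana provisioning file generally looks like the following (a sketch; the exact contents of this example's [datasources.yaml](./datasources/datasources.yaml) may differ):
```yaml
apiVersion: 1

datasources:
  - name: Loki
    type: loki
    access: proxy
    # "loki" is the docker-compose service name from this example
    url: http://loki:3100
```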

#### CSV data source plugin

The [CSV data source plugin](https://grafana.com/grafana/plugins/marcusolsson-csv-datasource/) is installed in Grafana to read the Azure inference results CSV file. This example assumes we are using the CSV file generated from running the Python script for the [Azure DFP pipeline example](../production/README.md).

Please note that the use of the CSV plugin is for demonstration purposes only. Grafana includes support for many data sources more suitable for production deployments. See [here](https://grafana.com/docs/grafana/latest/datasources/) for more information.

#### Updates to grafana.ini

The following is added to the default `grafana.ini` to enable local mode for the CSV data source plugin. This allows the plugin to access files on the local file system.

@@ -40,14 +47,24 @@

```
allow_local_mode = true
```

## Add Loki logging handler to DFP pipeline

The [pipeline run script](./run.py) for the Azure DFP example has been updated with the following to add the Loki logging handler, which publishes the Morpheus logs to our Loki service:

```
loki_handler = logging_loki.LokiHandler(
    url=f"{loki_url}/loki/api/v1/push",
    tags={"app": "morpheus"},
    version="1",
)

configure_logging(loki_handler, log_level=log_level)
```

More information about Loki Python logging can be found [here](https://pypi.org/project/python-logging-loki/).
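Once `configure_logging` has been called with the Loki handler, any standard Python logger in the process publishes through it. As a quick end-to-end check (a hypothetical snippet, not part of the example), you can emit a record and look for it in the `{app="morpheus"}` stream:
```python
import logging

# Assumes configure_logging(loki_handler, ...) has already run in this process.
logging.getLogger("morpheus").error("test error message for Loki")
```
Because `python-logging-loki` labels records with their severity, an `error` record like this would also satisfy the alert query set up later in this README.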

## Build the Morpheus container
From the root of the Morpheus repo:
```bash
./docker/build_container_release.sh
```
@@ -60,45 +77,96 @@

```bash
export MORPHEUS_CONTAINER_VERSION="$(git describe --tags --abbrev=0)-runtime"
docker compose build
```

## Start Grafana and Loki services

To start Grafana and Loki, run the following command on the host in `examples/digital_fingerprinting/production`:
```bash
docker compose up grafana
```

## Run Azure DFP Training

Create a `bash` shell in the `morpheus_pipeline` container:

```bash
docker compose run --rm morpheus_pipeline bash
```

Set the `PYTHONPATH` environment variable to allow importing the production DFP Morpheus stages:
```
export PYTHONPATH=/workspace/examples/digital_fingerprinting/production/morpheus
```

Run the following in the container to train the Azure models.
```bash
cd /workspace/examples/digital_fingerprinting/production/grafana
python run.py --log_level DEBUG --train_users generic --start_time "2022-08-01" --input_file="../../../data/dfp/azure-training-data/AZUREAD_2022*.json"
```

## View DFP Logs Dashboard in Grafana

While the training pipeline is running, you can view Morpheus logs live in a Grafana dashboard at http://localhost:3000/dashboards.

Click on `DFP Logs` in the `General` folder. You may need to expand the `General` folder to see the link.

<img src="./img/dfp_logs_dashboard.png">

This dashboard was provisioned using config files but can also be manually created with the following steps:
1. Click `Dashboards` in the left-side menu.
2. Click `New` and select `New Dashboard`.
3. On the empty dashboard, click `+ Add visualization`.
4. In the dialog box that opens, select the `Loki` data source.
5. In the `Edit Panel` view, change from `Time Series` visualization to `Logs`.
6. Add label filter: `app = morpheus`. (The equivalent LogQL query is shown after these steps.)
7. Change Order to `Oldest first`.
8. Click `Apply` to see your changes applied to the dashboard. Then click the save icon in the dashboard header.
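
The label filter from step 6 is equivalent to the following LogQL stream selector, which can be pasted directly into the panel's query editor:
```
{app="morpheus"}
```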

## Set up Error Alerting

Here we use a simple example to demonstrate how Grafana Alerting can notify us of a pipeline error moments after it occurs. This is especially useful for long-running pipelines.

1. Click `Alert Rules` under `Alerting` in the left-side menu.
2. Click `New Alert Rule`.
3. Enter the alert rule name: `DFP Error Alert Rule`.
4. In the `Define query and alert condition` section, select the `Loki` data source.
5. Switch to `Code` view by clicking the `Code` button on the right.
6. Enter the following Loki query, which counts the number of log lines in the last minute that have the error label (`severity=error`). (A way to sanity-check this query outside Grafana is shown after these steps.)
```
count_over_time({severity="error"}[1m])
```
7. Under `Expressions`, keep the default configurations for `Reduce` and `Threshold`. The alert condition threshold will be error count > 0.
8. In the `Set evaluation behavior` section, click `+ New folder`, enter `morpheus`, then click the `Create` button.
9. Click `+ New evaluation group`, enter `dfp` for `Evaluation group name` and `1m` for `Evaluation interval`, then click the `Create` button.
10. Enter `0s` for `Pending period`. This configures alerts to fire instantly when the alert condition is met.
11. Test your alert rule by running the following in your `morpheus_pipeline` container. This will cause an error because the `--input_file` glob will no longer match any of our training data files.
```
python run.py --log_level DEBUG --train_users generic --start_time "2022-08-01" --input_file="../../../data/dfp/azure-training-data/AZUREAD_2022*.json"
```
12. Click the `Preview` button to test-run the alert rule. You should now see how our alert query picks up the error log, processes it through our reduce/threshold expressions, and satisfies our alert condition. This is indicated by the `Firing` label in the `Threshold` section.

<img src="./img/dfp_error_alert_setup.png">

13. Finally, click `Save rule and exit` at the top right of the page.
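
To sanity-check the alert query from step 6 outside Grafana (as noted in that step), you can hit Loki's HTTP query API directly on the host, assuming the `3100:3100` port mapping from this example's docker-compose file:
```bash
curl -G -s "http://localhost:3100/loki/api/v1/query" \
  --data-urlencode 'query=count_over_time({severity="error"}[1m])'
```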

By default, all alerts will be sent through the `grafana-default-email` contact point. You can add email addresses to this contact point by clicking on `Contact points` under `Alerting` in the left-side menu. You will also have to configure SMTP in the `[smtp]` section of your `grafana.ini`. More information about Grafana Alerting contact points can be found [here](https://grafana.com/docs/grafana/latest/alerting/fundamentals/contact-points/).
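
For reference, an SMTP configuration in `grafana.ini` looks roughly like this (placeholder values only, not part of this example):
```
[smtp]
enabled = true
host = smtp.example.com:587
user = alerts@example.com
password = example-password
from_address = alerts@example.com
from_name = Grafana
```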

## Run Azure DFP Inference

Run the inference pipeline with `filter_threshold=0.0`. This will disable the filtering of the inference results.

```bash
python run.py --log_level DEBUG --train_users none --start_time "2022-08-30" --input_file="../../../data/dfp/azure-inference-data/*.json" --filter_threshold=0.0
```

The inference results will be saved to `dfp_detection_azure.csv` in the directory where the script was run.
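
For a quick spot check of the saved results (optional; assumes `pandas` is available in your environment):
```python
import pandas as pd

df = pd.read_csv("dfp_detection_azure.csv")
print(df.head())        # first few detection rows
print(len(df), "rows")  # total number of detections
```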

## View DFP Detections Dashboard in Grafana

When the inference pipeline completes, you can view visualizations of the inference results at http://localhost:3000/dashboards.

Click on `DFP Detections` in the `General` folder. You may need to expand the `General` folder to see the link.

<img src="./img/dfp_detections_dashboard.png">

The dashboard has the following visualization panels:

grafana.ini

@@ -379,7 +379,7 @@
;token_rotation_interval_minutes = 10

# Set to true to disable (hide) the login form, useful if you use OAuth, defaults to false
-disable_login_form = true
+;disable_login_form = true

# Set to true to disable the sign out link in the side menu. Useful if you use auth.proxy or auth.jwt, defaults to false
;disable_signout_menu = false
examples/digital_fingerprinting/production/grafana/config/loki-config.yml (new file)

@@ -0,0 +1,65 @@
# SPDX-FileCopyrightText: Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093
# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/usagestats/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
# reporting_enabled: false
examples/digital_fingerprinting/production/grafana/dashboards/ (dashboard JSON)

@@ -557,7 +557,7 @@
  },
  "timepicker": {},
  "timezone": "",
-  "title": "DFP_Dashboard",
+  "title": "DFP Detections",
  "uid": "f810d98f-bf31-42d4-98aa-9eb3fa187184",
  "version": 1,
  "weekStart": ""