NanoMon is a lightweight network and HTTP monitoring system, designed to be self-hosted on any container-based system, e.g. Kubernetes, or just run locally. It is written in Go and based on the now ubiquitous microservices pattern, so it is decomposed into several discrete but interlinked components. The features of NanoMon include:
- A range of configurable monitor types
- Web frontend for viewing results & editing/creating monitors
- Email alerting
- Range of deployment options
- Rules for setting monitor status and evaluating results
- OAuth2 based user sign-in and authentication
- Exporting of metrics & data to Prometheus
It also serves as a reference & learning app for microservices and is used by my Kubernetes workshop as the workload & application deployed in order to demonstrate Kubernetes concepts.
In a hurry? - Jump to the sections running locally quick start or deploying with Helm
The architecture is a fairly standard design, consisting of four application components and a database.
- API - API provides the main interface for the frontend and any custom clients. It is RESTful and runs over HTTP(S). It connects directly to the MongoDB database.
- Runner - Monitor runs are executed from here (see concepts below). It connects directly to the MongoDB database, and reads monitor configuration data, and saves back & stores result data.
- Frontend - The web interface is a SPA (single page application), consisting of a static set of HTML, JS etc. which executes in the user's browser. It connects directly to the API, and is developed using [Alpine.js](https://alpinejs.dev/).
- Frontend Host - The static content host for the frontend app, which contains no business logic. This simply serves the frontend application's HTML, JS and CSS files over HTTP. In addition it exposes a small configuration endpoint.
- MongoDB - Backend data store, this is a vanilla instance of MongoDB. Cloud and hosted services which provide MongoDB compatibility (e.g. Azure Cosmos DB) also work
NanoMon executes monitoring calls remotely over the network using standard protocols; it does this periodically, on a set interval per monitor. The result of each "run" is validated to determine its status. There are currently three statuses:
- OK – Indicates no problems, e.g. got a valid HTTP response.
- Error – Partial success as one or more rules failed, e.g. HTTP status code wasn't the expected value. See rules below.
- Failed – The monitor failed to run entirely e.g. connection, network or DNS failure.
A monitor represents an instance of a given monitor type (see below) with its associated configuration. Common properties of all monitors include the interval on which they are run, and the target. The target is type dependent but is typically a hostname or URL.
When a monitor runs it generates a result. The result, as the name implies, holds the results of a run of a monitor, such as the timestamp, status, message and a value. The value of a result is dependent on the type of monitor, however it most commonly represents the duration of the network request in milliseconds.
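The concepts above could be modelled roughly as follows. This is a minimal sketch rather than NanoMon's actual source: the type and field names (`Monitor`, `Result`, `Status`) are assumptions chosen for illustration, and the numeric status values simply follow the 0 = OK, 1 = Error, 2 = Failed mapping described in the Prometheus section.

```go
package main

import (
	"fmt"
	"time"
)

// Status is the outcome of a single monitor run.
type Status int

const (
	StatusOK     Status = 0 // no problems
	StatusError  Status = 1 // the run worked, but a rule failed
	StatusFailed Status = 2 // the monitor could not run at all
)

// String returns the human readable name of a status.
func (s Status) String() string {
	switch s {
	case StatusOK:
		return "OK"
	case StatusError:
		return "Error"
	case StatusFailed:
		return "Failed"
	}
	return "Unknown"
}

// Monitor is a hypothetical shape for a configured monitor.
type Monitor struct {
	Name       string
	Type       string            // e.g. "http", "ping" or "tcp"
	Target     string            // hostname or URL, depending on Type
	Interval   time.Duration     // how often the monitor is run
	Rule       string            // optional rule expression
	Properties map[string]string // type specific settings
}

// Result is a hypothetical shape for the outcome of one run.
type Result struct {
	Date    time.Time
	Status  Status
	Message string
	Value   float64 // most commonly the request duration in milliseconds
}

func main() {
	m := Monitor{Name: "Example", Type: "http", Target: "https://example.net", Interval: 30 * time.Second}
	r := Result{Date: time.Now(), Status: StatusOK, Value: 178}
	fmt.Printf("%s (%s): %s, %.0fms\n", m.Name, m.Type, r.Status, r.Value)
}
```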
There are four types of monitor currently supported:
- HTTP – Makes HTTP(S) requests to a given URL and measures the response time.
- Ping – Carries out an ICMP ping to the target hostname or IP address.
- TCP – Attempts to create a TCP socket connection to the given hostname and port.
- DNS – Looks up DNS records for the given domain or hostname.
For more details see the complete monitor reference
📂
├── api - API reference and specifications, using TypeSpec
├── build - Dockerfiles and supporting build artifacts
├── deploy
│ ├── azure - Deploy to Azure using Bicep
│ ├── helm - Helm chart to deploy NanoMon
│ └── kubernetes - Example Kubernetes manifests (No Helm)
├── etc - Misc stuff :)
├── frontend - The HTML/JS source for the frontend app
├── scripts - Supporting helper bash scripts
├── services
│ ├── api - Go source for the API service
│ ├── common - Shared internal Go code
│ ├── frontend - Go source for the frontend host server
│ └── runner - Go source for the runner
├── templates - Email alert template used by runner
└── tests - Integration and performance tests
Here are the most common options for quickly getting started running locally, or deploying to the cloud or Kubernetes.
Pre-reqs:
- Linux system like Ubuntu (WSL2 was used for development), macOS might work 🤷‍♂️
- Docker engine & CLI
- Go SDK & toolchain
- Just task runner and make replacement
  - If you're not keen on installing another binary on your system or into your PATH, run `./scripts/install-just.sh` which puts the binary into a local project folder (`.tools/`), then you can run `.tools/just`
- When working locally, copy the `.env.sample` to `.env` and set any configuration variables in the `.env` file.
To run all the components directly on your dev machine, you will need to be using a Linux compatible system (e.g. WSL or macOS) with bash, just, Go, Docker & Node.js installed. You can try the provided devcontainer if you don't have these pre-reqs.
- Run `just install`
- Run `just run-all`
- The frontend should automatically open in your browser.
If you just want to try the app out, you can start the standalone image using Docker. This doesn't require you to have Go, Node.js etc
docker pull ghcr.io/benc-uk/nanomon-standalone:latest
docker run --rm -it -p 8000:8000 -p 8001:8001 ghcr.io/benc-uk/nanomon-standalone:latest
Then open the following URL http://localhost:8001/
- Written in Go, source code - /services/runner
- The runner requires a connection to MongoDB in order to start; it will exit if the connection fails.
- It keeps in sync with the `monitors` collection in the database in one of two ways:
  - Watching the collection using a MongoDB change stream. This mode is preferred as it results in instant updates when changes are made in the frontend UI.
  - If change streams aren't supported, the runner falls back to polling the database for changes.
- If configured, the runner will send email alerts; see the alerting section below.
- By default the runner doesn't listen for inbound network connections or bind to any ports, the exception being if Prometheus support is enabled.
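The choice between the two sync modes can be sketched as a small decision function. This is only an illustration under stated assumptions: `selectSyncMode` and the `openChangeStream` callback are hypothetical stand-ins (the real runner would call the MongoDB driver), and the `forcePolling` flag mirrors the `USE_POLLING` setting from the configuration reference.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoChangeStream simulates a database that cannot open a change stream,
// e.g. a MongoDB instance that is not running as a replica set.
var errNoChangeStream = errors.New("change streams not supported")

// selectSyncMode decides how the monitors collection is kept in sync.
// openChangeStream is a hypothetical stand-in for the real database call.
func selectSyncMode(forcePolling bool, openChangeStream func() error) string {
	if forcePolling {
		return "polling" // forced polling skips the change stream attempt
	}
	if err := openChangeStream(); err != nil {
		return "polling" // fall back when the change stream cannot be opened
	}
	return "change-stream"
}

func main() {
	supported := func() error { return nil }
	unsupported := func() error { return errNoChangeStream }

	fmt.Println(selectSyncMode(false, supported))   // change-stream
	fmt.Println(selectSyncMode(false, unsupported)) // polling
	fmt.Println(selectSyncMode(true, supported))    // polling
}
```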
- Written in Go, source code - /services/api
- The API requires a connection to MongoDB in order to start; it will exit if the connection fails.
- Listens on port 8000 by default.
- All routes are prefixed with `/api`; this makes it easier to put a path based HTTP router in front of the API and the SPA frontend.
- Makes use of the benc-uk/go-rest-api package.
- The API is RESTful, see the API folder for specifications and a sample .http file.
- By default there is no authentication or validation, and all API calls are allowed; see the authentication & security section for details.
- Written in "modern" ES6 JavaScript using Alpine.js for reactivity and as a lightweight SPA framework, source code - /frontend
- No bundling, webpack or Node is required 😊
- Vite is used but just as a dev-server for serving the site locally
- Configuration is fetched from the URL `/config` at start up.
  - When hosted by the frontend-host this allows for values to be dynamically passed to the frontend at runtime.
  - When running locally the just recipe `just run-frontend` builds a static config file to "fake" this config API.
- By default there is no authentication on the frontend; this makes the app easy to use for demos & workshops. However it can be enabled, see the authentication & security section for details. The MSAL library is used for auth, see MSAL.js 2.0 for Browser-Based SPAs.
- Written in Go, source code - /services/frontend (Note. Don't confuse with the `/frontend` directory)
- A simple static HTTP server for hosting & serving the content & files of the frontend app
- Listens on port 8001 by default.
- Provides a single special API endpoint served at `/config` which reflects back to the frontend certain environmental variables (see configuration below)
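A `/config` endpoint of this kind could look something like the sketch below. It is not the actual frontend host code: which variables are reflected and the JSON key names are assumptions, with the defaults taken from the configuration tables in this document. The handler is exercised with an in-memory recorder so the example runs without binding a port.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
)

// buildConfig collects the values the frontend needs. getenv is injected so
// the function is easy to test. The set of keys is an assumption.
func buildConfig(getenv func(string) string) map[string]string {
	cfg := map[string]string{
		"API_ENDPOINT":   "/api",
		"AUTH_CLIENT_ID": "",
		"AUTH_TENANT":    "common",
	}
	// Override each default with the environment value, when one is set.
	for key := range cfg {
		if v := getenv(key); v != "" {
			cfg[key] = v
		}
	}
	return cfg
}

// configHandler serves the configuration as JSON.
func configHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(buildConfig(os.Getenv)); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

func main() {
	// Call the handler directly instead of starting a real server.
	rec := httptest.NewRecorder()
	configHandler(rec, httptest.NewRequest(http.MethodGet, "/config", nil))
	fmt.Print(rec.Body.String())
}
```

In a real server the handler would be registered with something like `http.HandleFunc("/config", configHandler)` alongside the static file handler.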
🔸build # 🔨 Build all binaries into ./bin/ directory, not really needed
🔸clean # 🧹 Clean up, remove dev data and files
🔸format # 📝 Format source files and fix linting problems
🔸generate-specs # 🤖 Generate OpenAPI specs and JSON-Schemas using TypeSpec
🔸image-standalone # 📦 Build the special standalone all-in-one image
🔸images # 📦 Build all container images, using Docker compose
🔸install # 🔮 Install dev tools into project tools directory
🔸lint fix="false" # 🔍 Lint & format, default is to run lint check only and set exit code
🔸push # 📤 Push all container images
🔸run-all # 🚀 Run all services locally with hot-reload, plus MongoDB
🔸run-api # 🎯 Run the API service locally, with hot reloading
🔸run-db # 🍃 Run MongoDB in container (needs Docker)
🔸run-frontend # 🌐 Run frontend with Vite dev HTTP server & hot-reload
🔸run-runner # 🏃 Run the runner service locally, with hot reloading
🔸test # 🧪 Run all unit tests
🔸test-api report="false" # 🔬 Run API integration tests, using HttpYac
All three components (API, runner and frontend host) expect their configuration in the form of environmental variables. When running locally this is done via a `.env` file. Note. The `.env` file is not used when deploying or running the app elsewhere.
Name | Description | Default |
---|---|---|
API_ENDPOINT | Instructs the frontend SPA where to find the API | /api |
Name | Description | Default |
---|---|---|
MONGO_URI | Connection string for MongoDB | mongodb://localhost:27017 |
MONGO_DB | Database name to use | nanomon |
MONGO_TIMEOUT | Timeout for connecting to MongoDB | 30s |
Name | Description | Default |
---|---|---|
PORT | TCP port for service to listen on | 8000 & 8001 |
AUTH_CLIENT_ID | Used to enable authentication with given Azure AD app client ID. See auth section | blank |
AUTH_TENANT | Set to Azure AD tenant ID if not using common | common |
Note. All settings for alerting that begin with `ALERT_` are optional
Name | Description | Default |
---|---|---|
ALERT_SMTP_PASSWORD | For alerting, the password for mail server | blank |
ALERT_SMTP_FROM | From address for alerts, also used as the username | blank |
ALERT_SMTP_TO | Address alert emails are sent to | blank |
ALERT_SMTP_HOST | SMTP hostname | smtp.gmail.com |
ALERT_SMTP_PORT | SMTP port | 587 |
ALERT_FAIL_COUNT | How many times a monitor returns a non-OK status, to trigger an alert email | 3 |
ALERT_LINK_BASEURL | When hosting NanoMon and you want the link in alert emails to point to the correct URL | http://localhost:3000 |
POLLING_INTERVAL | Only used when in polling mode, when change stream isn't available | 10s |
USE_POLLING | Force polling mode, by default MongoDB change streams will be tried, and polling mode used if that fails. | false |
PROMETHEUS_ENABLE | Enable exporting metrics in Prometheus format (see below) | false |
PROMETHEUS_PORT | HTTP port used to serve the Prometheus metrics | 8080 |
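Reading settings like these with their defaults might be done as in the following sketch. The variable names and default values come from the tables above, but the `runnerConfig` struct and the helper functions are hypothetical and only cover a handful of the settings.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// runnerConfig is a hypothetical struct holding a subset of the settings.
type runnerConfig struct {
	MongoURI        string
	MongoDB         string
	MongoTimeout    time.Duration
	UsePolling      bool
	PollingInterval time.Duration
}

// envOr returns the value of key, or def when it is unset or empty.
func envOr(getenv func(string) string, key, def string) string {
	if v := getenv(key); v != "" {
		return v
	}
	return def
}

// loadRunnerConfig parses the environment, applying the documented defaults.
func loadRunnerConfig(getenv func(string) string) (runnerConfig, error) {
	timeout, err := time.ParseDuration(envOr(getenv, "MONGO_TIMEOUT", "30s"))
	if err != nil {
		return runnerConfig{}, fmt.Errorf("MONGO_TIMEOUT: %w", err)
	}
	interval, err := time.ParseDuration(envOr(getenv, "POLLING_INTERVAL", "10s"))
	if err != nil {
		return runnerConfig{}, fmt.Errorf("POLLING_INTERVAL: %w", err)
	}
	usePolling, err := strconv.ParseBool(envOr(getenv, "USE_POLLING", "false"))
	if err != nil {
		return runnerConfig{}, fmt.Errorf("USE_POLLING: %w", err)
	}
	return runnerConfig{
		MongoURI:        envOr(getenv, "MONGO_URI", "mongodb://localhost:27017"),
		MongoDB:         envOr(getenv, "MONGO_DB", "nanomon"),
		MongoTimeout:    timeout,
		UsePolling:      usePolling,
		PollingInterval: interval,
	}, nil
}

func main() {
	cfg, err := loadRunnerConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "bad configuration:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```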
NanoMon currently supports four types of monitor, which can be configured in various ways. This is a reference for each monitor type: its runtime behaviour, the properties that can be set, and the resulting outputs.
This makes a single HTTP request to the target URL each time it is run. It will return failed status in the event of network failure, e.g. no network connection, unable to resolve name with DNS, invalid URL etc. Otherwise any sort of HTTP response will return an OK status. If you want to check the HTTP response code, use a rule (see the rules section), e.g. `status == 200` or `status >= 200 && status < 300`.
- Target: A URL, with HTTP scheme `http://` or `https://`
- Value: Time to complete the HTTP request & read the response in milliseconds.
- Properties:
- method - Which HTTP method to use (default: "GET")
- timeout - Timeout interval e.g. "10s" or "500ms" (default: 5s)
- validateTLS - Set to "false" to disable TLS cert validation (default: "true")
- body - Body string to send with the HTTP request (default: none)
- headers - HTTP headers as JSON object, e.g. `{"content-type": "application/json"}` (default: none)
- bodyRegex - Run this regex against the body, and set the `regexMatch` output (default: none)
- Outputs / Rule Props:
- respTime - Same as monitor value (number)
- status - HTTP status code (number)
- body - The response body (string)
- bodyLen - The number of bytes in the response (number)
- certExpiryDays - Number of days before the TLS cert of the site expires (number)
- regexMatch - Match of the bodyRegex if any (number or string)
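The behaviour described for the HTTP monitor can be approximated with Go's standard library. This is a simplified sketch, not the runner's implementation: `checkHTTP` and `httpResult` are made-up names, only the method and timeout properties are handled, and the demo uses a local test server so it runs without network access. Note that a 404 response still yields an OK status, since only network-level problems count as failed.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// httpResult is a hypothetical holder for the status and a few outputs.
type httpResult struct {
	Status   string // "OK" or "Failed"
	Message  string
	RespTime int64 // milliseconds to complete the request and read the body
	Code     int   // HTTP status code
	BodyLen  int   // number of bytes in the response body
}

// checkHTTP makes one request; any HTTP response at all counts as OK.
func checkHTTP(method, target string, timeout time.Duration) httpResult {
	req, err := http.NewRequest(method, target, nil)
	if err != nil {
		return httpResult{Status: "Failed", Message: err.Error()} // e.g. invalid URL
	}
	client := &http.Client{Timeout: timeout}
	start := time.Now()
	resp, err := client.Do(req)
	if err != nil {
		return httpResult{Status: "Failed", Message: err.Error()} // network or DNS failure
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return httpResult{Status: "Failed", Message: err.Error()}
	}
	return httpResult{
		Status:   "OK",
		RespTime: time.Since(start).Milliseconds(),
		Code:     resp.StatusCode,
		BodyLen:  len(body),
	}
}

func main() {
	// A local server that always answers 404, standing in for a real target.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNotFound)
		fmt.Fprint(w, "not here")
	}))
	defer srv.Close()

	res := checkHTTP("GET", srv.URL, 5*time.Second)
	fmt.Printf("status=%s code=%d bodyLen=%d\n", res.Status, res.Code, res.BodyLen)
}
```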
Each time a TCP monitor runs it attempts to open a TCP connection to the given host on the given port. It will return failed status in the event of network/connection failure, DNS resolution failure, or if the port is closed or blocked. Otherwise it will return OK.
- Target: A hostname (or IP address) and port tuple, separated by colon
- Value: Time for TCP connection to open in milliseconds.
- Properties:
- timeout - Timeout interval e.g. "10s" or "500ms" (default: 5s)
- Outputs / Rule Props:
- respTime - Same as monitor value (number)
- ipAddress - Resolved IP address of the target (string)
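A TCP check along these lines can be written with `net.DialTimeout`. Again this is an illustrative sketch under assumptions: `checkTCP` and `tcpResult` are hypothetical names, and the demo connects to a listener opened on a random local port rather than a real target.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// tcpResult is a hypothetical holder for the status and outputs.
type tcpResult struct {
	Status    string // "OK" or "Failed"
	Message   string
	RespTime  int64  // milliseconds for the connection to open
	IPAddress string // resolved IP address of the target
}

// checkTCP tries to open a TCP connection to a "host:port" target.
func checkTCP(target string, timeout time.Duration) tcpResult {
	start := time.Now()
	conn, err := net.DialTimeout("tcp", target, timeout)
	if err != nil {
		// Covers DNS failures, refused connections, timeouts and bad targets.
		return tcpResult{Status: "Failed", Message: err.Error()}
	}
	defer conn.Close()

	res := tcpResult{Status: "OK", RespTime: time.Since(start).Milliseconds()}
	if addr, ok := conn.RemoteAddr().(*net.TCPAddr); ok {
		res.IPAddress = addr.IP.String()
	}
	return res
}

func main() {
	// Listen on a random local port so the demo has something to connect to.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("could not start demo listener:", err)
		return
	}
	defer ln.Close()

	fmt.Printf("%+v\n", checkTCP(ln.Addr().String(), 5*time.Second))
}
```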
This monitor will send one or more ICMP ping packets to the given host or IP address. It will return failed status in the event of network/connection failure or if it is unable to resolve the name with DNS. Otherwise it will return OK.
Note. As this monitor needs to send ICMP packets, the runner process needs certain OS privileges to do that, otherwise you will see `socket: operation not permitted` errors. When running inside a container it runs as root so there is no issue. When running locally, if you want to use this monitor type, build the runner binary with `just build` then start the runner process with sudo, e.g. `sudo ./bin/runner`
- Target: A hostname or IP address.
- Value: Average round trip time in milliseconds.
- Properties:
- timeout - Timeout interval e.g. "10s" or "500ms" (default: 1s)
- count - Number of packets to send (default: 3)
- interval - Interval between packets (default: 150ms)
- Outputs / Rule Props:
- minRtt - Min round trip time of the packets (number)
- avgRtt - Avg round trip time of the packets (number)
- maxRtt - Max round trip time of the packets (number)
- packetsRecv - How many packets were received (number)
- packetLoss - Percentage of packets that were lost (number)
- ipAddress - Resolved IP address of the target (string)
The DNS monitor looks up DNS records and returns the results as outputs. If the name fails to resolve it will return failed status, otherwise it will return OK.
- Target: The domain or hostname you want to lookup in DNS
- Value: Time for lookup to complete
- Properties:
- timeout - Timeout interval e.g. "500ms" (default: 2s)
- type - Type of DNS record to query, one of: 'A', 'CNAME', 'TXT', 'MX' or 'NS' (default: 'A')
- server - Hostname or IP of DNS server to use for querying (default: use the DNS server configured in the OS of the runner)
- network - Network to use and sorts of addresses to return, one of; 'ip4', 'ip6' or 'ip' (default: 'ip')
- Outputs / Rule Props:
- respTime - Same as monitor value (number)
- resultCount - Number of records returned from the query (number)
- result1, result2 etc - Each result of the query returned as a separate numbered output (string)
All monitor types have a rule property as part of their configuration. This rule is a logical expression which is evaluated after each run. You can use any of the outputs in this expression in order to set the result status of the run.
The rule expression should always return a boolean. A false value will set the result to error status; anything else will leave the status as is (i.e. OK or failed). You can use a range of operators in the rule expression, such as logical `AND`, `OR`, `NOT` etc. plus other advanced operators like `=~` for regex searching (e.g. string contains).
Some rule examples:
status >= 200 && status < 300 # Check for OK range of HTTP status codes
status == 200 && respTime < 5000 # Check status code and response time
'93.184.215.14' IN (result1, result2) # Check IP in multiple DNS results
body =~ 'some words' # Look for a string in the HTTP body
regexMatch == 'a value' # Check the value of the RegEx match
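How a rule result feeds into the run status can be sketched in Go as below. This does not show the real expression engine: a plain Go function stands in for the parsed rule expression, the names `applyRule`, `rule` and `outputs` are invented for the example, and skipping the rule for a failed run is an assumption (a run that never completed has no outputs to test).

```go
package main

import "fmt"

// outputs holds the named values produced by a monitor run.
type outputs map[string]any

// rule is a stand-in for a parsed rule expression.
type rule func(outputs) bool

// applyRule returns the final status after evaluating the rule.
func applyRule(status string, r rule, out outputs) string {
	if status == "Failed" || r == nil {
		return status // assumption: nothing to evaluate for a failed run
	}
	if !r(out) {
		return "Error" // a false rule result downgrades the run to Error
	}
	return status // anything else leaves the status as is
}

// statusInOKRange mirrors the rule: status >= 200 && status < 300
func statusInOKRange(out outputs) bool {
	code, ok := out["status"].(int)
	return ok && code >= 200 && code < 300
}

func main() {
	fmt.Println(applyRule("OK", statusInOKRange, outputs{"status": 200}))
	fmt.Println(applyRule("OK", statusInOKRange, outputs{"status": 404}))
	fmt.Println(applyRule("Failed", statusInOKRange, outputs{}))
}
```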
By default there is no authentication, security or user sign-in. This is by design to make the app easy to deploy, and for use in learning scenarios and workshops.
Security is enabled using the Microsoft Identity Platform (now called Microsoft Entra ID) and OAuth2 + OIDC. To enable it, register an app in Entra ID, then pass the app's client id as `AUTH_CLIENT_ID` to the NanoMon services. Setting this changes the behaviour of the application as follows:
- API - Will enforce validation on certain API routes, like POST, PUT and DELETE, using OAuth 2.0 JWT bearer tokens. The token is checked for validity as follows: it contains a scope matching `system.admin` and has an audience matching the client id.
- Frontend host - The UI will show a sign-in button and only allow signed-in users to create, edit or delete monitors. Access tokens are fetched from Entra ID for the signed-in user with the `system.admin` scope, and then passed when calling the API as bearer tokens.
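The two claim checks described for the API (scope and audience) might look like the sketch below. It is deliberately incomplete and should not be read as the API's real validation: signature, issuer and expiry checks are omitted entirely, the `tokenClaims` struct is hypothetical, and it assumes the scope arrives as a space-separated string, as in the `scp` claim of Entra ID access tokens.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenClaims is a hypothetical view of an already verified and decoded JWT.
type tokenClaims struct {
	Audience string // the "aud" claim
	Scope    string // the "scp" claim, a space-separated list of scopes
}

const requiredScope = "system.admin"

// isAuthorised applies the audience and scope checks only.
// Signature, issuer and expiry validation are NOT shown here.
func isAuthorised(c tokenClaims, clientID string) bool {
	if clientID == "" {
		return true // no client id configured, so auth is disabled
	}
	if c.Audience != clientID {
		return false
	}
	for _, s := range strings.Fields(c.Scope) {
		if s == requiredScope {
			return true
		}
	}
	return false
}

func main() {
	good := tokenClaims{Audience: "my-client-id", Scope: "openid system.admin"}
	bad := tokenClaims{Audience: "my-client-id", Scope: "openid"}
	fmt.Println(isAuthorised(good, "my-client-id"))
	fmt.Println(isAuthorised(bad, "my-client-id"))
}
```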
A basic guide to set this up:
- Register a new app in Microsoft Entra ID. This needs to have the API scope `system.admin` exposed, and also be set with the correct SPA redirect URLs. To simplify this creation, use the provided bash script `./scripts/aad-app-reg.sh`, or use the portal.
- Test locally - Put the provided client id as the `AUTH_CLIENT_ID` in your `.env` file, then (re)start the API with `just run-api` and the frontend with `just run-frontend`. You should see a login button on the page, and no way to create or edit monitors until you sign in.
- For deploying elsewhere:
  - Get the frontend URL of the deployed running instance.
  - Add this URL to the SPA redirect URIs in the app registration. It's probably easiest doing this in the Azure Portal.
  - Update or redeploy the app, setting `AUTH_CLIENT_ID` on both the frontend host and API containers.
NanoMon provides basic alerting support, which sends emails when a monitor returns a non-OK status a set number of times in a row (`ALERT_FAIL_COUNT`, default 3). By default this alerting feature is not enabled, and failing monitors will not trigger emails.
To enable alerting, the env vars starting `ALERT_` need to be set, as described above. However, as several of these variables have defaults, you only need to set the remaining three, `ALERT_SMTP_PASSWORD`, `ALERT_SMTP_FROM` and `ALERT_SMTP_TO`, to switch the feature on; this will use GMail to send emails. For the password you will need to set up a Google app password. This will use your personal Google account to send the emails, so this probably isn't a good option for production (putting it mildly).
Limitations:
- Only been tested with the GMail SMTP server, I have no idea if it'll work with others! ¯\_(ツ)_/¯
- The from address is also used as the login user to the SMTP server.
- Only a single email address can be set to send emails to.
- Restarting the runner will resend alerts for failing monitors.
- No follow up email is sent when a monitor returns to OK.
- The services will dynamically create the database and collections if they don't exist at startup. By default the database name is `nanomon` but this can be changed with the `MONGO_DB` env var.
- For change stream support to work MongoDB must be running as a replica set. When running locally this is enabled in the Docker container that is started. The Helm chart will also deploy MongoDB as a replica set.
Azure Cosmos DB can be used as a database for NanoMon, however there are two things to consider:
- An index must be added for the `date` field to the results collection, this can be done in the Azure Portal or with a single command:
az cosmosdb mongodb collection update -a $COSMOS_ACCOUNT -g $COSMOS_RG -d nanomon -n results --idx '[{"key":{"keys":["_id"]}},{"key":{"keys":["date"]}}]'
- Cosmos DB for MongoDB does have support for change streams, however it comes with several limitations, most notably the lack of support for delete events. Given these limitations NanoMon will fall back to polling when using Cosmos DB
NanoMon has support for Prometheus metrics, which are exposed from the runner service via HTTP in the standard text-based exposition format. When configuring NanoMon as a scraping target use the URL `http://<runner-host>:8080/metrics` (the port can be changed with `PROMETHEUS_PORT`).
This feature is disabled by default and is enabled by setting the `PROMETHEUS_ENABLE` env var. When enabled, the metrics can be fetched/scraped from the `/metrics` endpoint. The active monitors will be provided as labelled Prometheus gauges (one gauge per monitor). The gauge values hold the monitor status (0 = OK, 1 = Error, 2 = Failed) and each numeric monitor output, distinguished by the `result` label (string outputs are not applicable to Prometheus).
Using Prometheus means you may not need to run the NanoMon frontend, as you can visualize the data through other tools, and optionally enable things like the Prometheus alerts.
Example of metrics
# HELP nanomon_example_monitor Example Monitor (http)
# TYPE nanomon_example_monitor gauge
nanomon_example_monitor{id="6722474d0c73d60184f14c73",result="_status",type="http"} 0
nanomon_example_monitor{id="6722474d0c73d60184f14c73",result="_value",type="http"} 178
nanomon_example_monitor{id="6722474d0c73d60184f14c73",result="bodyLen",type="http"} 15256
nanomon_example_monitor{id="6722474d0c73d60184f14c73",result="respTime",type="http"} 178
nanomon_example_monitor{id="6722474d0c73d60184f14c73",result="status",type="http"} 404
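Lines in this format are straightforward to produce by hand, as the sketch below shows. It is only a guess at the naming scheme inferred from the example output (lower-cased monitor name with non-alphanumeric characters replaced by underscores and a `nanomon_` prefix); the functions `metricName` and `metricLine` are hypothetical, and a real exporter would more likely use a Prometheus client library.

```go
package main

import (
	"fmt"
	"strings"
)

// metricName derives a gauge name from a monitor name.
// The scheme is an assumption inferred from the sample output.
func metricName(monitorName string) string {
	var b strings.Builder
	b.WriteString("nanomon_")
	for _, r := range strings.ToLower(monitorName) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		} else {
			b.WriteRune('_') // replace spaces and punctuation
		}
	}
	return b.String()
}

// metricLine formats one sample in the text-based exposition format.
func metricLine(name, id, result, monitorType string, value float64) string {
	return fmt.Sprintf("%s{id=%q,result=%q,type=%q} %v", name, id, result, monitorType, value)
}

func main() {
	name := metricName("Example Monitor")
	id := "6722474d0c73d60184f14c73"
	fmt.Printf("# HELP %s Example Monitor (http)\n", name)
	fmt.Printf("# TYPE %s gauge\n", name)
	fmt.Println(metricLine(name, id, "_status", "http", 0))
	fmt.Println(metricLine(name, id, "respTime", "http", 178))
}
```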