refactor: observability post
Vinicius Morais Dutra committed May 2, 2023
1 parent 5717c44 commit 65bd518
Showing 8 changed files with 26,717 additions and 19,674 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@ logs
npm-debug.log*
yarn-debug.log*
yarn-error.log*
.yarn

# Runtime data
pids
1 change: 0 additions & 1 deletion .prettierrc.json

This file was deleted.

1 change: 1 addition & 0 deletions .yarnrc.yml
@@ -0,0 +1 @@
nodeLinker: node-modules
64 changes: 37 additions & 27 deletions content/posts/obervability-metrics-and-monitoring/index.mdx
@@ -15,27 +15,30 @@ tags:
- 2. [What are Metrics?](#what-are-metrics)
- 3. [What is Distributed Tracing?](#what-is-distributed-tracing)
- [What is Monitoring?](#what-is-monitoring)
- [Security Information Event Management (SIEM)](#what-is-security-information-event-management)

> “You can't manage what you can't measure.”
>
> -- <cite>Peter Drucker</cite>
---

# <a name="what-is-observability"></a> What is Observability?
# <a name="what-is-observability"></a>👀 What is Observability?
In control theory, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs.
Observability enables us to ask why.
`Observability enables us to ask why.`

Observability lets us understand a system from the outside by asking questions about that system without knowing its inner workings.
It also lets us troubleshoot novel problems more easily and helps us answer the question "Why is this happening?".
Observability lets us understand a system from the outside by asking questions about that system without knowing its inner workings.
It also lets us troubleshoot novel problems more easily and helps us answer the question "Why is this happening?".
In order to ask those questions of a system, the application must be properly instrumented.
That is, the application code must emit signals such as traces, metrics, and logs.
That is, the application code must emit signals such as traces, metrics, and logs.
An application is properly instrumented when developers don't need to add more instrumentation to troubleshoot an issue, because they have all of the information they need.

Three Pillars:
- Metrics
- Logs (a register of an event that occurred)
- Tracing
### Three Pillars of Observability:
| Pillar | Description |
| ------- | ----------- |
| Metrics | Quantitative measurements commonly used for comparing and tracking performance or production. |
| Logs | A record of an event that occurred. |
| Tracing | The mechanism for following a program's flow and data progression. |

## <a name="what-are-logs"></a> 1. What are Logs?
Logs are records of events that occurred in the software. Logging is the act of keeping a log.
@@ -52,26 +55,32 @@ Metrics are values captured on your systems at a specific point in time. <br />

Metrics can be collected once per second, once per minute, or at another regular interval to monitor a system over time.

Two important categories of metrics:
- Work Metrics
- Resource Metrics

**Work Metrics**
- **THROUGHPUT:** is the amount of work the system is doing per unit time. Throughput is usually recorded as an absolute number.
- **SUCCESS:** metrics represent the percentage of work that was executed successfully.
- **ERROR:** metrics capture the number of erroneous results, usually expressed as a rate of errors per unit time or normalized by the throughput to yield errors per unit of work. Error metrics are often captured separately from success metrics when there are several potential sources of error, some of which are more serious or actionable than others.
- **PERFORMANCE:** metrics quantify how efficiently a component is doing its work. The most common performance metric is latency, which represents the time required to complete a unit of work. Latency can be expressed as an average or as a percentile, such as “99% of requests returned within 0.1s”.

**Resource Metrics**
- **UTILIZATION:** is the percentage of time that the resource is busy, or the percentage of the resource's capacity that is in use.
- **SATURATION:** is a measure of the amount of requested work that the resource cannot yet service, often queued.
- **ERROR:** represent internal errors that may not be observable in the work the resource produces.
- **AVAILABILITY:** represents the percentage of time that the resource responded to requests. This metric is only well-defined for resources that can be actively and regularly checked for availability.
##### 2.1. Work Metrics
| Metric | Description |
| ------ | ----------- |
| Throughput | → is the amount of work the system is doing per unit time. Throughput is usually recorded as an absolute number. |
| Success | → metrics represent the percentage of work that was executed successfully. |
| Error | → metrics capture the number of erroneous results, usually expressed as a rate of errors per unit time or normalized by the throughput to yield errors per unit of work. Error metrics are often captured separately from success metrics when there are several potential sources of error, some of which are more serious or actionable than others. |
| Performance | → metrics quantify how efficiently a component is doing its work. The most common performance metric is latency, which represents the time required to complete a unit of work. Latency can be expressed as an average or as a percentile, such as “99% of requests returned within 0.1s”. |
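
As a rough illustration of the work metrics described in this table, the sketch below derives throughput, success rate, error rate, and a 99th-percentile latency from a batch of request records. The `RequestRecord` shape, the `summarize` helper, and the nearest-rank percentile method are assumptions made for the example; they are not part of this commit or of any particular monitoring library.

```typescript
// Hypothetical shape of one completed request; field names are illustrative only.
interface RequestRecord {
  durationMs: number; // time taken to complete the unit of work
  ok: boolean; // whether the work was executed successfully
}

interface WorkMetrics {
  throughputPerSec: number; // units of work per second
  successRate: number; // fraction of work that succeeded (0..1)
  errorRate: number; // errors normalized by the amount of work (0..1)
  p99LatencyMs: number; // 99th-percentile latency (nearest-rank method)
}

// Nearest-rank percentile over an ascending-sorted copy of the values.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize the records observed during a window of `windowSec` seconds.
function summarize(records: RequestRecord[], windowSec: number): WorkMetrics {
  const total = records.length;
  const failures = records.filter((r) => !r.ok).length;
  return {
    throughputPerSec: total / windowSec,
    successRate: total === 0 ? 1 : (total - failures) / total,
    errorRate: total === 0 ? 0 : failures / total,
    p99LatencyMs: percentile(records.map((r) => r.durationMs), 99),
  };
}
```

Reporting latency as a percentile rather than an average keeps a small number of very slow requests visible, which matches the "99% of requests returned within 0.1s" phrasing in the table.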

##### 2.2. Resource Metrics
| Metric | Description |
| ------ | ----------- |
| Saturation | → metrics measure the degree to which a resource has been consumed. |
| Utilization | → metrics measure the percentage of the time that a resource was busy or active. |
| Availability | → metrics measure the percentage of time that a system or resource was ready to perform work. |
| Capacity | → metrics measure the total capacity of a resource or group of resources. |
| Count | → metrics measure the number of occurrences of a particular event, usually expressed as a rate of events per unit time. |
| Time | → metrics measure the amount of time that a system or resource was in a particular state. |
| Error | → metrics measure the number of error events that occurred, usually expressed as a rate of errors per unit time. |

#### Good Data
- **Well-understood.** You should be able to quickly determine how each metric or event was captured and what it represents. During an outage you won’t want to spend time figuring out what your data means. Keep your metrics and events as simple as possible, use standard concepts described above, and name them clearly.

- **Granular.** If you collect metrics too infrequently or average values over long windows of time, you may lose the ability to accurately reconstruct a system’s behavior. For example, periods of 100% resource utilization will be obscured if they are averaged with periods of lower utilization. Collect metrics for each system at a frequency that will not conceal problems, without collecting so often that monitoring becomes perceptibly taxing on the system (the observer effect) or creates noise in your monitoring data by sampling time intervals that are too short to contain meaningful data.

- **Tagged by scope.** Each of your hosts operates simultaneously in multiple scopes, and you may want to check on the aggregate health of any of these scopes, or their combinations. For example: how is production doing in aggregate? How about production in the Northeast U.S.? How about a particular role or service? It is important to retain the multiple scopes associated with your data so that you can alert on problems from any scope, and quickly investigate outages without being limited by a fixed hierarchy of hosts.

- **Long-lived.** If you discard data too soon, or if after a period of time your monitoring system aggregates your metrics to reduce storage costs, then you lose important information about what happened in the past. Retaining your raw data for a year or more makes it much easier to know what “normal” is, especially if your metrics have monthly, seasonal, or annual variations.
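
The "Granular" point above can be sketched with a small calculation: the same utilization samples are averaged over one coarse window and over several finer windows. The sample data and the `windowAverages` helper are hypothetical and exist only to show how a long averaging window can hide a period of 100% utilization.

```typescript
// Average the samples over consecutive fixed-size windows of `windowSize` samples.
function windowAverages(samples: number[], windowSize: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < samples.length; i += windowSize) {
    const window = samples.slice(i, i + windowSize);
    out.push(window.reduce((sum, v) => sum + v, 0) / window.length);
  }
  return out;
}

// Hypothetical per-second CPU utilization (%) for one minute:
// a 10-second burst at 100%, otherwise 10%.
const perSecond: number[] = Array.from({ length: 60 }, (_, s) =>
  s >= 20 && s < 30 ? 100 : 10,
);

const oneMinute = windowAverages(perSecond, 60); // one coarse value: [25]
const tenSecond = windowAverages(perSecond, 10); // [10, 10, 100, 10, 10, 10]
```

With a one-minute window the burst shows up only as a 25% average, while ten-second windows still expose the saturated interval.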

## <a name="what-is-distributed-tracing"></a> 3. What is Distributed Tracing?
@@ -82,17 +91,18 @@ Metrics give you a broad overview of how the system is performing. Tracing uses

---

# <a name="what-is-monitoring"></a> What is Monitoring?
## <a name="what-is-monitoring"></a> 🔎 What is Monitoring?
Monitoring is the process of collecting logs and metrics data to observe the behavior of software and detect abnormal behaviors.
Monitoring depends on knowing in advance which signals you want to monitor.
Monitoring shows that something is wrong.

## <a name="what-is-security-information-event-management"></a> What is Security Information Event Management (SIEM)?
## <a name="what-is-security-information-event-management"></a> 🔐 Security Information Event Management (SIEM)?
SIEM is a specialized telemetry system for use by security and compliance teams, and a specialization of centralized logging and metrics.
The technique was in use long before the term was formalized in the mid-2000s.
It helps an organization detect, analyze, and act against security issues before they harm the company.

---
#### Sources
- <a target="_blank" href="https://www.amazon.com.br/Software-Telemetry-Reliable-Logging-Monitoring/dp/161729814X">Software Telemetry: Reliable Logging and Monitoring</a>
- <a target="_blank" href="https://www.datadoghq.com/blog/monitoring-101-collecting-data/">Datadog</a>
- <a target="_blank" href="https://www.datadoghq.com/blog/monitoring-101-collecting-data/">Datadog</a>
- <a target="_blank" href="https://www.bmc.com/blogs/monitoring-logging-tracing/">BMC</a>
10 changes: 5 additions & 5 deletions gatsby-config.js
@@ -43,7 +43,7 @@ module.exports = {
},
{
name: `LinkedIn`,
url: `https://www.linkedin.com/in/vinicius-morais-dutra-5260bb116/`
url: `https://www.linkedin.com/in/software-engineer-vinicius-morais-dutra/`
}
]
}
@@ -80,9 +80,9 @@ module.exports = {
{
resolve: `gatsby-plugin-manifest`,
options: {
name: `minimal-blog - @lekoarts/gatsby-theme-minimal-blog`,
short_name: `minimal-blog`,
description: `Typography driven, feature-rich blogging theme with minimal aesthetics. Includes tags/categories support and extensive features for code blocks such as live preview, line numbers, and code highlighting.`,
name: `Software Engineering Blog`,
short_name: `software-engineering-blog`,
description: `I hope to share some techs knowledgment and contribute to the community.`,
start_url: `/`,
background_color: `#fff`,
// This will impact how browsers show your PWA/website
@@ -147,7 +147,7 @@ module.exports = {
}
`,
output: `rss.xml`,
title: `Minimal Blog - @lekoarts/gatsby-theme-minimal-blog`
title: `@vinicinbgs blog RSS Feed`
}
]
}
2 changes: 1 addition & 1 deletion src/components/Comments/index.tsx
@@ -26,7 +26,7 @@ const Comments = ({ path }: Props) => {
script.setAttribute('issue-term', path ?? '/homepage');

document.getElementById('comments')?.appendChild(script);
});
}, []);

return <div id="comments"></div>;
};
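
The change from `});` to `}, []);` reads like adding an empty dependency array to a React `useEffect` call; the call itself lies outside this hunk, so that is an assumption. Under that assumption, the sketch below is a simplified model of the dependency rule and not React's actual implementation. `shouldRunEffect` and `countEffectRuns` are hypothetical helpers written only for this illustration.

```typescript
type Deps = readonly unknown[] | undefined;

// Simplified model of the rule: with no dependency array the effect runs after
// every render; with an array it runs on the first render and whenever any
// entry differs from the previous render (compared with Object.is).
function shouldRunEffect(prev: Deps | null, next: Deps): boolean {
  if (next === undefined) return true; // no dependency array supplied
  if (prev === null || prev === undefined) return true; // first render
  if (prev.length !== next.length) return true;
  return next.some((dep, i) => !Object.is(dep, prev[i]));
}

// Count how often the effect would run across `renders` renders
// when the same `deps` value is passed on each render.
function countEffectRuns(renders: number, deps: Deps): number {
  let prev: Deps | null = null;
  let runs = 0;
  for (let i = 0; i < renders; i++) {
    if (shouldRunEffect(prev, deps)) runs++;
    prev = deps;
  }
  return runs;
}

const withoutArray = countEffectRuns(3, undefined); // runs on every render
const withEmptyArray = countEffectRuns(3, []); // runs only on the first render
```

In this model the version without an array would append the comments script on every render, while the empty array limits it to the first render, which appears to be the intent of the change.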
2 changes: 1 addition & 1 deletion static/robots.txt
@@ -1,3 +1,3 @@
User-agent: *

sitemap: https://minimal-blog.lekoarts.de/sitemap.xml
sitemap: https://beadev.net/sitemap.xml