Merged
4 changes: 4 additions & 0 deletions README.md
@@ -290,6 +290,10 @@ Validation middleware returns HTTP `400` with the shape:

See `docs/backend/request-validation-framework.md` for implementation details and security notes.

## Documentation

- [SLA/SLO Definitions and Alert Thresholds](docs/backend/SLA_SLO.md)

## Contributing

1. Fork the repo and create a branch: `git checkout -b feature/<ticket>-description`
55 changes: 55 additions & 0 deletions docs/backend/SLA_SLO.md
@@ -0,0 +1,55 @@
# Service Level Agreements (SLA) & Service Level Objectives (SLO)

This document outlines the Service Level Objectives (SLOs) and alert thresholds implemented for TalentTrust Backend operations. Defining these metrics lets us track performance proactively, maintain high availability, and alert reliably when the service degrades.

## Core Concepts

- **Service Level Agreement (SLA)**: A contractual commitment to maintain specific performance benchmarks. This backend focuses on providing the tools and metrics to enforce internal SLAs.
- **Service Level Objective (SLO)**: Our internal target for a given operation. For example, maintaining a `99.9%` success rate on the API.
- **Alert Thresholds**: The limits on error rate or response latency at which degradation requires developer or operator intervention.

## Defined Objectives

### Health Check (`/health`)
- **Target Success Rate**: 99.99%
- **Target Latency (P95)**: 50ms
- **Alert Trigger**: Error Rate $\ge$ 0.1% OR Average Latency $\ge$ 150ms over a 5-minute rolling window.

### Contracts API (`/api/v1/contracts`)
- **Target Success Rate**: 99.9%
- **Target Latency (P95)**: 200ms
- **Alert Trigger**: Error Rate $\ge$ 1.0% OR Average Latency $\ge$ 400ms over a 5-minute rolling window.
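
To make the success-rate targets concrete, the percentages can be converted into a number of failed requests that still meets the target. The following is a minimal sketch; `allowedFailures` is a hypothetical helper that is not part of this PR, and the request volume of one million is an arbitrary illustration.

```typescript
// Hypothetical helper (not in the PR): how many failed requests a
// success-rate target tolerates for a given request volume.
function allowedFailures(targetSuccessRatePercent: number, totalRequests: number): number {
  const raw = (totalRequests * (100 - targetSuccessRatePercent)) / 100;
  // Trim binary floating-point noise (e.g. 999.9999999999) before flooring.
  return Math.floor(Number(raw.toFixed(6)));
}

// Targets taken from the objectives above; the volume is illustrative.
const healthCheckAllowance = allowedFailures(99.99, 1000000);
const contractsAllowance = allowedFailures(99.9, 1000000);

console.log({ healthCheckAllowance, contractsAllowance });
```

Read this way, the 99.99% and 99.9% targets correspond to error rates of 0.01% and 0.1%, while the alert triggers fire at 0.1% and 1.0%. In both cases the alert limit is ten times the rate implied by the target, so an alert signals a pronounced degradation rather than a marginal miss of the objective.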

## Usage in Codebase

The definitions are maintained in `src/operations/service-objectives.ts`.
Types, thresholds, and objectives are documented with NatSpec-style comment tags (`@title`, `@dev`, `@param`, `@returns`), a convention borrowed from Solidity documentation.

```typescript
import { isThresholdBreached, DefaultAlertThresholds } from './operations/service-objectives';

// Placeholder values; in practice these come from a metrics store such as Prometheus or from an APM.
const currentErrorRate = 0.4; // percent, over the evaluation window
const currentAverageLatencyMs = 180; // milliseconds, over the evaluation window

// Example: evaluating whether a breach occurred
const hasBreached = isThresholdBreached(
  DefaultAlertThresholds.contractsApi,
  currentErrorRate,
  currentAverageLatencyMs
);

if (hasBreached) {
  // Trigger PagerDuty, Slack alert, or perform fallbacks
}
```
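
The two numeric inputs have to be derived from raw measurements before `isThresholdBreached` can be called. The following is a minimal sketch, assuming a hypothetical `WindowSample` shape that holds counters for one evaluation window. Neither `WindowSample`, `toMetrics`, nor `breaches` exists in this PR; the threshold comparison is re-declared locally only so the snippet stands alone.

```typescript
// Hypothetical shape for counters collected over one evaluation window.
interface WindowSample {
  totalRequests: number;
  failedRequests: number;
  totalLatencyMs: number; // sum of response times across all requests
}

// Local stand-in for the two limit fields of AlertThreshold.
interface ThresholdLimits {
  maxErrorRatePercent: number;
  maxAverageLatencyMs: number;
}

// Convert raw counters into the percentage and average the check expects.
function toMetrics(sample: WindowSample): { errorRatePercent: number; averageLatencyMs: number } {
  if (sample.totalRequests === 0) {
    // No traffic in the window: report zeros rather than dividing by zero.
    return { errorRatePercent: 0, averageLatencyMs: 0 };
  }
  return {
    errorRatePercent: (sample.failedRequests / sample.totalRequests) * 100,
    averageLatencyMs: sample.totalLatencyMs / sample.totalRequests,
  };
}

// Same comparison as isThresholdBreached: reaching a limit counts as a breach.
function breaches(limits: ThresholdLimits, errorRatePercent: number, averageLatencyMs: number): boolean {
  return errorRatePercent >= limits.maxErrorRatePercent || averageLatencyMs >= limits.maxAverageLatencyMs;
}

// Limits copied from the Contracts API alert trigger; the sample is made up.
const contractsLimits: ThresholdLimits = { maxErrorRatePercent: 1.0, maxAverageLatencyMs: 400 };
const sample: WindowSample = { totalRequests: 2000, failedRequests: 10, totalLatencyMs: 500000 };

const metrics = toMetrics(sample);
const contractsBreached = breaches(contractsLimits, metrics.errorRatePercent, metrics.averageLatencyMs);

console.log(metrics, contractsBreached);
```

Treating an empty window as zero error rate and zero latency is one possible choice; a deployment that expects constant traffic may prefer to alert on missing data instead.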

## Expanding Metrics

To define SLOs for new API routes or operations:
1. Open `src/operations/service-objectives.ts`.
2. Define a new entry inside `DefaultServiceObjectives`.
3. Define the corresponding alerting limits in `DefaultAlertThresholds`.
4. Ensure relevant automated tests run cleanly.
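
The steps above can be sketched as follows. The `exampleRoute` key and all of its numbers are hypothetical placeholders, not values from this PR, and the enum and interfaces are trimmed local copies so that the snippet stands alone; in the repository the new entries would go directly into `DefaultServiceObjectives` and `DefaultAlertThresholds`.

```typescript
// Trimmed local copies; the real declarations live in
// src/operations/service-objectives.ts.
enum OperationType {
  API_REQUEST = 'API_REQUEST',
}

interface ServiceObjective {
  operationType: OperationType;
  targetSuccessRatePercent: number;
  targetLatencyP95Ms: number;
  targetLatencyP99Ms: number;
}

interface AlertThreshold {
  operationType: OperationType;
  maxErrorRatePercent: number;
  maxAverageLatencyMs: number;
  evaluationWindowSeconds: number;
}

// Step 2: a hypothetical objective under an illustrative key.
const objectives: Record<string, ServiceObjective> = {
  exampleRoute: {
    operationType: OperationType.API_REQUEST,
    targetSuccessRatePercent: 99.5, // illustrative value
    targetLatencyP95Ms: 300, // illustrative value
    targetLatencyP99Ms: 800, // illustrative value
  },
};

// Step 3: matching alert limits registered under the same key.
const thresholds: Record<string, AlertThreshold> = {
  exampleRoute: {
    operationType: OperationType.API_REQUEST,
    maxErrorRatePercent: 2.0, // illustrative value
    maxAverageLatencyMs: 600, // illustrative value
    evaluationWindowSeconds: 300,
  },
};

// Step 4 (in miniature): every objective should have a threshold under the same key.
const missingThresholds = Object.keys(objectives).filter((key) => !(key in thresholds));

console.log(missingThresholds);
```

The existing tests iterate over `Object.values(...)` of both registries, so a new entry is covered by the range and ordering checks (for example, p95 not exceeding p99) without any test changes.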

## Security and Threat Assumptions

SLO and SLA data can be sensitive, and alerting depends on its integrity, so the following assumptions apply:
- **Metric Spoofing**: Metrics passed to `isThresholdBreached` must come from trusted internal observability tools, so that external sources cannot inject values that trigger false-positive alerts and flood the monitoring channel.
- **Efficiency**: Evaluation logic is synchronous, runs in $O(1)$ time, and is kept lightweight so that threshold checks do not become a bottleneck during high-load periods.
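
One way to act on the trusted-input assumption is to validate metric values before they are compared. In JavaScript any comparison with `NaN` evaluates to `false`, so a `NaN` metric would pass through a `>=` check as "not breached". The sketch below is an assumption about how such a guard could look; `assertValidMetric` and `checkedBreach` are hypothetical names that do not appear in this PR, and the limits are re-declared locally so the snippet stands alone.

```typescript
// Local stand-in for the two limit fields of AlertThreshold.
interface ThresholdLimits {
  maxErrorRatePercent: number;
  maxAverageLatencyMs: number;
}

// Reject values that cannot be real measurements (NaN, Infinity, negatives).
function assertValidMetric(name: string, value: number): void {
  if (!Number.isFinite(value) || value < 0) {
    throw new RangeError(`${name} must be a finite, non-negative number, got ${value}`);
  }
}

// Hypothetical wrapper: validate first, then apply the same >= comparison
// that isThresholdBreached uses. Still synchronous and O(1).
function checkedBreach(limits: ThresholdLimits, errorRatePercent: number, averageLatencyMs: number): boolean {
  assertValidMetric('errorRatePercent', errorRatePercent);
  assertValidMetric('averageLatencyMs', averageLatencyMs);
  if (errorRatePercent > 100) {
    throw new RangeError(`errorRatePercent cannot exceed 100, got ${errorRatePercent}`);
  }
  return errorRatePercent >= limits.maxErrorRatePercent || averageLatencyMs >= limits.maxAverageLatencyMs;
}

const limits: ThresholdLimits = { maxErrorRatePercent: 1.0, maxAverageLatencyMs: 400 };

let rejectedNaN = false;
try {
  checkedBreach(limits, Number.NaN, 120);
} catch (error) {
  rejectedNaN = error instanceof RangeError;
}

console.log(checkedBreach(limits, 0.5, 300), rejectedNaN);
```

Throwing on invalid input is a design choice; treating invalid input as a breach is an alternative when a missed alert is considered worse than a noisy one.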
60 changes: 60 additions & 0 deletions src/operations/service-objectives.test.ts
@@ -0,0 +1,60 @@
import {
  DefaultServiceObjectives,
  DefaultAlertThresholds,
  isThresholdBreached,
  OperationType,
} from './service-objectives';

describe('Service Objectives and Alert Thresholds', () => {
  describe('Default Configuration Validation', () => {
    it('should have valid target success rates (<= 100%)', () => {
      Object.values(DefaultServiceObjectives).forEach((objective) => {
        expect(objective.targetSuccessRatePercent).toBeLessThanOrEqual(100);
        expect(objective.targetSuccessRatePercent).toBeGreaterThan(0);
      });
    });

    it('should have logical latency goals (p95 <= p99)', () => {
      Object.values(DefaultServiceObjectives).forEach((objective) => {
        expect(objective.targetLatencyP95Ms).toBeLessThanOrEqual(objective.targetLatencyP99Ms);
        expect(objective.targetLatencyP95Ms).toBeGreaterThan(0);
      });
    });

    it('should have positive alert thresholds', () => {
      Object.values(DefaultAlertThresholds).forEach((threshold) => {
        expect(threshold.maxErrorRatePercent).toBeGreaterThan(0);
        expect(threshold.maxAverageLatencyMs).toBeGreaterThan(0);
        expect(threshold.evaluationWindowSeconds).toBeGreaterThan(0);
      });
    });
  });

  describe('isThresholdBreached()', () => {
    const mockThreshold = {
      operationType: OperationType.API_REQUEST,
      maxErrorRatePercent: 1.0,
      maxAverageLatencyMs: 500,
      evaluationWindowSeconds: 60,
    };

    it('should return false when metrics are within safe limits', () => {
      expect(isThresholdBreached(mockThreshold, 0.5, 300)).toBe(false);
      expect(isThresholdBreached(mockThreshold, 0.99, 499)).toBe(false);
    });

    it('should return true when error rate breaches the maximum limit', () => {
      expect(isThresholdBreached(mockThreshold, 1.0, 300)).toBe(true);
      expect(isThresholdBreached(mockThreshold, 5.0, 300)).toBe(true);
    });

    it('should return true when average latency breaches the maximum limit', () => {
      expect(isThresholdBreached(mockThreshold, 0.5, 500)).toBe(true);
      expect(isThresholdBreached(mockThreshold, 0.5, 1000)).toBe(true);
    });

    it('should return true when both metrics breach limits', () => {
      expect(isThresholdBreached(mockThreshold, 2.0, 600)).toBe(true);
    });
  });
});
105 changes: 105 additions & 0 deletions src/operations/service-objectives.ts
@@ -0,0 +1,105 @@
/**
 * @title Service Objectives and Alert Thresholds
 * @dev Defines the Service Level Objectives (SLOs) and Service Level Agreements (SLAs) for the backend operations.
 */

export enum OperationType {
  API_REQUEST = 'API_REQUEST',
  DATABASE_QUERY = 'DATABASE_QUERY',
  EXTERNAL_API_CALL = 'EXTERNAL_API_CALL',
}

/**
 * @dev Represents the target metrics for a specific service or operation to ensure high reliability.
 */
export interface ServiceObjective {
  operationType: OperationType;
  /**
   * @dev Target availability/success rate as a percentage (e.g., 99.9). Must be <= 100.
   */
  targetSuccessRatePercent: number;
  /**
   * @dev Maximum acceptable latency in milliseconds for the 95th percentile (p95).
   */
  targetLatencyP95Ms: number;
  /**
   * @dev Maximum acceptable latency in milliseconds for the 99th percentile (p99).
   */
  targetLatencyP99Ms: number;
}

/**
 * @dev Defines conditions under which an alert should be triggered for a specific operation.
 */
export interface AlertThreshold {
  operationType: OperationType;
  /**
   * @dev Trigger alert if error rate percentage reaches or exceeds this value.
   */
  maxErrorRatePercent: number;
  /**
   * @dev Trigger alert if average latency reaches or exceeds this value over the evaluation window.
   */
  maxAverageLatencyMs: number;
  /**
   * @dev The time window in seconds over which the metrics should be evaluated to trigger alerts.
   */
  evaluationWindowSeconds: number;
}

/**
 * @dev Registry of default service objectives for key system operations.
 */
export const DefaultServiceObjectives: Record<string, ServiceObjective> = {
  healthCheck: {
    operationType: OperationType.API_REQUEST,
    targetSuccessRatePercent: 99.99,
    targetLatencyP95Ms: 50,
    targetLatencyP99Ms: 100,
  },
  contractsApi: {
    operationType: OperationType.API_REQUEST,
    targetSuccessRatePercent: 99.9,
    targetLatencyP95Ms: 200,
    targetLatencyP99Ms: 500,
  },
};

/**
 * @dev Registry of default alert thresholds corresponding to the system operations.
 */
export const DefaultAlertThresholds: Record<string, AlertThreshold> = {
  healthCheck: {
    operationType: OperationType.API_REQUEST,
    maxErrorRatePercent: 0.1, // Alert if error rate >= 0.1%
    maxAverageLatencyMs: 150,
    evaluationWindowSeconds: 300, // Evaluate over 5 minutes
  },
  contractsApi: {
    operationType: OperationType.API_REQUEST,
    maxErrorRatePercent: 1.0, // Alert if error rate >= 1.0%
    maxAverageLatencyMs: 400,
    evaluationWindowSeconds: 300,
  },
};

/**
 * @dev Evaluates whether the current metrics breach the defined alert threshold for an operation.
 * A metric that equals its limit counts as a breach.
 * @param threshold The threshold configuration to evaluate against.
 * @param currentErrorRate The observed error rate percentage.
 * @param currentAverageLatencyMs The observed average latency in ms.
 * @returns true if an alert should be triggered, false otherwise.
 */
export function isThresholdBreached(
  threshold: AlertThreshold,
  currentErrorRate: number,
  currentAverageLatencyMs: number
): boolean {
  if (currentErrorRate >= threshold.maxErrorRatePercent) {
    return true;
  }
  if (currentAverageLatencyMs >= threshold.maxAverageLatencyMs) {
    return true;
  }
  return false;
}