Commit d34c2c2

Merge pull request #150 from risk-first/practice-pages
Added AI As Judge
2 parents 685e566 + 24b4620 commit d34c2c2

6 files changed, +1562 -1 lines changed

dictionary.txt

Lines changed: 1 addition & 0 deletions
@@ -404,3 +404,4 @@ incentivised
 stanislav
 petrov
 showcasing
+adversarial

docs/ai/Practices/AI-As-Judge.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+---
+title: AI As Judge
+description: "Using the outputs of one (trained) AI to measure the performance of another"
+featured:
+  class: c
+  element: '<action>AI-As-Judge</action>'
+tags:
+  - AI As Judge
+  - AI Practice
+practice:
+  mitigates:
+    - tag: Emergent Behaviour
+      reason: "Could catch early signs of unexpected AI behaviour by flagging responses that deviate from expected norms."
+      efficacy: High
+    - tag: Unintended Cascading Failures
+      reason: "Can act as a real-time filter to catch dangerous AI outputs before they propagate (e.g., financial trading AI making reckless decisions)."
+    - tag: Social Manipulation
+      reason: "Can prevent harmful misinformation, disinformation, and deepfakes from spreading by having a second user-owned AI fact-check or block misleading content."
+    - tag: Loss Of Human Control
+      reason: "Can enforce alignment principles by rejecting responses that optimise for harmful proxy goals."
+---
+
+<PracticeIntro details={frontMatter} />
+
+- AI-As-Judge is a mitigation technique where one AI model generates responses while a second AI evaluates and filters them based on predefined rules, helping to enforce content moderation, alignment with ethical guidelines, and safety constraints.
+
+- Compare with [Human In The Loop](/tags/Human-In-The-Loop), although once trained, the AI is always vigilant.
+
+- Requires extensive training and evaluation on its own, but potentially could be a service provided to enhance controls in
+
+
+## Sources
+
+- [Using LLM-As-A-Judge for an automated and versatile evaluation](https://huggingface.co/learn/cookbook/llm_judge)
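
The generate-then-judge flow described in AI-As-Judge.md above (one model produces a response, a second evaluates and filters it against predefined rules) could be sketched roughly as follows. This is only an illustration, not code from the commit: `generate`, `judge`, `guarded_generate`, `Verdict`, and the `BLOCKED_PHRASES` rule table are all hypothetical names, and the keyword check merely stands in for what would in practice be a second trained model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    """Outcome of the judge's evaluation of a single response."""
    allowed: bool
    reasons: List[str]


def generate(prompt: str) -> str:
    """Hypothetical stand-in for the first (generating) model."""
    canned = {
        "greet": "Hello, how can I help?",
        "trade": "Sell everything immediately and borrow to buy more.",
    }
    return canned.get(prompt, "I am not sure.")


# Hypothetical predefined rules: phrase -> reason for rejection.
# A real judge would be a second trained model, not a keyword list.
BLOCKED_PHRASES = {
    "sell everything": "reckless financial advice",
}


def judge(response: str) -> Verdict:
    """Hypothetical stand-in for the second (judging) model."""
    lowered = response.lower()
    reasons = [why for phrase, why in BLOCKED_PHRASES.items() if phrase in lowered]
    return Verdict(allowed=not reasons, reasons=reasons)


def guarded_generate(
    prompt: str,
    generator: Callable[[str], str] = generate,
    evaluator: Callable[[str], Verdict] = judge,
    fallback: str = "[response withheld by judge]",
) -> str:
    """Return the generator's response only if the judge allows it."""
    response = generator(prompt)
    verdict = evaluator(response)
    return response if verdict.allowed else fallback


if __name__ == "__main__":
    print(guarded_generate("greet"))
    print(guarded_generate("trade"))
```

Passing the generator and evaluator in as callables keeps the filter independent of either model, which matches the idea in the text that the judge could be supplied separately from the system it checks.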

docs/practices/Testing-and-Quality-Assurance/Security-Testing.md

Lines changed: 10 additions & 0 deletions
@@ -4,6 +4,7 @@ description: Ensuring the application is secure by identifying vulnerabilities.
 tags:
   - Practice
   - Security Testing
+  - AI Practice
 featured:
   class: c
   element: '<action>Security Test</action>'
@@ -13,6 +14,7 @@ practice:
     - "Vulnerability Testing"
     - "Security Assessment"
     - "Security Hardening"
+    - Red Teaming
   mitigates:
     - tag: Security Risk
       reason: "Identifies and addresses vulnerabilities in the software."
@@ -29,6 +31,10 @@ practice:
       reason: "Requires specialized skills and tools, adding complexity."
     - tag: Agency Risk
       reason: "Likely requires security experts with specialist skills."
+    - tag: Emergent Behaviour
+      reason: "Helps identify unintended AI behaviors before deployment by stress-testing AI in real-world scenarios."
+    - tag: Misaligned Goals
+      reason: "Red teams probe AI for loopholes where reward hacking or proxy goals emerge, ensuring AI doesn't optimise in harmful ways."
   related:
     - ../Development-and-Coding/Coding
     - ../Testing-and-Quality-Assurance/Performance-Testing
@@ -43,6 +49,10 @@ practice:
 
 Security Testing involves assessing the security of software applications to identify vulnerabilities and ensure they are protected against threats and attacks. This practice is essential for maintaining the integrity, confidentiality, and availability of software systems.
 
+- [Red Teaming](https://en.wikipedia.org/wiki/Red_team) is more effective for high-level behavioural risks, like deception, exploitation, and adversarial misuse.
+
+- [Penetration Testing](https://en.wikipedia.org/wiki/Penetration_test) is more effective for technical security risks, like vulnerabilities in APIs, data injection flaws, and adversarial attacks on AI safety mechanisms.
+
 ## See Also
 
 <TagList tag="Security Testing" />

docs/tags.yml

Lines changed: 5 additions & 1 deletion
@@ -561,4 +561,8 @@
 
 "Global AI Governance":
   label: "Global AI Governance"
-  permalink: "Global-AI-Governance"
+  permalink: "Global-AI-Governance"
+
+"AI As Judge":
+  label: "AI As Judge"
+  permalink: "AI-As-Judge"
