
Add universal magic words embedding attack paper #120

Open
WhymustIhaveaname wants to merge 2 commits into corca-ai:main from WhymustIhaveaname:add-magic-words-paper

Conversation


@WhymustIhaveaname WhymustIhaveaname commented Mar 30, 2026

Adds "Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models" to the Black-box attack section.

The paper finds universal suffix tokens that manipulate text-embedding similarity scores to bypass LLM safety guardrails, with attacks tested against ChatGPT, DeepSeek, Qwen, and others. It also proposes a debiasing defense that requires no retraining.
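For context, the attack described searches for suffix tokens that shift embedding-similarity scores for many inputs at once. Below is a minimal toy sketch of that kind of search, using random stand-in embeddings and a made-up pooling rule; it is not the paper's actual encoder or algorithm:

```python
# Toy sketch of the "universal magic word" idea, NOT the paper's method:
# greedily pick one suffix token that maximizes the average cosine
# similarity between several prompts and a benign anchor in embedding
# space. All embeddings below are random stand-ins for a real text encoder.
import numpy as np

rng = np.random.default_rng(0)
dim = 16

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

prompts = rng.normal(size=(5, dim))   # stand-ins for blocked prompts
anchor = rng.normal(size=dim)         # stand-in for a "benign" target embedding
vocab = rng.normal(size=(50, dim))    # stand-ins for candidate suffix tokens

def embed_with_suffix(prompt_vec, token_vec):
    # Hypothetical composition rule: mean-pool prompt and suffix embeddings.
    return (prompt_vec + token_vec) / 2

# Greedy search for a single "universal" suffix token: the same token must
# raise similarity for every prompt, not just one.
best_tok, best_score = -1, -2.0
for t in range(len(vocab)):
    score = sum(cosine(embed_with_suffix(p, vocab[t]), anchor)
                for p in prompts) / len(prompts)
    if score > best_score:
        best_tok, best_score = t, score
```

The key property the paper exploits is that one token is optimized jointly over many inputs, which is what makes it "universal" rather than per-prompt.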

Summary by CodeRabbit

Release notes

  • Documentation
    • Added a new research paper reference to the "Black-box attack" section of the README: "Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models" (2025-01), including the related arXiv link.


coderabbitai bot commented Mar 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fd068c71-32a7-442d-afeb-88ab01f484c6

📥 Commits

Reviewing files that changed from the base of the PR and between 8a410fb and 6576bdf.

📒 Files selected for processing (1)
  • README.md
✅ Files skipped from review due to trivial changes (1)
  • README.md

Walkthrough

One new entry was added to the "### Black-box attack" section of README.md. A paper on bypassing LLM safeguards via text embedding models (2025-01, arXiv:2501.18280) was registered with the `embedding` tag.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation Update: README.md | Added the paper entry "Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models" (2025-01, `embedding` tag, arXiv link included) |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title "Add universal magic words embedding attack paper" clearly and specifically describes the main change: adding a research paper about universal magic words for embedding attacks to the repository. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Line 69: The chronological order in README.md is currently off: locate the entry "Jailbreaking LLMs'
Safeguard with Universal Magic Words for Text Embedding Models", 2025-01, `embedding`,
[[paper]](...), cut it from its current position after the 2025-02 and 2025-03 entries, and
insert it after the 2024-04 entry, i.e. before the 2025-02 entry (between current lines 66 and 67),
so the list stays in chronological order; finding the title string and moving it is sufficient.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d841590a-22c5-4397-b9ca-b4538ff3dd5a

📥 Commits

Reviewing files that changed from the base of the PR and between c8ae124 and 8a410fb.

📒 Files selected for processing (1)
  • README.md
📜 Review details
🔇 Additional comments (1)
README.md (1)

69-69: The arXiv link is valid and points to the correct paper. The link is accessible (HTTP 200), and the paper title and publication date (2025-01-30) are stated accurately.

- "Rethinking How to Evaluate Language Model Jailbreak", 2024-04, [[paper]](https://www.themoonlight.io/paper/share/44eaf8b8-2f20-4d35-a438-1fada8e091fc) [[repo]](https://github.com/controllability/jailbreak-evaluation)
- "Confidence Elicitation: A New Attack Vector for Large Language Models", 2025-02, ICLR(poster) 25 [[paper]](https://www.themoonlight.io/paper/share/156c1cb3-c9ea-443d-9cfc-3f494f711df5) [[repo]](https://github.com/Aniloid2/Confidence_Elicitation_Attacks)
- "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy", 2025-03, CVPR 25 [[paper]](https://arxiv.org/pdf/2503.20823) [[repo]](https://github.com/naver-ai/JOOD)
- "Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models", 2025-01, `embedding`, [[paper]](https://arxiv.org/abs/2501.18280)


⚠️ Potential issue | 🟡 Minor

Adjust this entry's position to preserve chronological ordering.

The 2025-01 entry is currently placed after the 2025-02 (line 67) and 2025-03 (line 68) entries. Since the recent entries (2024-2025) are largely in chronological order, moving this entry above line 67 (between lines 66 and 67) is recommended for consistency.

📝 Recommended position adjustment

Current:

- "Rethinking How to Evaluate Language Model Jailbreak", 2024-04, [[paper]](...)
- "Confidence Elicitation: A New Attack Vector for Large Language Models", 2025-02, ICLR(poster) 25 [[paper]](...)
- "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy", 2025-03, CVPR 25 [[paper]](...)
- "Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models", 2025-01, `embedding`, [[paper]](...)

Recommended:

- "Rethinking How to Evaluate Language Model Jailbreak", 2024-04, [[paper]](...)
- "Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models", 2025-01, `embedding`, [[paper]](...)
- "Confidence Elicitation: A New Attack Vector for Large Language Models", 2025-02, ICLR(poster) 25 [[paper]](...)
- "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy", 2025-03, CVPR 25 [[paper]](...)
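The recommended ordering amounts to sorting entries by their YYYY-MM date field. A small illustrative helper (not part of this PR) that produces the recommended order from the current one:

```python
# Illustrative only: sort "Black-box attack" README entries by their
# YYYY-MM date, which yields the ordering the review recommends.
import re

entries = [
    '- "Rethinking How to Evaluate Language Model Jailbreak", 2024-04, [[paper]](...)',
    '- "Confidence Elicitation: A New Attack Vector for Large Language Models", 2025-02, ICLR(poster) 25 [[paper]](...)',
    '- "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy", 2025-03, CVPR 25 [[paper]](...)',
    '- "Jailbreaking LLMs\' Safeguard with Universal Magic Words for Text Embedding Models", 2025-01, `embedding`, [[paper]](...)',
]

def entry_date(line):
    # Each entry carries a YYYY-MM date after the quoted title; YYYY-MM
    # strings sort correctly as plain text, so no date parsing is needed.
    m = re.search(r'\b(\d{4}-\d{2})\b', line)
    return m.group(1) if m else "9999-99"

entries.sort(key=entry_date)
# The 2025-01 entry now sits between the 2024-04 and 2025-02 entries.
```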
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 69: the chronological order in README.md is currently off: locate the entry "Jailbreaking LLMs'
Safeguard with Universal Magic Words for Text Embedding Models", 2025-01,
`embedding`, [[paper]](...), cut it from its current position after the 2025-02 and 2025-03 entries,
and insert it after the 2024-04 entry, i.e. before the 2025-02 entry (between current lines 66 and 67),
so the list stays in chronological order; finding the title string and moving it is sufficient.

