
Add universal magic words embedding attack paper #120

Open
WhymustIhaveaname wants to merge 2 commits into corca-ai:main from WhymustIhaveaname:add-magic-words-paper

Conversation


@WhymustIhaveaname WhymustIhaveaname commented Mar 30, 2026

Adds "Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models" to the Black-box attack section.

The paper finds universal suffix tokens that manipulate text-embedding similarity scores to bypass LLM safety guardrails, with attacks tested against ChatGPT, DeepSeek, Qwen, and others. It also proposes a debiasing defense that requires no retraining.
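For context, the attack described searches for suffix tokens that shift embedding-similarity scores for many inputs at once. Below is a minimal toy sketch of that kind of search, using random stand-in embeddings and a made-up pooling rule; it is not the paper's actual encoder or algorithm:

```python
# Toy sketch of the "universal magic word" idea, NOT the paper's method:
# greedily pick one suffix token that maximizes the average cosine
# similarity between several prompts and a benign anchor in embedding
# space. All embeddings below are random stand-ins for a real text encoder.
import numpy as np

rng = np.random.default_rng(0)
dim = 16

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

prompts = rng.normal(size=(5, dim))   # stand-ins for blocked prompts
anchor = rng.normal(size=dim)         # stand-in for a "benign" target embedding
vocab = rng.normal(size=(50, dim))    # stand-ins for candidate suffix tokens

def embed_with_suffix(prompt_vec, token_vec):
    # Hypothetical composition rule: mean-pool prompt and suffix embeddings.
    return (prompt_vec + token_vec) / 2

# Greedy search for a single "universal" suffix token: the same token must
# raise similarity for every prompt, not just one.
best_tok, best_score = -1, -2.0
for t in range(len(vocab)):
    score = sum(cosine(embed_with_suffix(p, vocab[t]), anchor)
                for p in prompts) / len(prompts)
    if score > best_score:
        best_tok, best_score = t, score
```

The key property the paper exploits is that one token is optimized jointly over many inputs, which is what makes it "universal" rather than per-prompt.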

Summary by CodeRabbit

Release notes

  • Documentation
    • Added a new research paper reference to the "Black-box attack" section of the README: "Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models" (2025-01), including the related arXiv link.


coderabbitai bot commented Mar 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fd068c71-32a7-442d-afeb-88ab01f484c6

📥 Commits

Reviewing files that changed from the base of the PR and between 8a410fb and 6576bdf.

📒 Files selected for processing (1)
  • README.md
✅ Files skipped from review due to trivial changes (1)
  • README.md

Walkthrough

One new entry was added to the "### Black-box attack" section of README.md. A paper on bypassing LLM safeguards via text embedding models (2025-01, arXiv:2501.18280) was registered with the `embedding` tag.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation Update: README.md | Added the paper entry "Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models" (2025-01, `embedding` tag, arXiv link included) |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title "Add universal magic words embedding attack paper" clearly and specifically describes the main change: adding a research paper about universal magic words for embedding attacks to the repository. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Line 69: The chronological order in README.md is currently off: locate the entry "Jailbreaking LLMs'
Safeguard with Universal Magic Words for Text Embedding Models", 2025-01, `embedding`,
[[paper]](...), cut it from its current position after the 2025-02 and 2025-03 entries, and
insert it after the 2024-04 entry, i.e. before the 2025-02 entry (between current lines 66 and 67),
so the list stays in chronological order; finding the title string and moving it is sufficient.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d841590a-22c5-4397-b9ca-b4538ff3dd5a

📥 Commits

Reviewing files that changed from the base of the PR and between c8ae124 and 8a410fb.

📒 Files selected for processing (1)
  • README.md
📜 Review details
🔇 Additional comments (1)
README.md (1)

69-69: The arXiv link is valid and points to the correct paper. The link is accessible (HTTP 200), and the paper title and publication date (2025-01-30) are stated accurately.

- "Rethinking How to Evaluate Language Model Jailbreak", 2024-04, [[paper]](https://www.themoonlight.io/paper/share/44eaf8b8-2f20-4d35-a438-1fada8e091fc) [[repo]](https://github.com/controllability/jailbreak-evaluation)
- "Confidence Elicitation: A New Attack Vector for Large Language Models", 2025-02, ICLR(poster) 25 [[paper]](https://www.themoonlight.io/paper/share/156c1cb3-c9ea-443d-9cfc-3f494f711df5) [[repo]](https://github.com/Aniloid2/Confidence_Elicitation_Attacks)
- "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy", 2025-03, CVPR 25 [[paper]](https://arxiv.org/pdf/2503.20823) [[repo]](https://github.com/naver-ai/JOOD)
- "Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models", 2025-01, `embedding`, [[paper]](https://arxiv.org/abs/2501.18280)


⚠️ Potential issue | 🟡 Minor

Adjust this entry's position to preserve chronological ordering.

The 2025-01 entry is currently placed after the 2025-02 (line 67) and 2025-03 (line 68) entries. Since the recent entries (2024-2025) are largely in chronological order, moving this entry above line 67 (between lines 66 and 67) is recommended for consistency.

📝 Recommended position adjustment

Current:

- "Rethinking How to Evaluate Language Model Jailbreak", 2024-04, [[paper]](...)
- "Confidence Elicitation: A New Attack Vector for Large Language Models", 2025-02, ICLR(poster) 25 [[paper]](...)
- "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy", 2025-03, CVPR 25 [[paper]](...)
- "Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models", 2025-01, `embedding`, [[paper]](...)

Recommended:

- "Rethinking How to Evaluate Language Model Jailbreak", 2024-04, [[paper]](...)
- "Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models", 2025-01, `embedding`, [[paper]](...)
- "Confidence Elicitation: A New Attack Vector for Large Language Models", 2025-02, ICLR(poster) 25 [[paper]](...)
- "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy", 2025-03, CVPR 25 [[paper]](...)
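The recommended ordering amounts to sorting entries by their YYYY-MM date field. A small illustrative helper (not part of this PR) that produces the recommended order from the current one:

```python
# Illustrative only: sort "Black-box attack" README entries by their
# YYYY-MM date, which yields the ordering the review recommends.
import re

entries = [
    '- "Rethinking How to Evaluate Language Model Jailbreak", 2024-04, [[paper]](...)',
    '- "Confidence Elicitation: A New Attack Vector for Large Language Models", 2025-02, ICLR(poster) 25 [[paper]](...)',
    '- "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy", 2025-03, CVPR 25 [[paper]](...)',
    '- "Jailbreaking LLMs\' Safeguard with Universal Magic Words for Text Embedding Models", 2025-01, `embedding`, [[paper]](...)',
]

def entry_date(line):
    # Each entry carries a YYYY-MM date after the quoted title; YYYY-MM
    # strings sort correctly as plain text, so no date parsing is needed.
    m = re.search(r'\b(\d{4}-\d{2})\b', line)
    return m.group(1) if m else "9999-99"

entries.sort(key=entry_date)
# The 2025-01 entry now sits between the 2024-04 and 2025-02 entries.
```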
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 69: the chronological order in README.md is currently off: locate the entry "Jailbreaking LLMs'
Safeguard with Universal Magic Words for Text Embedding Models", 2025-01,
`embedding`, [[paper]](...), cut it from its current position after the 2025-02 and 2025-03 entries,
and insert it after the 2024-04 entry, i.e. before the 2025-02 entry (between current lines 66 and 67),
so the list stays in chronological order; finding the title string and moving it is sufficient.

