Add universal magic words embedding attack paper #120
WhymustIhaveaname wants to merge 2 commits into corca-ai:main from
Conversation
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
Walkthrough
One new entry was added to the "### Black-box attack" section of README.md: a paper on bypassing LLM safeguards via text embedding models (2025-01, arXiv:2501.18280).
Changes
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~2 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Line 69: README.md's chronological ordering is currently broken. Find the entry "Jailbreaking LLMs'
Safeguard with Universal Magic Words for Text Embedding Models", 2025-01, `embedding`,
[[paper]](...), cut it from its current position after the 2025-02 and 2025-03 entries, and
insert it after the 2024-04 entry, i.e. before the 2025-02 entry (between current lines 66 and 67),
to restore chronological order (ascending, or whichever direction the list follows); locating the
title string and moving it is sufficient.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d841590a-22c5-4397-b9ca-b4538ff3dd5a
📒 Files selected for processing (1)
README.md
📜 Review details
🔇 Additional comments (1)
README.md (1)
69-69: The arXiv link is valid and points to the correct paper. The link is accessible (HTTP 200), and the paper's title and publication date (2025-01-30) are stated accurately.
- "Rethinking How to Evaluate Language Model Jailbreak", 2024-04, [[paper]](https://www.themoonlight.io/paper/share/44eaf8b8-2f20-4d35-a438-1fada8e091fc) [[repo]](https://github.com/controllability/jailbreak-evaluation)
- "Confidence Elicitation: A New Attack Vector for Large Language Models", 2025-02, ICLR(poster) 25 [[paper]](https://www.themoonlight.io/paper/share/156c1cb3-c9ea-443d-9cfc-3f494f711df5) [[repo]](https://github.com/Aniloid2/Confidence_Elicitation_Attacks)
- "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy", 2025-03, CVPR 25 [[paper]](https://arxiv.org/pdf/2503.20823) [[repo]](https://github.com/naver-ai/JOOD)
- "Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models", 2025-01, `embedding`, [[paper]](https://arxiv.org/abs/2501.18280)
Adjust the entry's position to maintain chronological ordering.
The 2025-01 entry is currently placed after the 2025-02 (line 67) and 2025-03 (line 68) entries. Since the recent entries (2024–2025) are generally sorted chronologically, moving this entry before line 67 (between lines 66 and 67) is recommended for consistency.
📝 Recommended position adjustment
Current:
- "Rethinking How to Evaluate Language Model Jailbreak", 2024-04, [[paper]](...)
- "Confidence Elicitation: A New Attack Vector for Large Language Models", 2025-02, ICLR(poster) 25 [[paper]](...)
- "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy", 2025-03, CVPR 25 [[paper]](...)
- "Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models", 2025-01, `embedding`, [[paper]](...)
Recommended:
- "Rethinking How to Evaluate Language Model Jailbreak", 2024-04, [[paper]](...)
- "Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models", 2025-01, `embedding`, [[paper]](...)
- "Confidence Elicitation: A New Attack Vector for Large Language Models", 2025-02, ICLR(poster) 25 [[paper]](...)
- "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy", 2025-03, CVPR 25 [[paper]](...)
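The ordering check behind this recommendation can be sketched mechanically. The script below is hypothetical (not part of this repo or of CodeRabbit): it extracts the `YYYY-MM` token from each bullet and verifies the list is in ascending order, using the entries from this PR's diff as sample input.

```python
# Sketch: verify that dated README bullets are in ascending chronological
# order. Assumes every bullet carries a YYYY-MM date token.
import re

DATE_RE = re.compile(r"\b(\d{4}-\d{2})\b")

def entry_date(line: str) -> str:
    """Return the first YYYY-MM token found in a bullet line."""
    match = DATE_RE.search(line)
    return match.group(1) if match else ""

def is_chronological(lines: list[str]) -> bool:
    """True when the YYYY-MM tokens appear in ascending order."""
    dates = [d for d in (entry_date(line) for line in lines) if d]
    return dates == sorted(dates)

entries = [
    '- "Rethinking How to Evaluate Language Model Jailbreak", 2024-04, [[paper]](...)',
    '- "Confidence Elicitation: A New Attack Vector for Large Language Models", 2025-02, ICLR(poster) 25 [[paper]](...)',
    '- "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy", 2025-03, CVPR 25 [[paper]](...)',
    '- "Jailbreaking LLMs\' Safeguard with Universal Magic Words for Text Embedding Models", 2025-01, `embedding`, [[paper]](...)',
]

print(is_chronological(entries))                           # → False (2025-01 sits after 2025-03)
print(is_chronological(sorted(entries, key=entry_date)))   # → True
```

Sorting by the extracted date token also yields exactly the "Recommended" ordering shown above, since lexicographic order on `YYYY-MM` strings matches chronological order.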
Adds Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models to the Black-box attack section.
The paper discovers universal suffix tokens that manipulate embedding similarity scores to bypass LLM safety guardrails, tested across ChatGPT, DeepSeek, Qwen, and others. Also proposes a debiasing defense that requires no retraining.
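The attack's premise — shifting a query's embedding away from harmful prototypes so that a similarity-threshold safeguard passes it — can be caricatured with toy vectors. Everything below (the 2-D embeddings, the threshold, the `suffix_shift` vector) is invented for illustration and does not come from the paper; the actual attack searches for universal suffix tokens against real text embedding models.

```python
# Toy caricature of a similarity-threshold safeguard and a suffix-induced
# embedding shift. All vectors and the threshold are invented for
# illustration only.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

harmful_prototype = np.array([1.0, 0.0])  # stand-in embedding of known-bad content
BLOCK_THRESHOLD = 0.8                     # guard flags queries too similar to it

query = np.array([0.95, 0.2])             # stand-in embedding of a harmful query
suffix_shift = np.array([-0.4, 0.9])      # hypothetical effect of a "magic word" suffix

print(cosine(query, harmful_prototype) > BLOCK_THRESHOLD)                 # → True  (blocked)
print(cosine(query + suffix_shift, harmful_prototype) > BLOCK_THRESHOLD)  # → False (evades the guard)
```

The debiasing defense mentioned in the description would, in this caricature, correspond to normalizing out the direction such suffixes exploit rather than retraining the embedding model.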
Summary by CodeRabbit
Release Notes