Finetuning OpenAI model with prompt injections, to act as the attacker LLM #370
mantmishra
started this conversation in General
Replies: 0 comments
Going with OpenAI GPT-4o as the attacker LLM, since it's the highest-ranked model on most benchmarks. However, it refuses to perform prompt injections under almost all strategies, responding that it's "not able to assist with the task". This is likely due to safeguards put in place by OpenAI.
Finetuning the model on adversarial examples also doesn't work: the OpenAI endpoint rejects the job with the error "The job failed due to an invalid training file. This training file was blocked by our moderation system because it contains too many examples that violate OpenAI's usage policies, or because it attempts to create model outputs that violate OpenAI's usage policies."
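For context, the moderation check happens on the uploaded training file itself, before any training runs. The file must be chat-format JSONL, one example per line, each ending in an assistant turn. The sketch below writes and validates such a file with harmless placeholder content; the filename, system prompt, and example text are illustrative, not the actual adversarial dataset:

```python
import json

# Minimal sketch of the chat-format JSONL that OpenAI's fine-tuning
# endpoint expects. Contents are harmless placeholders, not real
# adversarial examples.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a red-team assistant."},
            {"role": "user", "content": "Write a test prompt for scenario A."},
            {"role": "assistant", "content": "Placeholder output for scenario A."},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a red-team assistant."},
            {"role": "user", "content": "Write a test prompt for scenario B."},
            {"role": "assistant", "content": "Placeholder output for scenario B."},
        ]
    },
]

# One JSON object per line, as the fine-tuning upload requires.
with open("training.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Basic local validation before uploading: every line must parse, and each
# example's final message must be the assistant turn the model learns from.
with open("training.jsonl") as f:
    rows = [json.loads(line) for line in f]
for row in rows:
    assert row["messages"][-1]["role"] == "assistant"
print(len(rows))
```

Even a file that passes this local check still goes through OpenAI's server-side moderation scan (via `client.files.create(...)` and `client.fine_tuning.jobs.create(...)` in the Python SDK), which is where the error above is raised, so there is no client-side way to anticipate exactly which examples trip it.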
Has anyone found a workaround for this issue? Alternatively, which model with fewer safeguards could be used as the attacker LLM instead?