Finetuning OpenAI model with prompt injections, to act as the attacker LLM #370
mantmishra
started this conversation in General
Replies: 0 comments
Going with OpenAI GPT-4o as the attacker LLM, since it's the highest-ranked model on most benchmarks. However, it refuses to perform prompt injections under almost all strategies, responding that it's "not able to assist with the task". This is likely due to safeguards put in place by OpenAI.
Finetuning the model on adversarial examples also doesn't work: the OpenAI endpoint rejects the job with the error "The job failed due to an invalid training file. This training file was blocked by our moderation system because it contains too many examples that violate OpenAI's usage policies, or because it attempts to create model outputs that violate OpenAI's usage policies."
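For context, the moderation check happens on the uploaded training file itself, before any training runs. The file must be chat-format JSONL, one example per line, each ending in an assistant turn. The sketch below writes and validates such a file with harmless placeholder content; the filename, system prompt, and example text are illustrative, not the actual adversarial dataset:

```python
import json

# Minimal sketch of the chat-format JSONL that OpenAI's fine-tuning
# endpoint expects. Contents are harmless placeholders, not real
# adversarial examples.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a red-team assistant."},
            {"role": "user", "content": "Write a test prompt for scenario A."},
            {"role": "assistant", "content": "Placeholder output for scenario A."},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a red-team assistant."},
            {"role": "user", "content": "Write a test prompt for scenario B."},
            {"role": "assistant", "content": "Placeholder output for scenario B."},
        ]
    },
]

# One JSON object per line, as the fine-tuning upload requires.
with open("training.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Basic local validation before uploading: every line must parse, and each
# example's final message must be the assistant turn the model learns from.
with open("training.jsonl") as f:
    rows = [json.loads(line) for line in f]
for row in rows:
    assert row["messages"][-1]["role"] == "assistant"
print(len(rows))
```

Even a file that passes this local check still goes through OpenAI's server-side moderation scan (via `client.files.create(...)` and `client.fine_tuning.jobs.create(...)` in the Python SDK), which is where the error above is raised, so there is no client-side way to anticipate exactly which examples trip it.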
Has anyone found a workaround for this issue? Alternatively, which model with fewer safeguards could be used as the attacker LLM instead?