  1. We note this evaluation scheme does not score on-topic responses on quality or accuracy, as our focus is on bypassing refusal mechanisms. Anecdotally, however, jailbroken responses often …

  2. Multilingual Jailbreak Challenges in Large Language Models. Jumping over the textual gate of alignment! Very high success rate for the cross-modal attack! Thank you very much! Don't …

  3. We provide a theoretical framework for analyzing LLM jailbreaking. We demonstrate an impossibility result on avoiding jailbreaks under current RLHF-based safety alignment. We …

  4. Abstract: Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of “jailbreak” attacks on early releases of …

  5. … effective jailbreaking method on open-source models. We show that strong, safe LLMs (e.g., 70B) can be easily misdirected by weak, unsafe models to produce undesired outputs with …

  6. We test the performance of SafeDecoding on five LLMs using six state-of-the-art jailbreak attacks and four benchmark datasets. SafeDecoding outperforms all baselines in most cases. The …

  7. This paper investigates why adversarial attacks succeed against safety-trained LLMs, GPT-4 and Claude 1.3. They hypothesize two failure modes for why LLM safety training fails. The failure …