Distractor-Based Jailbreaking Attacks in Language Models and Associated Changes in Chain-of-Thought Content
Although not exactly my main area of expertise, I nevertheless had the pleasure of doing some research in AI safety, particularly in the area of jailbreaking attacks on large language models. This originally grew out of a project from the Carnegie AI Safety Initiative club.

The paper has been accepted for publication at AAAI 2026! A camera-ready version will be out soon, and we will be presenting it at the conference. Our abstract is as follows:

We identify a jailbreaking vulnerability in multiple open-source LLMs: by augmenting dangerous requests with certain "distractors" to obfuscate their intent, we elicit specific, actionable responses on a wide variety of harmful topics. We find that such an attack noticeably alters the contents of these models' chains of thought, including changed frequencies of seemingly unrelated n-grams and heightened ethical scrutiny of harmful requests even when the response is ultimately jailbroken.

Also, a big shoutout to my co-author Ningning, as well as Ida Mattson for advising us on this project!