New research reveals large language models struggle to distinguish privileged system instructions from untrusted user input. Models prioritize text style over content, leading to concerning jailbreaks where the LLM can be tricked into violating its safety policies. This "role confusion" can be exploited by slightly altering the input's formatting, significantly increasing attack success rates.
Opening Kapyn…