Even with preference alignment, LLMs can be enticed into harmful behavior via adversarial prompts.
🚨 Breaking: our theoretical findings confirm that LLM alignment is fundamentally limited!
More details on the framework, statistical bounds, and defense results 👇🏻