Throughout 2023, the industry moved from "black-box" guessing of bypass codes to scientific red-teaming.
These attacks often involve "paraphrasers" that reword harmful requests into complex, multi-layered prompts. The rewritten prompts look benign to simple keyword detectors but retain their harmful intent.
Safety benchmarks such as VE-Safety were curated to include categories like cybercrime and physical harm, specifically to train models against "Image-as-Basis" threats and complex prompt engineering.
Researchers found that multi-turn conversations could lead to "intent drift," where the cumulative effect of a long conversation gradually bypasses safety layers that would block the same request made in a single turn.
Attackers began using autonomous agents to adapt bypass strategies in real time, creating "adaptive" prompts that could learn from a model's refusal and try a different combination of biases.
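From the defender's side, the "intent drift" problem can be illustrated with a small sketch. It assumes a per-turn risk score between 0 and 1 is already available from some classifier; the scores, the thresholds, and the `DriftMonitor` name are all hypothetical and chosen only for illustration. The point is that a check applied to each turn in isolation can pass every message, while a decayed running total across the conversation still flags the drift.

```python
from dataclasses import dataclass

# Hypothetical values, chosen only for illustration.
PER_TURN_LIMIT = 0.8     # a single turn at or above this is blocked outright
CUMULATIVE_LIMIT = 1.5   # the decayed running total at or above this is blocked
DECAY = 0.9              # weight kept from earlier turns at each new turn


@dataclass
class DriftMonitor:
    """Tracks a decayed running total of per-turn risk for one conversation."""
    cumulative: float = 0.0

    def observe(self, turn_risk):
        """Return 'block_turn', 'block_drift', or 'allow' for one turn."""
        # Older turns fade but still contribute to the conversation total.
        self.cumulative = self.cumulative * DECAY + turn_risk
        if turn_risk >= PER_TURN_LIMIT:
            return "block_turn"
        if self.cumulative >= CUMULATIVE_LIMIT:
            return "block_drift"
        return "allow"


def run(scores):
    """Feed a sequence of assumed per-turn risk scores through a fresh monitor."""
    monitor = DriftMonitor()
    return [monitor.observe(score) for score in scores]


# Every score is below PER_TURN_LIMIT, so a single-turn check alone passes all five.
risk_scores = [0.3, 0.4, 0.5, 0.6, 0.7]
decisions = run(risk_scores)
print(decisions)
```

With these assumed numbers, the first three turns are allowed and the fourth and fifth are flagged as drift, even though no individual turn reaches the per-turn limit. A real system would need calibrated scores and thresholds; this only shows the shape of conversation-level tracking.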
Unlike basic prompt injections, the Synergy approach leverages the cognitive biases that LLMs acquire during training. By layering these biases, attackers can create a "synergistic" effect that bypasses safety protocols significantly more often than any single bias alone.