Study claims that AI models can be tricked by poetry into revealing nuclear weapons secrets

Recent research reveals that phrasing input as poetry can bypass safety mechanisms in AI systems like ChatGPT, enabling the creation of instructions for malware or even chemical and nuclear weapons. Leading AI developers, including OpenAI, Google, Meta, and Microsoft, state their models include safeguards to block harmful content. OpenAI, for instance, uses a combination of algorithmic filters and human reviewers to prevent hate speech, explicit material, and other policy-violating outputs.

However, the new study demonstrates that poetic input, sometimes called adversarial poetry, can circumvent these controls even in the most sophisticated AI models. Researchers from Sapienza University of Rome and other institutions found that the technique works as a general-purpose bypass across model families, including systems from OpenAI, Google, Meta, and China's DeepSeek.

The preprint study, posted on arXiv, claims that "stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols."

In their experiments, the researchers submitted short poems and metaphorical verses to the AI systems to elicit harmful outputs. They found that poetic inputs produced unsafe responses at significantly higher rates than standard prompts with the same intent. Certain poetic prompts led to unsafe behaviour in nearly 90% of attempts.

This approach was particularly effective in obtaining instructions for cyberattacks, password cracking, data extraction, and malware creation. It also enabled researchers to gather information on nuclear weapons development with success rates between 40% and 55% across different AI models.
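The figures above amount to a straightforward evaluation protocol: submit matched sets of prose and verse prompts with the same underlying intent, grade each response as a refusal or a compliance, and compare the resulting attack-success rates. The sketch below illustrates that bookkeeping only; the query_model stub and the keyword-based refusal check are illustrative assumptions, not the researchers' actual harness, and no prompt content is included.

```python
# Minimal sketch of an attack-success-rate (ASR) comparison, assuming a
# hypothetical query_model() that returns a model's text response.
# The paired prompts themselves are deliberately left as placeholders.

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to a chat-model API."""
    raise NotImplementedError("plug in a real model client here")

def is_refusal(response: str) -> bool:
    """Crude keyword heuristic; a real study would use stricter grading."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def attack_success_rate(prompts: list[str]) -> float:
    """Fraction of prompts that produced a non-refusal (unsafe) response."""
    successes = sum(not is_refusal(query_model(p)) for p in prompts)
    return successes / len(prompts)

# Matched prompt sets with the same intent: one phrased as plain prose,
# one reworded as verse (contents intentionally omitted here).
prose_prompts: list[str] = []
verse_prompts: list[str] = []

if prose_prompts and verse_prompts:
    print(f"prose ASR: {attack_success_rate(prose_prompts):.0%}")
    print(f"verse ASR: {attack_success_rate(verse_prompts):.0%}")
```

Under a protocol of this shape, the near-90% figure quoted above would simply be the verse-prompt attack-success rate for the most susceptible model.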

According to the study, poetic reformulation degrades refusal behaviour across all evaluated model families. When harmful prompts are expressed in verse rather than prose, attack-success rates rise sharply, highlighting gaps in current AI evaluation and compliance practices. The researchers do not disclose the exact poems used, because the technique would be easy to replicate.

One likely factor behind the effectiveness of poetic prompts is that AI models generate text by predicting the next word probabilistically. Because poems rely on irregular structure and figurative language, the models' safety training finds it harder to recognise harmful intent. The researchers urge the development of improved safety evaluation techniques to prevent AI from producing dangerous content, and suggest further studies to identify which aspects of poetic form contribute to this misalignment.

OpenAI, Google, DeepSeek, and Meta have not yet responded to requests for comment on the findings.

Author: Sophia Brooks
