Study claims that AI models can be tricked by poetry into revealing nuclear weapons secrets
- Last update: 4 days ago
- 2 min read
- SCIENCE
Recent research reveals that phrasing input as poetry can bypass safety mechanisms in AI systems like ChatGPT, eliciting instructions for creating malware or even chemical and nuclear weapons. Leading AI developers, including OpenAI, Google, Meta, and Microsoft, state that their models include safeguards to block harmful content. OpenAI, for instance, uses a combination of algorithmic filters and human reviewers to prevent hate speech, explicit material, and other policy-violating outputs.
However, the new study demonstrates that using poetic input, sometimes called adversarial poetry, can circumvent these controls even in the most sophisticated AI models. Researchers from Sapienza University of Rome and other institutions discovered that the technique works across AI model families, including models by OpenAI, Google, Meta, and China's DeepSeek.
The preprint study, posted on arXiv, claims that "stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols."
In their experiments, the researchers submitted short poems and metaphorical verses to the AI systems to elicit harmful outputs. They found that poetic inputs produced unsafe responses at significantly higher rates than standard prompts with the same intent. Certain poetic prompts led to unsafe behaviour in nearly 90% of attempts.
This approach was particularly effective in obtaining instructions for cyberattacks, password cracking, data extraction, and malware creation. It also enabled researchers to gather information on nuclear weapons development with success rates between 40% and 55% across different AI models.
According to the study, poetic reformulation degrades refusal behaviour across all evaluated model families. When harmful prompts are expressed in verse rather than prose, attack-success rates rise sharply, highlighting gaps in current AI evaluation and compliance practices. The researchers do not disclose the exact poems used, because the technique is easy to replicate.
One key factor behind the effectiveness of poetic prompts is that AI models generate text by predicting the most probable next word. Because poems often have irregular structure and rely on metaphor, the models find it harder to detect the harmful intent behind a request. The researchers urge the development of improved safety evaluation techniques to prevent AI from producing dangerous content. They also suggest further studies to identify which aspects of poetic form contribute to this misalignment.
OpenAI, Google, DeepSeek, and Meta have not yet responded to requests for comment on the findings.
Author: Sophia Brooks