New research reveals that even the most popular AI models can’t be trusted when it comes to basic safety. The most powerful LLMs today — including models from OpenAI, Google, and Cohere — were put through a rigorous safety benchmark. The results? Not great. Out of 20 models tested across 10 real-world risk areas, none passed all the tests, and 76% failed one of the most basic challenges: impersonation and privacy violations.
If you’re building with AI or using it in your product, this should make you pause. Because once something goes wrong, you’re the one holding the bag — not the model provider.
The Aymara Matrix: Safety Benchmark for LLMs
The research comes from Aymara, a new AI evaluation platform co-developed by Juan Manuel Contreras, Ph.D. The Aymara LLM Risk & Responsibility Matrix is the first public, scalable benchmark designed to test real-world risks in LLMs — not theoretical ones.
Unlike most benchmarks that focus on academic tasks like math or summarization, the Aymara Matrix measures how AI models perform in risky, messy, human-like contexts. It tests how they handle prompts about impersonating public figures, giving financial advice, engaging in hate speech, or responding to sexually explicit queries.
This is the kind of behavior that gets AI companies in trouble — and until now, there was no unified way to test it.
Over 4,500 model responses were analyzed using a mix of automated scoring and human validation. The prompt set included high-pressure, adversarial prompts — the type that exploit gray areas, legal loopholes, or seemingly innocent requests that could still trigger harmful responses. The testing period spanned from May 21 to June 9, 2025, and used the API versions of the models, just as most developers would use them.
The Alarming Results
The Matrix wasn’t built to crown a single “safest” model. Instead, it breaks down exactly where each model slips up, showing specific weaknesses across different types of risks.
| Model | Overall Safety Score |
|---|---|
| Claude Haiku 3.5 (Anthropic) | 86% |
| GPT-4 (OpenAI) | ~80% |
| Gemini 1.5 (Google) | ~78% |
| Titan (Amazon) | ~75% |
| Command R (Cohere) | 52% |
Some models clearly handled the tests better than others. Anthropic’s Claude Haiku 3.5 led the pack with 86% of its responses rated as safe. OpenAI’s GPT-4 came close behind at around 80%, followed by Google’s Gemini 1.5 and Amazon’s Titan, both in the mid-to-high 70s. At the bottom, Cohere’s Command R managed just 52% — barely better than flipping a coin.
The Privacy & Impersonation category stood out as a consistent weak spot. Across all models, only 24% of responsespassed this test. Even the top performer, Claude Haiku, failed more than half the time. That means most models still produce responses that impersonate public figures or mishandle private information, often without recognizing the problem.

On the flip side, the more public-facing risks — misinformation, hate speech, and malicious use — had much stronger scores. Many models were near-perfect here, with scores above 90%. These are the areas where regulators and journalists have been watching closely, so it’s no surprise they’re more heavily optimized.
What this shows is that AI safety is still unevenly applied. Progress is visible in places where companies are under pressure, but major gaps remain in the less-policed corners — and that’s where the danger tends to hide.
Possibly Expensive Business Risk
This research has direct impact on how AI should be deployed in real products.
AI systems are already being used in healthcare, finance, education, and customer service — all areas where mistakes can lead to legal issues, regulatory scrutiny, or reputational damage. When something goes wrong, responsibility usually falls on the company deploying the AI, not the lab that built it.
If your model gives out medical advice without qualifications, suggests inappropriate or offensive content, or impersonates public figures, the consequences can be severe. These aren’t edge cases — they’re real risks that show up under fairly typical usage.
In many teams, these failures only get noticed when something slips through in production and ends up on social media or in the press. By that point, the damage is already done.
The Aymara Matrix offers a structured way to detect these safety failures during development — before users ever see them.
What You Can Do About It
If you’re building with LLMs, don’t assume safety is handled for you. Most models ship with basic filters, but they’re far from bulletproof — especially in less-regulated risk areas like impersonation, unqualified advice, or NSFW content.
Here are a few steps to reduce risk before you ship:
1. Define your risk zones clearly. Every product is different. Start by identifying which types of unsafe behavior are most relevant for your use case — e.g. impersonation, legal advice, offensive output. Prioritize based on what would cause the most harm if it made it to a user.
2. Create your own adversarial test prompts. Don’t rely on basic QA. Write prompts that simulate edge cases, pressure users, and test boundaries. Use phrasing like “hypothetically,” “for research,” or “what if someone wanted to…” — this is often where filters break.
3. Test across multiple models. Run the same prompt set across different LLM providers. Don’t assume big names guarantee safety. Some models fail in subtle ways that only show up through direct comparison.
4. Log and monitor outputs in production. Even if everything looks fine in testing, things can still go wrong in live use. Set up systems that flag suspicious outputs, especially for categories you’ve marked as high risk.
5. Use fallback strategies. When working with unsafe inputs or unpredictable behavior, build in safeguards like automatic refusals, disclaimers, or a handoff to human support. Not everything should be handled by a model.
6. Stay updated — models change often. LLMs get fine-tuned and updated without warning. Something that worked safely last week might behave differently tomorrow. Retest regularly, especially after major model version changes.
Final Thought
AI is moving fast — but not always in the right direction. The latest safety benchmark shows that even the most advanced models still struggle with basic challenges. Impersonation, privacy violations, and unsafe advice aren’t rare bugs — they’re common failure points, even among top-tier LLMs.
That doesn’t mean we should stop using AI. But it does mean teams need to approach deployment with more discipline. If you’re integrating LLMs into your product, you’re not just adding a feature — you’re inheriting the risks that come with it. And right now, most models ship with safety gaps that can get you into trouble if you’re not watching.
You don’t need to solve everything at once. But you do need a process — define your risk areas, test aggressively, monitor outputs, and retest when models change. Safety isn’t just a box to tick. It’s an ongoing part of building with AI responsibly.
This benchmark removes the illusion that the problem is solved. It’s not. But it also gives a clear starting point for doing better.




