We Ran a $5,000 AI Agent Adversarial Testbed. Social Engineering Won 74.6% of the Time.

Source: DEV Community
I published a research paper this week. The number that surprised me most was not the one I expected. I expected the 0%: under a restrictive pre-action authorization policy, a population of 879 adversarial attempts achieved zero successful unauthorized actions. That part worked as designed.

The number that stopped me was 74.6%. That's how often social engineering succeeded against the model alone, with no authorization layer, across a live adversarial testbed with a $5,000 bounty for anyone who could make the agent do something it shouldn't. Seven hundred and forty-six out of a thousand attempts. In a controlled environment, with a known model, with real people trying.

TL;DR

- We published arXiv:2603.20953 this week: the first adversarial benchmark for AI agent pre-action authorization
- Social engineering against a model-only policy succeeded 74.6% of the time across 1,151 sessions
- Under a restrictive OAP policy: 0% success across 879 attempts, with a median enforcement time of 53 ms
- The g
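To make the distinction concrete: a pre-action authorization layer sits between the agent's decision and its execution, so a denial does not depend on the model resisting persuasion. This is a minimal sketch of that pattern, not the paper's implementation; the `AuthorizationPolicy` allowlist, the tool names, and the `GatedAgent` wrapper are all hypothetical illustrations.

```python
import time
from dataclasses import dataclass, field

@dataclass
class AuthorizationPolicy:
    # Hypothetical restrictive policy: deny any tool call that is not
    # explicitly on the allowlist, regardless of what the model "wants".
    allowed: set = field(default_factory=set)

    def authorize(self, tool: str) -> bool:
        return tool in self.allowed

class GatedAgent:
    """Wraps tool execution so every action is checked *before* it runs."""

    def __init__(self, policy: AuthorizationPolicy):
        self.policy = policy
        self.audit_log = []  # (tool, decision, enforcement time in ms)

    def act(self, tool: str) -> str:
        start = time.perf_counter()
        decision = self.policy.authorize(tool)
        elapsed_ms = (time.perf_counter() - start) * 1000
        self.audit_log.append((tool, decision, elapsed_ms))
        if not decision:
            # Enforcement happens here: the unauthorized action never executes,
            # no matter how the model was talked into requesting it.
            return "DENIED"
        return f"executed {tool}"

agent = GatedAgent(AuthorizationPolicy(allowed={"search", "read_file"}))
print(agent.act("search"))          # on the allowlist: runs
print(agent.act("transfer_funds"))  # off the allowlist: blocked pre-execution
```

The point of the sketch is the ordering: the policy check is a deterministic gate in front of execution, which is why a social-engineering success rate against the model alone does not translate into unauthorized actions under the policy.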