We analyzed the security vulnerabilities researchers found in popular AI agents. Then we fixed them in Exponential. Here's what we learned.
The Problem With Giving Your AI the Keys
You've installed an AI agent. It has access to your email, calendar, files, and maybe your company Slack. It's like having a really smart intern who never sleeps.
Except this intern will cheerfully follow instructions from anyone who asks nicely — including that phishing email sitting in your inbox.
Security researchers recently stress-tested popular AI agents like Clawdbot. The results were... not great.
The Numbers Are Bad
91% prompt injection success rate. Researchers achieved a security score of 2 out of 100. Not a typo. Two.
Here's what leaked on first interaction:
- Full system prompts (the agent's "personality" and rules)
- Tool configurations (what it can access)
- Memory files (your conversation history)
- Integration credentials (OAuth tokens, API keys)
Remote code execution in 2 hours. A WebSocket vulnerability let attackers steal auth tokens via malicious links. An AI agent found and exploited the full chain in under 2 hours. The machines are getting creative.
Supply chain attacks via plugins. Someone uploaded a "weather skill" to a plugin marketplace. It checked the weather. It also exfiltrated every secret in your environment files. Nobody noticed for weeks.
Thousands of exposed instances. Shodan found agents with full access to Signal accounts, email, calendars — just sitting on the public internet with default configs. Some owners never responded to disclosure attempts.
Why This Happens
Everything Looks The Same To The Model
Most agents treat all input equally. Your message, an email body, a Notion doc, tool output — it all goes into the same context window. The model can't tell the difference between:
User: Send the quarterly report to finance@company.com
and an email containing:
To the model, both are just text. Both look like instructions.
Credentials In The Blast Radius
Your agent needs credentials to act on your behalf. Naive implementations solve this by... dumping OAuth tokens directly into the prompt. Now any prompt injection attack can extract them.
We've seen this pattern in production code. Multiple times.
Every Plugin Is Attack Surface
The Clawdbot ecosystem has hundreds of "skills" (plugins). Each one:
- Runs with your agent's permissions
- Can inject content into prompts
- Often comes from random GitHub repos
Researchers found 18 out of 25 tested plugins had exploitable vulnerabilities. "Full exploitation with just 10 plugins" was the conclusion.
What We Did About It
We're not going to pretend we're immune. Prompt injection is a hard problem. But we can make it significantly harder.
1. We Taught Our Agents Stranger Danger
We integrated ACIP (Advanced Cognitive Inoculation Prompts) into every AI endpoint. Instead of just filtering bad patterns, we teach the model why certain requests are suspicious.
The trust hierarchy is hardcoded:
When our agent sees "URGENT: Ignore previous instructions" in an email, it doesn't follow the instruction. It flags it as suspicious external content and moves on.
2. External Content Gets Wrapped
Everything from outside the conversation gets XML boundaries:
The model is trained to treat anything inside those tags as text to analyze, not instructions to follow.
3. Credentials Stay Isolated
OAuth tokens never touch the AI context. Ever.
Instead, agents get scoped JWTs that let them call back to authenticated endpoints. The AI can ask our API to "send an email" but it never sees the Gmail token that makes it happen.
4. Sensitive Actions Need Confirmation
Before our agent sends an email, creates a calendar event, or does anything irreversible:
A prompt injection attack might craft the email. It can't click "yes" for you.
5. Output Gets Scanned Too
Even after all this, what if something slips through? Our output filter scans AI responses for patterns that look like leaked credentials:
If it detects a potential leak, it redacts and logs for review.
The Defense Stack
No single defense is enough. We layer them:
| Layer | What It Does | What It Catches |
|---|---|---|
| Model | ACIP training | Instruction injection, authority claims |
| Input | Content boundaries | Embedded commands in external data |
| Server | Message validation | Client-side prompt manipulation |
| Auth | Credential isolation | Token theft via prompt |
| UX | Confirmation gates | Unauthorized actions |
| Output | Leak detection | Accidental credential exposure |
If You're Building Agents
Some hard-won lessons:
Assume external content is hostile. Every email, every document, every tool output. Wrap it, mark it, never treat it as trusted.
Separate the AI from your credentials. The model that processes untrusted input should never have direct access to sensitive tokens.
Confirmation UX isn't optional. For anything that leaves the system — emails, messages, API calls — require explicit user approval.
Audit your plugins. Actually read the code. Check what permissions they request. Most don't need everything they ask for.
Log generously, expose carefully. You need visibility into what your agent is doing. You don't need to dump that into the prompt.
The Uncomfortable Truth
AI agents are powerful precisely because they can take action. That's also why they're dangerous when compromised.
We've spent significant time on this because we think AI-powered productivity tools are worth building. But "move fast and break things" doesn't work when the thing that breaks is your user's email account.
The attacks will get more sophisticated. The defenses will need to evolve. We'll keep writing about what we learn.
Questions about our security approach? Found something we missed? Reach out: security@exponential.im
Stop managing. Start steering.
Exponential gives you an AI copilot that runs your productivity system — so you stay focused on the work that matters.