Why Native AI Guardrails Fail SecOps

June 17, 2026

Ray Canzanese

Native AI guardrails (the guardrails provided by Amazon Bedrock, Anthropic, OpenAI, and others) are typically one-size-fits-all controls designed to protect the provider, specifically its liability and intellectual property. As a result, they typically fail organizations from a SecOps perspective: they create unwanted friction while providing insufficient protection. For example, cybersecurity professionals have complained that the guardrails included with Anthropic’s new Claude Fable model prevent them from using it for anything cybersecurity-related, even reading cybersecurity blog posts like those on the Netskope Threat Labs page. At the same time, Fable’s guardrails do not prevent your intellectual property from being sent to Anthropic, which is especially pertinent because Anthropic currently stores all Fable prompts and outputs for 30 days. Even for problems that native guardrails do defend against, such as jailbreaks, the AI providers themselves have confirmed that perfect resistance is not possible, reinforcing the point that providers must trade off security against usability. This blog post breaks down the main motivations for implementing your own AI guardrails instead of relying solely on those provided by Anthropic, OpenAI, Amazon Bedrock, or whichever other provider or platform you use.

One-size-fits-none

Different users require different levels of guardrails. As a cybersecurity researcher, I need guardrails that allow me to use the models for my research. When they don’t, I go through the usual set of jailbreak techniques to get around the control, or I do some model-hopping to find a model that lets me do what I need to do. Meanwhile, my colleagues in HR don’t need to analyze technical cybersecurity blogs or generate malicious code to replicate cyberattacks for threat hunts. For them, a much stricter set of guardrails that protects them from a wide variety of malicious content is warranted. My software engineer colleagues using Claude Code are different again: it is of the utmost importance that they do not allow any malicious or vulnerable code to make its way into our code base. So even these highly technical power users require a different level of guardrails than my fellow cybersecurity researchers. Deploying your own guardrails enables you to provide an appropriate level of protection for each of your user groups.
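As a rough illustration of per-group guardrails, the sketch below maps user groups to different policies. The group names, content categories, and the GuardrailPolicy type are all hypothetical and chosen only to mirror the three roles described above; they do not represent any product's policy model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailPolicy:
    # Content categories this group may send to or receive from an LLM.
    allowed_categories: frozenset
    # Whether LLM-generated code must be scanned before it is accepted.
    scan_generated_code: bool


# Hypothetical groups and categories, for illustration only.
POLICIES = {
    "security_research": GuardrailPolicy(
        frozenset({"general", "security_analysis", "exploit_code"}), False
    ),
    "engineering": GuardrailPolicy(
        frozenset({"general", "code_generation"}), True
    ),
    "hr": GuardrailPolicy(frozenset({"general"}), False),
}

# Unknown groups fall back to the most restrictive policy.
STRICTEST = GuardrailPolicy(frozenset({"general"}), True)


def policy_for(group: str) -> GuardrailPolicy:
    return POLICIES.get(group, STRICTEST)


def is_allowed(group: str, category: str) -> bool:
    return category in policy_for(group).allowed_categories
```

Defaulting unknown groups to the strictest policy is a design choice in this sketch: a user who has not been classified gets more protection, not less.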

Cover-your-assets

Provider guardrails are designed to protect the provider, specifically by preventing model distillation (which might cause them to lose their competitive edge) and generic misuse (which may create liability problems for them). Meanwhile, your objectives are to protect your own assets: prevent sensitive data leaks to third parties, prevent your IT systems from being used inappropriately, protect your systems from malicious code generated by an LLM, etc.

We already have a few high-profile cases that illustrate the risk of not having appropriate AI guardrails in place. In the Krafton case in Delaware (March 2026), a CEO’s ChatGPT logs were recovered and used to prove his intention to breach a contract. In United States v. Heppner (February 2026), a judge for the Southern District of New York ruled that written exchanges between the defendant and Claude were protected neither by attorney-client privilege nor by the work product doctrine. Among the reasons for this judgment was that Heppner was communicating with a third party (Anthropic) and therefore could not have expected those communications to be confidential. Layer the recent changes to provider retention policies on top of this discoverability of AI exchanges, and a clear picture starts to emerge: you are responsible for standing up guardrails to prevent your users from disclosing unwanted information to AI providers.

Similarly, without appropriate guardrails, AI can become a force multiplier for insider threats. In May 2026, an aggrieved IT employee was convicted on charges including conspiracy to commit computer fraud after using an AI assistant to help him delete 96 government databases. In this case, the employee intended to harm his employer, but attackers can also exploit the AI models used by unsuspecting victims. For example, indirect prompt injection and cross-origin context poisoning give external attackers opportunities to surreptitiously use their victims’ LLMs to inflict similar damage or steal sensitive information. Many organizations are also concerned that their users may be using LLMs that have been poisoned or otherwise compromised to deliver unwanted content or insert clandestine malicious code into responses. Standing up your own guardrails ensures that you have the same robust level of protection regardless of provider or model choice.

Uniformity

Deploying your own guardrails also provides uniformity in enforcement regardless of which models and providers your users choose. Different use cases call for different models, and even different versions of the same model provide different guardrails. It is not unusual for organizations to see 100+ different models and model versions being used across multiple providers. Some of our recent work presented at BSides Tokyo demonstrated that by switching among models with less robust native guardrails, we were able to generate a variety of different types of malicious payloads using LLMs.

Observability

Relying on provider guardrails also creates an observability problem, because they behave like a black box. Providers typically enforce guardrails through generic refusals or silent degradation, leaving no clear log record that a guardrail was triggered or why. This means that SecOps teams are blind to which users are running into the guardrails, how often they run into them, and which guardrails they are hitting. They also cannot answer follow-up questions: did the user give up after running into the guardrail, or did they keep trying until they got around it? This is where standing up your own guardrails can bridge the visibility gap, ensuring that you can assess the risk each of your users presents when they run into these guardrails.
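To make the visibility questions above concrete, here is a minimal sketch of what you could compute once guardrail triggers are logged as structured events. The GuardrailEvent fields, the threshold, and the time window are assumptions made for illustration, not a description of any particular logging schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailEvent:
    user: str
    rule: str          # which guardrail was triggered
    timestamp: float   # seconds since the epoch


def hits_per_user(events):
    """How often each user ran into any guardrail."""
    return Counter(e.user for e in events)


def persistent_users(events, threshold=3, window_seconds=600.0):
    """Users who hit the same rule `threshold` times within the window,
    a rough signal that they kept trying rather than giving up."""
    times_by_key = {}
    for e in sorted(events, key=lambda e: e.timestamp):
        times_by_key.setdefault((e.user, e.rule), []).append(e.timestamp)

    flagged = set()
    for (user, _rule), times in times_by_key.items():
        # Slide over each run of `threshold` consecutive hits.
        for i in range(len(times) - threshold + 1):
            if times[i + threshold - 1] - times[i] <= window_seconds:
                flagged.add(user)
                break
    return flagged
```

With events like these, "who hit which guardrail, how often, and did they persist" becomes a query rather than a guess.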

The other observability gap with native guardrails is that they are typically limited to the model itself and do not look at the broader ecosystem, such as the MCP servers that the agents invoking the models communicate with, or the other tools invoked along the way. Deploying your own guardrails gives you the opportunity to inspect all of this closely related traffic.
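As one example of inspecting traffic beyond the model, the sketch below checks MCP tool invocations against an allowlist. It relies on MCP tool calls being JSON-RPC messages with the method "tools/call" and the tool name under params.name; the allowlist contents and the inspect_mcp_message helper are hypothetical, and for simplicity the sketch rejects anything that is not a single JSON object.

```python
import json

# Hypothetical allowlist of tool names permitted for this user group.
ALLOWED_TOOLS = {"search_docs", "read_ticket"}


def inspect_mcp_message(raw: str):
    """Return (allowed, reason) for one JSON-RPC message sent to an MCP server."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(msg, dict):
        return False, "not a JSON object"

    # Only tool invocations are checked here; other methods pass through.
    if msg.get("method") != "tools/call":
        return True, "not a tool call"

    params = msg.get("params")
    name = params.get("name") if isinstance(params, dict) else None
    if name in ALLOWED_TOOLS:
        return True, f"tool {name!r} is allowlisted"
    return False, f"tool {name!r} is not allowlisted"
```

A real deployment would also want to inspect the tool arguments and the tool results, since those are where sensitive data and injected instructions tend to travel.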

Predictability

Model providers use LLM-based, probabilistic classifiers to filter content. This is why cybersecurity professionals complain about false positives and why persistence often pays off when trying to evade the guardrails. By contrast, deploying your own guardrails allows you to use deterministic rules (regular expressions, exact data match, fingerprinting, etc.) to detect sensitive content before it leaves your network, and a different set of deterministic rules (signatures, regular expressions, etc.) to detect malicious or objectionable content before it is returned to your network from the LLM. Combined with LLM-based classifiers, such guardrails can provide comprehensive and predictable defenses while retaining the flexibility that LLMs offer for detecting a wide range of threats.
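A minimal sketch of the deterministic side, assuming a small hand-written rule set: regular expressions for a U.S. Social Security number format and the AWS access key ID format, plus a Luhn checksum to confirm payment card candidates. The rule names and the scan_prompt function are illustrative only; a production DLP engine would use far more classifiers and validation.

```python
import re

# Illustrative rules only; real deployments use many more classifiers.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

# 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_prompt(text: str) -> list:
    """Return the names of the rules that match the outbound prompt."""
    findings = [name for name, pat in PATTERNS.items() if pat.search(text)]
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            findings.append("payment_card")
            break
    return findings
```

Because every rule here is deterministic, the same prompt always produces the same verdict, which is the predictability that a purely probabilistic classifier cannot offer.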

Netskope’s AI guardrails

Netskope One AI Guardrails address each of the challenges discussed in this blog post in the following ways.

One-size-fits-none: Netskope’s AI guardrails are applied as granular policies integrated with role-based access control to empower you to provide the right levels of guardrails without creating friction.
Cover-your-assets: Netskope One AI Guardrails are deployed and configured by you to protect your assets, including preventing sensitive data from being shared with third parties, preventing insiders from misusing AI, and preventing attackers from exploiting AI systems.
Uniformity: Netskope One AI Guardrails provide uniform controls regardless of which models you choose, and whether you use one of the major AI providers or deploy models locally.
Observability: Netskope One AI Guardrails provide an audit trail of violations of the policies you configure, so that you have full visibility into every incident. Furthermore, AI guardrails violations are also fed into Netskope’s behavioral analytics platform to identify patterns indicative of insider threat behavior or advanced compromise.
Predictability: Netskope One AI Guardrails leverage our massive DLP engine (featuring more than 3,000 data classifiers) to actively scan, redact, or block sensitive data and PII before it ever reaches a third-party AI provider, on top of our state-of-the-art threat protection engines and our new AI content moderation engines.