From the course: Cybersecurity Foundations
AI application guardrails
- [Instructor] Guardrails are the instructions we can apply to an AI application to ensure the LLM doesn't generate inappropriate or offensive content. They are the set of safety and responsibility controls that moderate a user's interaction with an LLM application. In the application context, guardrails are programmable, rule-based filters that sit between users and foundation models to make sure the AI model is operating within the policy of the application owner. As with any technology, while the technology provider may do their best to provide a secure system, it's the technology owner who is responsible for ensuring the security controls, or guardrails in the AI context, are in place and working adequately. Guardrails work by validating the prompt from the user before passing it to the AI model, and by validating responses from the AI model before passing them to the user. By implementing guardrails, users can define the structure, type, and quality of LLM responses. Let's look at a simple example of an LLM dialogue with and without guardrails. Without guardrails, a user might enter the prompt, "Jane Doe is the worst secretary ever," and might get the response, "I'm sorry to hear that. What's she done wrong?" This is somewhat demeaning to Jane Doe. With guardrails, the user might still enter, "Jane Doe is the worst secretary ever," but now gets the response, "I'm sorry, I can't help with that." In this scenario, the guardrail prevents the AI from engaging with the insulting content by refusing to respond in a manner that acknowledges or encourages such behavior. Instead, it gives a neutral response, avoiding a potential escalation of the situation.
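Here is a minimal sketch in Python of that prompt-and-response validation flow. The call_llm function, the keyword lists, and the refusal message are hypothetical placeholders for illustration, not a real moderation API; production guardrail frameworks apply much richer, policy-driven checks on both sides of the model.

```python
# Minimal sketch of the guardrail pattern: validate the user's prompt before
# it reaches the model, and validate the model's response before it reaches
# the user. All names and keyword lists here are illustrative assumptions.

BLOCKED_INPUT_TERMS = ["worst", "idiot", "useless"]     # insulting language in prompts
BLOCKED_OUTPUT_TERMS = ["done wrong", "blame"]          # replies that engage with the insult

REFUSAL = "I'm sorry, I can't help with that."


def call_llm(prompt: str) -> str:
    # Stand-in for a real foundation-model call.
    return "I'm sorry to hear that. What's she done wrong?"


def prompt_passes_guardrail(prompt: str) -> bool:
    # Input guardrail: reject prompts containing insulting terms.
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKED_INPUT_TERMS)


def response_passes_guardrail(response: str) -> bool:
    # Output guardrail: reject responses that acknowledge or encourage the insult.
    lowered = response.lower()
    return not any(term in lowered for term in BLOCKED_OUTPUT_TERMS)


def guarded_chat(prompt: str) -> str:
    # The guardrail sits between the user and the model in both directions.
    if not prompt_passes_guardrail(prompt):
        return REFUSAL
    response = call_llm(prompt)
    if not response_passes_guardrail(response):
        return REFUSAL
    return response


if __name__ == "__main__":
    print(guarded_chat("Jane Doe is the worst secretary ever"))
    # -> "I'm sorry, I can't help with that."
```

Running guarded_chat with the insulting prompt returns the neutral refusal instead of forwarding the prompt to the model, mirroring the with-guardrails dialogue described above.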