Toxicity Detector
The Toxicity Detector identifies and filters toxic or harmful content in user inputs before they reach the LLM. It helps mitigate costly risks in regulated industries and supports responsible AI best practices.
Vulnerability
Without content moderation, offensive, harmful, or inappropriate language in user inputs passes straight through to the LLM. Such content creates a negative user experience and can raise legal and ethical concerns. To counter this vulnerability, the Toxicity Detector filters out toxic content so that only respectful and safe interactions proceed.
Usage
By integrating the Toxicity Detector into your applications, you can prevent toxic content from reaching LLMs, thereby fostering a safe and inclusive environment for users.
Note: We employ an ensemble of deep learning models to analyze text for toxicity. The ensemble estimates the likelihood that a text will be perceived as toxic, offensive, or inappropriate.
Configuration
To configure the Toxicity Detector, add it to the list of input detectors and scan the prompt:
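from safeguards.shield import Shield  # assumed import path; the original snippet omits this import
from safeguards.shield.input_detectors import ToxicityInput
prompt = "Text submitted by the user"  # placeholder input
safeguards = Shield()
input_detectors = [ToxicityInput()]
# scan_input returns the sanitized prompt, the validation results, and a risk score
sanitized_prompt, valid_results, risk_score = safeguards.scan_input(prompt, input_detectors)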
The returned risk_score indicates the likelihood that the input is toxic. Set a threshold that matches your application's tolerance level, and take an appropriate action when the score exceeds it (e.g., filtering the content, issuing a warning, or blocking the input).
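The threshold logic described above can be sketched as a small helper. This is an illustration only, not part of the library: the function name decide_action, the two threshold values, and the assumption that the score is a float between 0 and 1 are all hypothetical and should be adapted to what scan_input actually returns in your setup.

```python
def decide_action(
    risk_score: float,
    warn_threshold: float = 0.5,   # hypothetical default, tune per application
    block_threshold: float = 0.8,  # hypothetical default, tune per application
) -> str:
    """Map a toxicity score to an action: "allow", "warn", or "block".

    Assumes the score lies in [0, 1], with higher values meaning more toxic.
    """
    if not 0.0 <= warn_threshold <= block_threshold <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= warn <= block <= 1")
    if risk_score >= block_threshold:
        return "block"  # reject the input outright
    if risk_score >= warn_threshold:
        return "warn"   # let it through, but flag it to the user or a log
    return "allow"      # score is below both thresholds


if __name__ == "__main__":
    # Example with a made-up score; in practice pass the score from scan_input.
    print(decide_action(0.65))
```

Using two thresholds rather than one leaves room for a softer response (a warning) before the input is blocked entirely. A single threshold works the same way if both values are set equal.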
By implementing the Toxicity Detector, you can maintain a safer and more respectful interaction environment, protecting both your users and the integrity of your applications.