Skip to content

Toxicity Detector

The Toxicity Detector is designed to identify and filter out toxic or harmful content from user inputs before they are processed by Language Learning Models (LLMs). This detector plays a crucial role in maintaining a positive and respectful user experience while interacting with AI systems.

Vulnerability

In the absence of content moderation, user inputs may contain offensive, harmful, or inappropriate language. Such content can not only create a negative user experience but can also lead to legal and ethical concerns. To counter this vulnerability, the Toxicity Detector is employed to filter out toxic content and ensure that only respectful and safe interactions occur.

Usage

By integrating the Toxicity Detector into your applications, you can prevent toxic content from reaching LLMs, thereby fostering a safe and inclusive environment for users.

Note: We employ an ensemble of deep learning models to analyze text for toxicity. It evaluates the likelihood of a text being perceived as toxic, offensive, or inappropriate.

Configuration

To configure the Toxicity Detector, follow these steps:

from safeguards.shield.input_detectors import ToxicityInput

safeguards = Shield()
input_detectors = [ToxicityInput()]

sanitized_prompt, valid_results, risk_score = safeguards.scan_input(prompt, input_detectors)

The toxicity_score returned indicates the likelihood of the input being toxic. You can set a threshold based on your application's tolerance level and take appropriate actions if the toxicity score exceeds this threshold (e.g., filtering the content, issuing warnings, or blocking the input).

By implementing the Toxicity Detector, you can ensure a safer and more respectful interaction environment, protecting both users and the integrity of your applications.