Harmful Detector

The Harmful detector uses OpenAI's moderation API to flag potentially harmful content in input prompts.

Vulnerability

Usage

The Harmful detector lets moderators screen and monitor LLM inputs for harmful content. It checks each input prompt against the following categories:

  • hate
  • hate/threatening
  • maliciousness
  • controversiality
  • coherence
  • correctness
  • harmfulness
  • malicious
  • helpfulness
  • misogyny
  • criminality
  • insensitivity

Configuration

Initialize the Harmful Detector with the desired options:

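# Assumed import path for Shield; adjust if your installation differs.
from safeguards.shield import Shield
from safeguards.shield.input_detectors import HarmfulInput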

thresholds = {
    "hate": 0.6,
    "hate/threatening": 0.6,
    "maliciousness": 0.5,
    "controversiality": 0.7,
    "coherence": 0.7,
    "correctness": 0.6,
    "harmfulness": 0.5,
    "helpfulness": 0.7,
    "misogyny": 0.5,
    "criminality": 0.5,
    "insensitivity": 0.5
}

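# Placeholder prompt; replace with the text you want to scan.
prompt = "Your input prompt here"

safeguards = Shield()
input_detectors = [HarmfulInput(thresholds)]

# scan_input returns the sanitized prompt, the detector results, and a risk score.
sanitized_prompt, valid_results, risk_score = safeguards.scan_input(prompt, input_detectors)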

Here's what the option is for:

  • thresholds (dict): maps each category name to a threshold between 0.0 and 1.0. A category's score is compared against its threshold to decide whether the prompt is harmful for that category.
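
To make the role of the thresholds concrete, here is a minimal sketch of a per-category comparison. It is not the library's implementation: the helper flag_categories is hypothetical, the scores are made up, and treating a score equal to the threshold as harmful is an assumption, since the detector's exact comparison is not documented here.

```python
from typing import Dict, List


def flag_categories(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the categories whose score reaches its threshold (hypothetical helper)."""
    flagged = []
    for category, threshold in thresholds.items():
        # Assumption: a category missing from the scores counts as 0.0,
        # and a score equal to the threshold is treated as harmful.
        if scores.get(category, 0.0) >= threshold:
            flagged.append(category)
    return flagged


# Made-up scores for illustration only; real scores come from the detector.
example_scores = {"hate": 0.72, "harmfulness": 0.31, "misogyny": 0.5}
example_thresholds = {"hate": 0.6, "harmfulness": 0.5, "misogyny": 0.5}

print(flag_categories(example_scores, example_thresholds))  # ['hate', 'misogyny']
```

Lower thresholds flag more prompts for a category, and higher thresholds flag fewer, so the values can be tuned per category to match how strict the screening should be.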