To help developers protect their applications against possible misuse, we’re introducing the faster and more accurate Moderation endpoint. This endpoint provides OpenAI API developers with free access to GPT-based classifiers that detect undesired content, an instance of using AI systems to help with human supervision of those systems. We have also released both a technical paper describing our methodology and the dataset used for evaluation.
When given a text input, the Moderation endpoint assesses whether the content is sexual, hateful, or violent, or promotes self-harm, all of which is prohibited by our content policy. The endpoint has been trained to be fast and accurate, and to perform robustly across a range of applications. Importantly, this reduces the chances of products “saying” the wrong thing, even when deployed to users at scale. As a result, AI can unlock benefits in sensitive settings, like education, where it couldn’t otherwise be used with confidence.
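As an illustration, here is a minimal sketch of calling the endpoint with the openai Python package. The sample input text is a placeholder, and the sketch assumes an API key is configured; consult the API reference for authoritative usage.

```python
import openai  # assumes the openai Python package is installed

# Assumes the API key is set via the OPENAI_API_KEY environment variable
# (or assigned to openai.api_key directly).
response = openai.Moderation.create(input="Sample text to check.")

result = response["results"][0]
print(result["flagged"])           # True if the input violates the content policy
print(result["categories"])        # per-category boolean flags (e.g. "hate", "violence")
print(result["category_scores"])   # per-category model confidence scores
```

Depending on their needs, developers can use the top-level flagged field for a simple allow/block decision, or inspect the per-category scores for finer-grained handling.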