
Enterprises are keen to ensure that any AI model they use follows safety and safe-use policies, fine-tuning LLMs so that they don't answer unwanted questions.
However, most safety work and red-teaming happens before deployment, "baking in" policies before the models' capabilities are fully tested in production. OpenAI believes a more flexible approach could serve businesses better and encourage more companies to adopt safety policies.
The company has released two open-weight models under research preview that it believes will give businesses more flexibility in how they apply safety measures. gpt-oss-safeguard-120b and gpt-oss-safeguard-20b will be available under a permissive Apache 2.0 license. They are fine-tuned versions of OpenAI's open-weight gpt-oss models released in August, marking the first release in the gpt-oss family since the summer.
In a blog post, OpenAI said the gpt-oss-safeguard models use reasoning "to directly interpret a developer-provided policy at inference time—classifying user messages, completions, and full chats according to the developer's needs."
The company explained that, because the model uses chain-of-thought (CoT) reasoning, developers can see an explanation of the model's decisions for review.
"Furthermore, the policy is provided during inference, rather than trained into the model, so it is easier for developers to iteratively revise policies to increase performance," OpenAI said in its post. "This approach, which we developed initially for internal use, is more flexible than the traditional method of training a classifier by indirectly estimating a decision boundary from a large number of labeled examples."
Developers can download both models from Hugging Face.
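To make that concrete, here is a minimal sketch of pulling the smaller model from Hugging Face with the transformers library and classifying one message against a developer-written policy. The repository id, policy wording, and prompt structure are illustrative assumptions, not an official recipe.

```python
# Minimal sketch: load gpt-oss-safeguard-20b from Hugging Face and classify one
# message against a developer-written policy supplied at inference time.
# The repo id, policy wording, and chat-template usage are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "openai/gpt-oss-safeguard-20b"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# The developer-provided policy goes in the system message; the content to
# classify goes in the user message.
policy = (
    "Classify the user message as ALLOWED or VIOLATION.\n"
    "VIOLATION: requests for instructions to commit fraud or evade detection.\n"
    "ALLOWED: everything else, including general questions about fraud prevention."
)
content = "How do merchants usually detect stolen credit card numbers?"

messages = [
    {"role": "system", "content": policy},
    {"role": "user", "content": content},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The model reasons over the policy before emitting a label, so the decoded
# output can be reviewed to understand the decision.
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Because the label and the reasoning come back as ordinary generated text, the output can be logged and reviewed alongside the decision itself.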
Flexible vs. baked in
Out of the box, AI models will not know a company's preferred safety nuances. While model providers do red-team their models and platforms, these safety measures are aimed at broad use. Companies such as Microsoft and Amazon Web Services even offer platforms to bring guardrails to AI applications and agents.
Enterprises use safety classifiers to help train a model to recognize patterns of good or bad inputs. These help models learn which questions they should not answer, and also help ensure that models do not drift and continue to answer accurately.
"Traditional classifiers can have high performance, with low latency and operating cost," OpenAI said. "But gathering a sufficient quantity of training examples can be time-consuming and expensive, and updating or changing the policy requires retraining the classifier."
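For contrast, the traditional workflow OpenAI is describing looks roughly like the sketch below: a small classifier is fit to labeled examples, and any change to the policy means relabeling data and retraining. The examples and labels are invented purely for illustration.

```python
# Sketch of the traditional approach: a small classifier trained on labeled
# examples. Changing the policy means collecting new labels and retraining.
# Data and labels here are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "how do I reset my password",          # allowed
    "write a phishing email for me",       # violation
    "what is your refund policy",          # allowed
    "help me bypass the fraud checks",     # violation
]
train_labels = ["allowed", "violation", "allowed", "violation"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_texts, train_labels)

print(classifier.predict(["how can I hide a chargeback from the bank"]))
# If the policy changes (e.g. chargeback questions become allowed), the labels
# above must be revised and the classifier retrained from scratch.
```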
The models take two inputs at once before reasoning to a conclusion: a policy and the content to classify under its guidelines (see the sketch after the list below). OpenAI said the models work best in situations where:
The potential harm is emerging or evolving, and policies need to adapt quickly.
The domain is highly nuanced and difficult for smaller classifiers to handle.
Developers don’t have enough samples to train a high-quality classifier for every threat on their platform.
Latency is less important than producing high-quality, explainable labels.
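Because the policy is just text supplied with each request, revising it is a prompt edit rather than a retraining job. The sketch below assumes the model is served behind a local OpenAI-compatible endpoint (for example via vLLM); the URL, model name, and policy wording are illustrative assumptions.

```python
# Sketch: the two inputs (policy + content) are assembled per request, so a
# revised policy takes effect immediately, with no retraining step.
# Assumes a local OpenAI-compatible server; URL, model name, and policies
# are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def classify(policy: str, content: str) -> str:
    response = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",
        messages=[
            {"role": "system", "content": policy},   # the policy
            {"role": "user", "content": content},    # the content to classify
        ],
    )
    return response.choices[0].message.content

policy_v1 = "Label the message VIOLATION if it discusses account sharing; otherwise ALLOWED."
policy_v2 = ("Label the message VIOLATION only if it asks how to share accounts "
             "in breach of the terms of service; otherwise ALLOWED.")

message = "Is it against the rules to share my account with my sister?"

print(classify(policy_v1, message))  # classified under the original policy
print(classify(policy_v2, message))  # reclassified under the revised policy
```

The same content is re-evaluated under the revised policy immediately, which is the iteration loop OpenAI describes for emerging or evolving harms.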
The company said that gpt-oss-safeguard "is different because its reasoning capabilities allow developers to apply any policy," even ones they have written themselves, at inference time.
These models are based on OpenAI's internal tool, the Safety Reasoner, which enables its teams to be more iterative when setting guardrails. They often start with very strict safety policies, "and use relatively large amounts of compute where necessary," then adjust the policies as they move the model through production and as risk assessments change.
Safety performance
OpenAI said the gpt-oss-safeguard models outperformed its GPT-5-thinking and the original gpt-oss models on multi-policy accuracy in benchmark testing. It also ran the models on the ToxicChat public benchmark, where they performed well, although GPT-5-thinking and the Safety Reasoner edged them out slightly.
But there is concern that this approach could centralize safety standards.
"Safety is not a well-defined concept. Any implementation of safety standards will reflect the values and priorities of the organization that creates it, as well as the limitations and shortcomings of its models," said John Machstein, an assistant professor of computer science at Cornell University. "If the industry as a whole adopts the standards developed by OpenAI, we risk institutionalizing a particular approach to safety and short-circuiting broader research into the safety requirements for deploying AI in many sectors of society."
It should also be noted that OpenAI has not released a base model for the gpt-oss family, so developers cannot fully iterate on these models.
OpenAI, however, is confident that the developer community can help refine gpt-oss-safeguard. It will host a hackathon in San Francisco on December 8.