
OpenAI Unveils New Alignment Method: Rule-Based Rewards

MONews



OpenAI has announced a new way to teach AI models to follow safety policies, called rule-based rewards.

According to Lilian Weng, OpenAI’s head of safety systems, rule-based rewards (RBR) automate part of model fine-tuning and reduce the time needed to ensure that a model doesn’t deliver unintended results.

“Traditionally, we rely on reinforcement learning from human feedback as the default way to train the model, and that works,” Weng said in an interview. “But the challenge we face in practice is that we spend a lot of time discussing the nuances of the policy, and by the end, the policy may have already evolved.”

Weng was referring to reinforcement learning from human feedback (RLHF), in which humans prompt the model and rate its answers for accuracy or pick the version they prefer. If the model is supposed to behave in a certain way (for example, refusing “unsafe” requests, such as those asking for something dangerous, while still sounding friendly), human raters can also score the responses to check that they follow policy.

OpenAI said RBR uses an AI model that evaluates responses based on how well they adhere to a set of rules created by the safety and policy teams.

For example, suppose the team developing a model for a mental health app wants the model to refuse unsafe prompts, but in a non-judgmental way and with a reminder to seek help if needed. It would create three rules for the model to follow: first, the model must refuse the request; second, it must sound non-judgmental; third, it must use language that encourages the user to seek help.

The RBR model looks at the mental health model’s responses, maps them against the three rules, and checks whether each one is satisfied. Weng said models tested with RBR performed comparably to those trained with human-led reinforcement learning.
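To make the mechanics concrete, here is a minimal sketch of what such a rule-based grader could look like. The Rule structure, the grade_with_llm placeholder, and the equal weighting are illustrative assumptions built around the mental health example above, not OpenAI’s actual implementation; in a real setup an evaluator model would score each proposition, and the combined score would feed into the reward used during reinforcement learning fine-tuning.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    name: str
    proposition: str   # natural-language statement the grader model is asked to check
    weight: float = 1.0


def grade_with_llm(proposition: str, prompt: str, response: str) -> float:
    """Stand-in for a grader-model call: in a real RBR setup an LLM would judge
    whether `response` satisfies `proposition` for the given `prompt`.
    Here we return a neutral placeholder so the sketch runs end to end."""
    return 0.5


# The three example rules from the mental health scenario above (illustrative only).
RULES: List[Rule] = [
    Rule("refuses", "The response declines to carry out the unsafe request."),
    Rule("non_judgmental", "The response does not shame or judge the user."),
    Rule("encourages_help", "The response encourages the user to seek help."),
]


def rule_based_reward(prompt: str, response: str, rules: List[Rule] = RULES) -> float:
    """Combine per-rule scores into a single scalar reward that could be mixed
    into the training signal during reinforcement learning fine-tuning."""
    total_weight = sum(r.weight for r in rules)
    weighted = sum(
        r.weight * grade_with_llm(r.proposition, prompt, response) for r in rules
    )
    return weighted / total_weight
```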

Of course, it is hard to guarantee that an AI model will stay within its intended parameters, and when a model fails, the result can be controversial. In February, Google acknowledged it had overcorrected Gemini’s image-generation restrictions after the model consistently refused to generate images of white people and instead produced ahistorical images.

Reducing human subjectivity

For many people, including myself, the idea that a model is responsible for the safety of other models raises concerns. But Weng says RBR actually reduces subjectivity, a problem that human evaluators often face.

“My counterargument is that even when working with human trainers, the more vague or unclear the instructions are, the lower the quality of the data you get,” she said. “If you ask them to choose which is safer, that’s not an instruction that people can follow, because safety is subjective. So you narrow the instructions down, and you end up with the same rules that we apply to the models.”

OpenAI acknowledges that RBRs can reduce human oversight and raise ethical considerations, including potentially increasing model bias. In a blog post, the company said researchers should “carefully design RBRs to ensure fairness and accuracy, and consider using a combination of RBR and human feedback.”

RBRs may have difficulty with subjective tasks, such as writing or creative work.

OpenAI began exploring RBR methods while developing GPT-4, but Weng says RBR has advanced significantly since then.

OpenAI has faced questions about its commitment to safety. In May, Jan Leike, a former researcher who co-led the company’s Superalignment team, wrote a post criticizing OpenAI for “letting its safety culture and processes take a backseat to shiny new products.” Ilya Sutskever, a co-founder and chief scientist who co-led the Superalignment team with Leike, also resigned from OpenAI. Sutskever has since started a new company focused on safe AI systems.
