OpenAI has announced the first results from its superalignment team, the firm’s in-house initiative dedicated to stopping a superintelligence—a hypothetical future computer that can outsmart humans—from going rogue.
Unlike many of the company’s announcements, this one heralds no big breakthrough. In a low-key research paper, the team describes a technique that lets a less powerful large language model supervise a more powerful one—and suggests that this might be a small step toward figuring out how humans might supervise superhuman machines.
Less than a month after OpenAI was rocked by a crisis when its CEO, Sam Altman, was fired by its oversight board (in an apparent coup led by chief scientist Ilya Sutskever) and then reinstated three days later, the message is clear: it’s back to business as usual.
Yet OpenAI’s business is not usual. Many researchers still question whether machines will ever match human intelligence, let alone outmatch it. OpenAI’s team takes machines’ eventual superiority as given. “AI progress in the last few years has been just extraordinarily rapid,” says Leopold Aschenbrenner, a researcher on the superalignment team. “We’ve been crushing all the benchmarks, and that progress is continuing unabated.”
For Aschenbrenner and others at the company, models with human-like abilities are just around the corner. “But it won’t stop there,” he says. “We’re going to have superhuman models, models that are much smarter than us. And that presents fundamental new technical challenges.”
In July, Sutskever and fellow OpenAI scientist Jan Leike set up the superalignment team to address those challenges. “I’m doing it for my own self-interest,” Sutskever told MIT Technology Review in September. “It’s obviously important that any superintelligence anyone builds doesn’t go rogue. Obviously.”
Amid speculation that Altman was fired for playing fast and loose with his company’s approach to AI safety, Sutskever’s superalignment team loomed behind the headlines. Many have been waiting to see exactly what it has been up to.
Dos and don’ts
The question the team wants to answer is how to rein in, or “align,” hypothetical future models that are far smarter than we are, known as superhuman models. Alignment means making sure a model does what you want it to do and doesn’t do what you don’t want it to do. Superalignment applies this idea to superhuman models.
One of the most widespread techniques used to align existing models is called reinforcement learning via human feedback. In a nutshell, human testers score a model’s responses, upvoting behavior that they want to see and downvoting behavior they don’t. This feedback is then used to train the model to produce only the kind of responses that human testers liked. This technique is a big part of what makes ChatGPT so engaging.
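In spirit, that feedback loop looks something like the sketch below: human raters’ pairwise preferences train a small reward model, whose scores then indicate which responses the system should be steered toward. Everything in it (the toy embeddings, the RewardModel class, the training data) is a made-up stand-in for illustration, not OpenAI’s implementation.

```python
# Toy illustration of the reinforcement-learning-from-human-feedback idea:
# human raters prefer one response over another, a small reward model is
# trained on those comparisons, and its scores can then be used to push a
# model toward upvoted behavior. All names and data here are hypothetical.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response embedding to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

# Pretend embeddings of (preferred, rejected) response pairs from human raters.
dim, n_pairs = 16, 256
preferred = torch.randn(n_pairs, dim) + 0.5   # upvoted responses
rejected = torch.randn(n_pairs, dim) - 0.5    # downvoted responses

model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(200):
    # Bradley-Terry-style loss: the preferred response should score higher.
    loss = -torch.nn.functional.logsigmoid(model(preferred) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained reward model can now rank fresh candidate responses; in full RLHF
# its scores would become the reward signal for fine-tuning the language model.
candidates = torch.randn(5, dim)
print(model(candidates).argsort(descending=True))
```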
The problem is that it requires humans to be able to tell what is and isn’t desirable behavior in the first place. But a superhuman model—the idea goes—might do things that a human tester can’t understand and thus wouldn’t be able to score. (It might even try to hide its true behavior from humans, Sutskever told us.)
The researchers point out that the problem is hard to study because superhuman machines don’t exist. So they used stand-ins. Instead of looking at how humans could supervise superhuman machines, they looked at how GPT-2, a model that OpenAI released five years ago, could supervise GPT-4, OpenAI’s latest and most powerful model. “If you can do that, it might be evidence that you can use similar techniques to have humans supervise superhuman models,” says Collin Burns, another researcher on the superalignment team.
The team took GPT-2 and trained it to perform a handful of different tasks, including a set of chess puzzles and 22 common natural-language-processing tests that assess inference, sentiment analysis, and so on. They used GPT-2’s responses to those tests and puzzles to train GPT-4 to perform the same tasks. It’s as if a 12th grader were taught how to do a task by a third grader. The trick was to do it without GPT-4 taking too big a hit in performance.
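As a rough analogy for that setup (not the team’s actual code or models), one can train a high-capacity classifier on labels produced by a much weaker one and compare it with the same classifier trained on ground truth. The sketch below uses scikit-learn models and synthetic data as hypothetical stand-ins for GPT-2 and GPT-4.

```python
# Toy analogy for weak-to-strong supervision: a small "weak" model labels the
# training data, a higher-capacity "strong" model learns from those noisy
# labels, and we compare it against the same strong model trained on ground
# truth. Models and data are hypothetical stand-ins, not GPT-2 and GPT-4.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=4000, n_features=40, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# "Weak supervisor": a deliberately limited model trained on very little data.
weak = LogisticRegression(max_iter=200).fit(X_train[:200], y_train[:200])
weak_labels = weak.predict(X_train)  # its best guesses become the training targets

# "Strong student": a higher-capacity model trained on the weak labels...
strong_from_weak = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
strong_from_weak.fit(X_train, weak_labels)

# ...versus the same strong model trained on correct answers (its ceiling).
strong_ceiling = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
strong_ceiling.fit(X_train, y_train)

for name, clf in [("weak supervisor", weak),
                  ("strong trained on weak labels", strong_from_weak),
                  ("strong trained on ground truth", strong_ceiling)]:
    print(f"{name}: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
```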
The results were mixed. The team measured the gap in performance between GPT-4 trained on GPT-2’s best guesses and GPT-4 trained on correct answers. They found that GPT-4 trained by GPT-2 performed 20% to 70% better than GPT-2 on the language tasks but did less well on the chess puzzles.
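A natural way to summarize results like these is to ask what fraction of the gap between the weak supervisor and the strong model’s ceiling the weakly supervised student recovers. The function below is a hedged sketch of that bookkeeping, with illustrative numbers rather than figures from the paper.

```python
def performance_gap_recovered(weak: float, strong_from_weak: float, strong_ceiling: float) -> float:
    """Fraction of the gap between the weak supervisor and the strong model's
    ceiling that the weakly supervised strong model recovers: 1.0 means no hit
    in performance, 0.0 means it did no better than its weak teacher."""
    return (strong_from_weak - weak) / (strong_ceiling - weak)

# Illustrative numbers only (not from the paper): a weak model at 60% accuracy,
# a weakly supervised strong model at 75%, and a ceiling of 85%.
print(performance_gap_recovered(0.60, 0.75, 0.85))  # ~0.6 of the gap recovered
```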
The fact that GPT-4 outdid its teacher at all is impressive, says team member Pavel Izmailov: “This is a really surprising and positive result.” But it fell far short of what it could do by itself, he says. They conclude that the approach is promising but needs more work.
“It’s an interesting idea,” says Thilo Hagendorff, an AI researcher at the University of Stuttgart in Germany who works on alignment. But he thinks that GPT-2 might be too dumb to be a good teacher. “GPT-2 tends to give nonsensical responses to any task that’s slightly complex or requires reasoning,” he says. Hagendorff would like to know what would happen if GPT-3 were used instead.
He also notes that this approach doesn’t address Sutskever’s hypothetical scenario in which a superintelligence hides its true behavior and pretends to be aligned when it isn’t. “Future superhuman models will likely possess emergent abilities that are unknown to researchers,” says Hagendorff. “How can alignment work in these cases?”
But it is easy to point out shortcomings, he says. He is pleased to see OpenAI moving from speculation to experiment: “I applaud OpenAI for their effort.”
OpenAI now wants to recruit others to its cause. Alongside this research update, the company announced a new $10 million pot of money that it plans to use to fund people working on superalignment. It will offer grants of up to $2 million to university labs, nonprofits, and individual researchers, and one-year fellowships of $150,000 to graduate students. “We’re really excited about this,” says Aschenbrenner. “We really think there’s a lot that new researchers can contribute.”