
Our approach to alignment research


There is currently no known indefinitely scalable solution to the alignment problem. As AI progress continues, we expect to encounter a number of new alignment problems that we don’t yet observe in current systems. Some of these problems we anticipate now, and some will be entirely new.

We believe that finding an indefinitely scalable solution is likely very difficult. Instead, we aim for a more pragmatic approach: building and aligning a system that can make faster and better alignment research progress than humans can.

As we make progress on this, our AI systems can take over more and more of our alignment work and ultimately conceive, implement, study, and develop better alignment techniques than we have now. They will work together with humans to ensure that their own successors are more aligned with humans.

We believe that evaluating alignment research is substantially easier than producing it, especially when provided with evaluation assistance. Therefore human researchers will focus more and more of their effort on reviewing alignment research done by AI systems instead of generating this research themselves. Our goal is to train models to be so aligned that we can off-load almost all of the cognitive labor required for alignment research.

Importantly, we only need “narrower” AI systems that have human-level capabilities in the relevant domains to do as well as humans on alignment research. We expect these AI systems to be easier to align than general-purpose systems or systems much smarter than humans.

Language models are particularly well-suited for automating alignment research because they come “preloaded” with a lot of knowledge and information about human values from reading the internet. Out of the box, they aren’t independent agents and thus don’t pursue their own goals in the world. They don’t need unrestricted access to the internet to do alignment research, and yet many alignment research tasks can be phrased as natural language or coding tasks.
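To make the last point concrete, here is a minimal sketch in Python of how one such task might be phrased as a natural language task. Everything here is a hypothetical illustration: `query_model` is a stand-in for whatever completion API is available, and the prompt and example proposal are invented, not part of any system described in this post.

```python
# A minimal sketch of phrasing an alignment research task as a natural
# language task. `query_model` is a hypothetical stand-in; it does not
# refer to any specific library or API.

def query_model(prompt: str) -> str:
    """Hypothetical call to a text-completion model."""
    # In practice this would send `prompt` to a language model and
    # return its completion; here it just returns a placeholder.
    return "<model completion would appear here>"

# An evaluation-assistance task: ask the model to critique a proposed
# alignment technique rather than produce one from scratch.
proposal = (
    "Technique: train a reward model from pairwise human preference "
    "comparisons, then fine-tune the policy against it with RL."
)

prompt = (
    "You are assisting a human reviewer of alignment research.\n"
    "List plausible failure modes of the following proposal, "
    "such as reward hacking or distribution shift:\n\n" + proposal
)

if __name__ == "__main__":
    print(query_model(prompt))
```

Framing the task as a critique rather than open-ended generation reflects the division of labor described above: the model produces candidate analysis, and the human reviews it.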

Future versions of WebGPT, InstructGPT, and Codex can provide a foundation as alignment research assistants, but they aren’t sufficiently capable yet. While we don’t know when our models will be capable enough to meaningfully contribute to alignment research, we think it’s important to get started ahead of time. Once we train a model that could be useful, we plan to make it accessible to the external alignment research community.
