We have trained a model to realize a brand new state-of-the-art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) as a substitute of simply rewarding the proper final answer (“end result supervision”). Along with boosting performance relative to end result supervision, process supervision also has a very important alignment profit: it directly trains the model to supply a chain-of-thought that’s endorsed by humans.