The web contains an enormous amount of publicly available videos that we can learn from. You can watch a person give an impressive presentation, a digital artist draw a beautiful sunset, and a Minecraft player build an intricate house. However, these videos only provide a record of what happened, not precisely how it was achieved, i.e., you won’t know the exact sequence of mouse movements and keys pressed. If we would like to build large-scale foundation models in these domains as we have done in language with GPT, this lack of action labels poses a new challenge not present in the language domain, where “action labels” are simply the next words in a sentence.
In order to utilize the wealth of unlabeled video data available on the web, we introduce a novel, yet simple, semi-supervised imitation learning method: Video PreTraining (VPT). We start by gathering a small dataset from contractors where we record not only their video, but also the actions they took, which in our case are keypresses and mouse movements. With this data we train an inverse dynamics model (IDM), which predicts the action being taken at each step in the video. Importantly, the IDM can use past and future information to guess the action at each step. This task is much easier and thus requires far less data than the behavioral cloning task of predicting actions given past video frames only, which requires inferring what the person wants to do and how to accomplish it. We can then use the trained IDM to label a much larger dataset of online videos and learn to act via behavioral cloning.
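To make the three-stage pipeline concrete, here is a minimal sketch of its structure, assuming toy tensor shapes, small MLPs in place of the large video models used in practice, and random stand-in data; the class names, dimensions, and window sizes are illustrative, not the actual VPT architecture or hyperparameters.

```python
# Sketch of the VPT pipeline: (1) train a non-causal IDM on a small labeled
# dataset, (2) pseudo-label a large unlabeled video corpus with the IDM,
# (3) train a causal policy on the pseudo-labels via behavioral cloning.
import torch
import torch.nn as nn
import torch.nn.functional as F

FRAME_DIM = 64      # flattened per-frame features (hypothetical)
NUM_ACTIONS = 16    # discretized keypress/mouse actions (hypothetical)
WINDOW = 5          # frames of context on each side for the IDM

class InverseDynamicsModel(nn.Module):
    """Non-causal: sees past AND future frames around step t to infer action a_t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear((2 * WINDOW + 1) * FRAME_DIM, 256), nn.ReLU(),
            nn.Linear(256, NUM_ACTIONS),
        )
    def forward(self, frame_window):          # (batch, 2*WINDOW+1, FRAME_DIM)
        return self.net(frame_window.flatten(1))

class Policy(nn.Module):
    """Causal: sees only past frames up to step t when choosing action a_t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(WINDOW * FRAME_DIM, 256), nn.ReLU(),
            nn.Linear(256, NUM_ACTIONS),
        )
    def forward(self, past_frames):           # (batch, WINDOW, FRAME_DIM)
        return self.net(past_frames.flatten(1))

def train_step(model, optimizer, frames, actions):
    loss = F.cross_entropy(model(frames), actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# 1) Train the IDM on the small contractor dataset (video + recorded actions).
idm = InverseDynamicsModel()
idm_opt = torch.optim.Adam(idm.parameters(), lr=1e-4)
contractor_frames = torch.randn(512, 2 * WINDOW + 1, FRAME_DIM)   # stand-in data
contractor_actions = torch.randint(0, NUM_ACTIONS, (512,))
for _ in range(10):
    train_step(idm, idm_opt, contractor_frames, contractor_actions)

# 2) Pseudo-label a much larger corpus of unlabeled web video with the IDM.
web_frames = torch.randn(4096, 2 * WINDOW + 1, FRAME_DIM)         # stand-in data
with torch.no_grad():
    pseudo_actions = idm(web_frames).argmax(dim=-1)

# 3) Behavioral cloning: train the causal policy on (past frames, pseudo-label).
policy = Policy()
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
past_only = web_frames[:, :WINDOW]            # drop the future half of each window
for _ in range(10):
    train_step(policy, policy_opt, past_only, pseudo_actions)
```

The key structural point the sketch illustrates is the asymmetry between the two models: the IDM conditions on frames both before and after step t, while the policy trained by behavioral cloning only ever sees the past, which is why labeling with the IDM is the easier, less data-hungry task.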