
This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora’s capabilities and limitations. Model and implementation details are not included in this report.
Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks,[^1][^2][^3] generative adversarial networks,[^4][^5][^6][^7] autoregressive transformers,[^8][^9] and diffusion models.[^10][^11][^12] These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data: it can generate videos and images spanning diverse durations, aspect ratios, and resolutions, up to a full minute of high-definition video.