Built on technology from DALL-E 3, OpenAI's flagship text-to-image model, Sora combines a diffusion model with a transformer, allowing it to process video data across both space and time. The transformer's capacity for long sequences, familiar from language models like GPT-4, lets Sora train on videos of varying resolution, duration, aspect ratio, and orientation. While the showcased videos highlight Sora's strengths, including high-definition output and effective occlusion handling, OpenAI acknowledges the need for further refinement, particularly in ensuring long-term coherence.
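To make the "space and time" framing concrete, the sketch below shows one common way a video can be turned into a token sequence for a transformer: cutting it into spacetime patches and flattening each patch into a vector. This is a minimal illustration of the general technique; the function name, patch sizes, and array layout here are assumptions for the example, not Sora's published internals.

```python
import numpy as np

def video_to_spacetime_patches(video, pt=2, ph=16, pw=16):
    """Split a video array (frames, height, width, channels) into
    flattened spacetime patches -- the token sequence a diffusion
    transformer could operate on. Patch sizes pt/ph/pw are illustrative;
    dimensions must divide evenly here, whereas a real pipeline would
    pad or resize first."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Split each axis into (grid, within-patch) pairs...
    patches = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # ...move the grid axes to the front, within-patch axes to the back...
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)
    # ...and flatten every (pt x ph x pw x C) block into one token vector.
    return patches.reshape(-1, pt * ph * pw * C)

# A 16-frame 48x64 RGB clip becomes a short sequence of patch tokens.
clip = np.zeros((16, 48, 64, 3), dtype=np.float32)
tokens = video_to_spacetime_patches(clip)
print(tokens.shape)  # (96, 1536): 8*3*4 patches, each 2*16*16*3 values
```

Because each token covers a small block of both pixels and frames, the same sequence-based machinery that handles text can attend across space and time at once, and clips of any size simply yield shorter or longer sequences.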