This paper proposes a deepfake video detection method that exploits pixel-level temporal inconsistencies to overcome the limitations of existing detectors based on spatial frequency. Existing methods represent temporal information simply by stacking per-frame spatial frequency spectra, and therefore fail to capture temporal artifacts at the pixel level. The proposed method applies a 1D Fourier transform along the temporal axis of each pixel, yielding features that are highly sensitive to temporal inconsistencies and particularly effective in regions where unnatural movements are likely to occur. In addition, we introduce an attention proposal module, trained end-to-end, that localizes regions containing temporal artifacts, and a joint transformer module that integrates spatio-temporal context with the pixel-level temporal frequency features, broadening the range of detectable forgery artifacts. The method performs robustly across diverse and challenging detection scenarios, contributing to the advancement of deepfake video detection.
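
To make the pixel-wise temporal transform concrete, the following is a minimal NumPy sketch rather than the implementation used in this work. The function name `pixelwise_temporal_spectrum`, the grayscale `(T, H, W)` clip layout, the temporal mean removal, and the magnitude-only output are all assumptions made for illustration; the attention proposal module and the joint transformer are not modeled here.

```python
import numpy as np


def pixelwise_temporal_spectrum(frames: np.ndarray) -> np.ndarray:
    """Magnitude spectrum of each pixel's intensity over time.

    frames: grayscale clip of shape (T, H, W) (assumed layout).
    Returns an array of shape (T // 2 + 1, H, W).
    """
    if frames.ndim != 3:
        raise ValueError("expected a (T, H, W) array")
    # Remove each pixel's temporal mean so that static regions
    # carry no energy in any frequency bin (illustrative choice).
    centered = frames - frames.mean(axis=0, keepdims=True)
    # 1D real FFT along the temporal axis, independently per pixel.
    spectrum = np.fft.rfft(centered, axis=0)
    return np.abs(spectrum)


# Toy clip: a static scene in which one pixel flickers over time.
T, H, W = 16, 4, 4
clip = np.full((T, H, W), 0.5)
t = np.arange(T)
clip[:, 1, 2] += 0.1 * np.cos(2 * np.pi * 4 * t / T)  # 4 cycles per clip

features = pixelwise_temporal_spectrum(clip)
energy = features.sum(axis=0)  # per-pixel temporal energy map, shape (H, W)
peak_pixel = np.unravel_index(energy.argmax(), energy.shape)
peak_bin = int(features[:, 1, 2].argmax())
```

In this toy clip the flickering pixel is the only location with non-negligible temporal energy, and its spectrum peaks at the frequency bin of the injected flicker. A per-frame spatial spectrum would instead spread that single-pixel change across all spatial frequencies, which is the motivation for transforming along time at each pixel.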