Despite the rapid advancement of diffusion-based generative models, portrait image animation still struggles with temporally consistent video generation and fast sampling due to its iterative sampling nature. In this paper, we present FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. We shift the generative modeling from a pixel-based latent space to a learned motion latent space, enabling efficient generation and editing of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. In addition, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Through extensive experiments, we demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.
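As a rough illustration of the core idea, the sketch below implements conditional flow matching over a sequence of motion latents, using a small transformer vector field predictor conditioned frame-wise on per-frame audio features, followed by few-step Euler ODE sampling. All class names, dimensions, and the training/sampling routines are illustrative assumptions for exposition, not the paper's actual architecture or implementation.

```python
# Hypothetical sketch of flow matching in a motion latent space with
# frame-wise audio conditioning. Not the authors' implementation.
import torch
import torch.nn as nn

class VectorFieldPredictor(nn.Module):
    def __init__(self, latent_dim=64, audio_dim=128, d_model=256, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, d_model)
        # Frame-wise conditioning: each frame's audio feature (plus the
        # scalar flow time t) is projected and added to that frame's token.
        self.cond_proj = nn.Linear(audio_dim + 1, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, latent_dim)

    def forward(self, x_t, t, audio):
        # x_t:   (B, T, latent_dim) motion latents at flow time t
        # t:     (B,) flow time in [0, 1]
        # audio: (B, T, audio_dim) per-frame audio features
        t_feat = t[:, None, None].expand(-1, x_t.size(1), 1)
        cond = self.cond_proj(torch.cat([audio, t_feat], dim=-1))
        h = self.backbone(self.in_proj(x_t) + cond)
        return self.out_proj(h)  # predicted velocity, same shape as x_t

def flow_matching_loss(model, x1, audio):
    """Conditional flow matching: regress the straight-line velocity
    from noise x0 to data x1 along x_t = (1 - t) * x0 + t * x1."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), device=x1.device)
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1
    v_target = x1 - x0
    return ((model(x_t, t, audio) - v_target) ** 2).mean()

@torch.no_grad()
def sample(model, audio, latent_dim=64, steps=10):
    """Few-step Euler integration of the learned ODE; far fewer network
    evaluations than the iterative sampling of a diffusion model."""
    B, T = audio.shape[:2]
    x = torch.randn(B, T, latent_dim, device=audio.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((B,), i * dt, device=audio.device)
        x = x + dt * model(x, t, audio)
    return x  # motion latents, to be decoded into video frames
```

The sampling-efficiency claim hinges on the last routine: because flow matching learns a near-straight probability path, the ODE can be integrated in a handful of Euler steps, whereas diffusion sampling typically requires many iterative denoising passes.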