This paper investigates whether large language models (LLMs) can effectively leverage causal knowledge for prediction and generation. We experimentally demonstrate that LLMs trained directly on large-scale data tend to learn spurious correlations rather than true causal relationships, resulting in poor performance, particularly in out-of-distribution (OOD) scenarios. To address this, we propose Causal Attention Tuning (CAT), a novel method for injecting fine-grained causal knowledge into the attention mechanism. CAT automatically generates token-level causal cues from prior human knowledge and introduces a re-attention mechanism to guide training, helping the model focus on causal structures while mitigating noise and bias in attention scores. Experimental results on our proposed Spurious Token Game (STG) benchmark and several downstream tasks demonstrate that CAT effectively leverages causal knowledge for prediction and remains robust in OOD scenarios. CAT achieves an average performance improvement of 5.76% on the STG dataset and 1.56% on downstream tasks. Notably, the OOD performance of the Llama-3.1-8B model on STG_M improves from 64.5% to 90.5%, and the OOD performance of the Qwen model on STG_H improves from 25.4% to 55.9%.
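To make the re-attention idea concrete, the sketch below shows one plausible way token-level causal cues could steer attention during fine-tuning: an auxiliary loss that pulls each head's attention distribution toward the tokens marked as causal, added to the usual language-modeling objective. This is a minimal illustration under assumed names (`attn_weights`, `causal_cues`, `lambda_attn`), not the paper's actual implementation of CAT.

```python
# Hedged sketch of a re-attention auxiliary loss (assumed interface, not the paper's code).
import torch


def re_attention_loss(attn_weights: torch.Tensor,
                      causal_cues: torch.Tensor,
                      eps: float = 1e-8) -> torch.Tensor:
    """Encourage attention mass to concentrate on tokens flagged as causal.

    attn_weights: (batch, heads, query_len, key_len) softmax attention scores.
    causal_cues:  (batch, key_len) binary cues, 1 for causally relevant tokens.
    """
    # Target distribution: uniform over the tokens marked as causal.
    target = causal_cues.float()
    target = target / (target.sum(dim=-1, keepdim=True) + eps)  # (B, K)
    target = target[:, None, None, :]                           # broadcast over heads/queries

    # Cross-entropy between the target and the model's attention distribution.
    log_attn = torch.log(attn_weights + eps)
    return -(target * log_attn).sum(dim=-1).mean()


# During training, the auxiliary term would be combined with the LM loss, e.g.:
#   loss = lm_loss + lambda_attn * re_attention_loss(attn_weights, causal_cues)
```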