This paper addresses the problem of learning robust autonomous driving policies from large-scale real-world datasets. Given the challenges of online data collection, we first study behavior cloning (BC), comparing several baseline models, including a Transformer-based model with an entity-centric state representation. However, BC policies prove brittle in long-horizon simulation. To address this, we apply Conservative Q-Learning (CQL), a state-of-the-art offline reinforcement learning algorithm, to the same data and architecture to learn more robust policies. With a carefully designed reward function, the CQL agent learns a conservative value function, enabling it to recover from minor errors and avoid out-of-distribution states. In a large-scale evaluation on 1,000 unseen scenarios from the Waymo Open Motion Dataset, the CQL agent achieves a 3.2x higher success rate and a 7.4x lower crash rate than the best-performing BC baseline. These results demonstrate the value of offline reinforcement learning for learning robust, long-horizon driving policies from static expert data.
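To make the conservatism idea concrete, the following is a minimal, illustrative sketch of the standard CQL regularizer on a toy discrete-action Q-table; the function name, hyperparameter, and toy batch are assumptions for exposition, not the paper's implementation (which operates on continuous driving actions with learned networks).

```python
import numpy as np

def cql_penalty(q_values, actions, alpha=1.0):
    """CQL regularizer: alpha * (logsumexp_a Q(s, a) - Q(s, a_dataset)).

    Minimizing this pushes Q-values down on actions not taken in the
    dataset and up on dataset actions, discouraging the policy from
    drifting into out-of-distribution states.

    q_values: (batch, num_actions) array of Q-estimates.
    actions:  (batch,) indices of the actions logged in the dataset.
    """
    logsumexp = np.log(np.sum(np.exp(q_values), axis=1))    # soft max over all actions
    dataset_q = q_values[np.arange(len(actions)), actions]  # Q of the logged action
    return alpha * np.mean(logsumexp - dataset_q)

# Toy batch: 2 states, 3 actions; the dataset action is index 0 in both.
q = np.array([[1.0, 0.0, 0.0],
              [2.0, 0.0, 0.0]])
penalty = cql_penalty(q, actions=np.array([0, 0]))
# penalty is strictly positive whenever non-dataset actions carry Q-mass
```

In a full training loop this penalty would be added to the ordinary Bellman (TD) loss; the weight `alpha` trades off conservatism against fitting the data.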