This paper critically evaluates the generalizability and robustness of asset pricing and stock trading strategies based on large language models (LLMs). We argue that previous studies have overestimated the effectiveness of LLM strategies because they evaluated them over narrow time horizons and small sets of stocks. We propose FINSABER, a backtesting framework that evaluates LLM-based market timing strategies over a horizon of more than 20 years and a universe of over 100 stocks.