This paper studies the approximation and convergence behavior of single-layer transformers for next-token prediction in both noise-free and noisy settings. Prior theoretical results have characterized inference behavior only after a single gradient step or in the limit of infinitely many samples, leaving convergence speed and generalization ability unresolved. This study addresses that gap by proving the existence of a class of provably Bayes-optimal single-layer transformers with linear and ReLU attention. Through a finite-sample analysis, we show that when these transformers are trained with gradient descent, their expected loss converges linearly to the Bayes risk. We further show that the trained models generalize well to unseen samples and exhibit learning behaviors observed empirically in prior studies. Extensive experiments support these theoretical findings.
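To make the setting concrete, the following is a minimal sketch of a single-layer transformer with linear or ReLU attention producing next-token logits. All names, shapes, and the score parameterization here are illustrative assumptions, not the paper's exact architecture or training setup.

```python
import numpy as np

def single_layer_attention(X, W_q, W_k, W_v, activation="linear"):
    """One attention layer over a sequence X of shape (T, d).

    `activation` selects the attention nonlinearity:
    "linear" uses raw scores; "relu" applies ReLU to the scores
    (both replace the usual softmax). This parameterization is a
    hypothetical stand-in for the paper's formal definitions.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])
    if activation == "relu":
        weights = np.maximum(scores, 0.0)  # ReLU attention
    else:
        weights = scores                   # linear attention
    return weights @ V

def next_token_logits(X, params, vocab_proj, activation="linear"):
    # Use the representation of the last position to score the next token.
    H = single_layer_attention(X, *params, activation=activation)
    return H[-1] @ vocab_proj

# Tiny illustrative forward pass with random weights.
rng = np.random.default_rng(0)
d, T, vocab = 8, 5, 10
X = rng.normal(size=(T, d))
params = tuple(rng.normal(size=(d, d)) * 0.1 for _ in range(3))
vocab_proj = rng.normal(size=(d, vocab)) * 0.1
logits_lin = next_token_logits(X, params, vocab_proj, "linear")
logits_relu = next_token_logits(X, params, vocab_proj, "relu")
print(logits_lin.shape, logits_relu.shape)
```

In a training loop, these logits would feed a cross-entropy (or squared) loss whose gradient-descent dynamics are the object of the paper's finite-sample analysis.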