To address the low accuracy and slow real-time updates of existing air quality prediction models, this paper proposes Ada-TransGNN, a Transformer-based spatiotemporal prediction method that integrates global spatial semantics and temporal behavior. Ada-TransGNN builds a set of efficient, collaborative spatiotemporal blocks that combine a multi-head attention mechanism and a graph convolutional network to extract dynamically changing spatiotemporal dependence features from complex air quality monitoring data. To account for the interactions between monitoring points, we propose an adaptive graph structure learning module that learns an optimal graph structure from the spatiotemporal dependence features in a data-driven manner, so that spatial relationships between monitoring points are captured more accurately. Furthermore, we design an auxiliary task learning module that incorporates spatial contextual information into the optimal graph structure representation, which strengthens the decoding of temporal relationships and improves prediction accuracy. Comprehensive evaluations on benchmark datasets and a new dataset (Mete-air) demonstrate that the proposed model outperforms existing state-of-the-art prediction models in both short-term and long-term prediction.
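The abstract does not specify how the adaptive graph structure is parameterized. As a rough illustration only, the sketch below uses one common formulation from the adaptive-graph literature: two node-embedding matrices whose product is passed through ReLU and a row-wise softmax to yield a dense adjacency matrix, which then drives a single graph-convolution step. This is an assumption, not necessarily the Ada-TransGNN module, which is described as also drawing on spatiotemporal dependence features. The names `adaptive_adjacency`, `graph_conv`, `src_emb`, and `dst_emb` are hypothetical, and the random arrays stand in for parameters that would be learned during training.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def adaptive_adjacency(src_emb, dst_emb):
    """Build a dense adjacency matrix from two node-embedding matrices.

    Assumed formulation (not taken from the paper):
    A = softmax(relu(E_src @ E_dst.T)), normalized over each row.
    """
    scores = np.maximum(src_emb @ dst_emb.T, 0.0)  # ReLU keeps scores non-negative
    return softmax(scores, axis=1)                 # each row sums to 1


def graph_conv(adj, features, weight):
    """One graph-convolution step: aggregate neighbours, project, apply ReLU."""
    return np.maximum(adj @ features @ weight, 0.0)


# Toy setup: 5 monitoring stations, 3 input features per station (all hypothetical).
rng = np.random.default_rng(0)
num_stations, emb_dim, in_dim, out_dim = 5, 4, 3, 2

src_emb = rng.normal(size=(num_stations, emb_dim))   # stand-in for learned embeddings
dst_emb = rng.normal(size=(num_stations, emb_dim))
features = rng.normal(size=(num_stations, in_dim))   # stand-in for station readings
weight = rng.normal(size=(in_dim, out_dim))          # stand-in for a learned projection

adj = adaptive_adjacency(src_emb, dst_emb)
hidden = graph_conv(adj, features, weight)
```

In a trainable version, the embeddings and projection weight would be optimized jointly with the rest of the network, so the graph is shaped by the prediction loss rather than fixed in advance from station distances.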