Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CauKer: classification time series foundation models can be pretrained on synthetic data only

Created by
  • Haebom

Authors

Shifeng Xie, Vasilii Feofanov, Marius Alonso, Ambroise Odonnat, Jianfeng Zhang, Themis Palpanas, Ievgen Redko

Outline

This paper proposes CauKer, a novel algorithm for efficiently pretraining time series foundation models (TSFMs) without relying on computationally expensive pretraining over large-scale real-world time series data. CauKer combines Gaussian Process (GP) kernel composition with Structural Causal Models (SCMs) to generate diverse, causally coherent synthetic time series with realistic trends, seasonality, and nonlinear interactions. The generated data is used to pretrain state-of-the-art classification TSFMs that differ in architecture and pretraining method. The experiments show that CauKer-generated datasets, unlike real-world datasets, exhibit clear scaling laws with respect to both dataset size (10,000 to 10 million samples) and model capacity (1 million to 783 million parameters).
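To make the idea of combining GP kernels with a causal graph more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the base kernels, their hyperparameters, the sum/product composition rule, the activation functions, and the names `random_composed_kernel`, `sample_gp`, and `sample_scm_series` are all assumptions chosen for illustration. Root nodes are plain GP draws; every later node applies a random nonlinearity to a weighted mix of earlier nodes, so the graph stays acyclic.

```python
import numpy as np

# Illustrative base kernels on a 1-D time grid (hyperparameters are assumptions).
def rbf_kernel(t, length_scale=0.2):
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def periodic_kernel(t, period=0.25, length_scale=1.0):
    d = np.abs(t[:, None] - t[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length_scale ** 2)

def linear_kernel(t, offset=0.5):
    return (t[:, None] - offset) * (t[None, :] - offset)

def random_composed_kernel(t, rng):
    """Combine two randomly chosen base kernels by sum or product."""
    bases = [rbf_kernel, periodic_kernel, linear_kernel]
    i, j = rng.integers(0, len(bases), size=2)
    k1, k2 = bases[i](t), bases[j](t)
    return k1 + k2 if rng.random() < 0.5 else k1 * k2

def sample_gp(cov, rng):
    """Draw one zero-mean GP path; eigenvalues are clipped for numerical safety."""
    w, v = np.linalg.eigh(cov)
    w = np.clip(w, 0.0, None)
    return v @ (np.sqrt(w) * rng.standard_normal(len(w)))

def sample_scm_series(n_nodes=4, length=128, seed=0):
    """Generate n_nodes series linked by a random DAG (hypothetical structure)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, length)
    activations = [np.tanh, np.sin, lambda x: np.maximum(x, 0.0)]
    nodes = []
    for idx in range(n_nodes):
        own = sample_gp(random_composed_kernel(t, rng), rng)
        if idx == 0:
            nodes.append(own)  # root node: pure GP draw
            continue
        # Parents are drawn only from earlier nodes, which guarantees a DAG.
        n_parents = rng.integers(1, idx + 1)
        parents = rng.choice(idx, size=n_parents, replace=False)
        weights = rng.normal(size=n_parents)
        mix = sum(w * nodes[p] for w, p in zip(weights, parents))
        f = activations[rng.integers(len(activations))]
        nodes.append(f(mix) + 0.1 * own)  # nonlinear effect plus GP noise
    return t, np.stack(nodes)

t, series = sample_scm_series(n_nodes=4, length=128, seed=0)
print(series.shape)
```

In a sketch like this, the sum and product of kernels supply trend and seasonality, while the random nonlinearities along the graph edges supply the nonlinear interactions between series. How CauKer actually selects kernels, builds the graph, and assigns class labels is described in the paper itself.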

Takeaways, Limitations

Takeaways:
The paper presents an efficient TSFM pretraining method that reduces both the dependence on large real-world datasets and the computational cost.
Synthetic datasets generated with CauKer follow regular scaling laws, which provides useful insight for model development and performance analysis.
The data generation method is general and applies to TSFMs with different architectures and pretraining methods.
Limitations:
The synthetic data generated by CauKer may not perfectly reflect all the complexities of real data.
The reported scaling laws may hold only in the experimental settings studied and could differ under other conditions.
A more in-depth qualitative assessment of synthetic data and comparative analysis with real data is needed.