When a Specialized Time Series Model Outshines General LLMs

Abstract and 1. Introduction

  2. Related Work

  3. Methodology

  4. Experimental Setup and Results

  5. Conclusion and Future Work

Acknowledgments

Reproducibility statement

Impact statement and References

4. Experimental Setup and Results

We extend the experimental benchmark introduced by Wu et al. (2023) across various dimensions. Below, we outline the design choices of our benchmark and highlight its key distinctions from TimesNet [5].

Time series modeling with limited supervision. Our benchmark comprises 5 major time series modeling tasks of significant practical value, namely long- and short-horizon forecasting, imputation, classification, and anomaly detection, as outlined in Tab. 1. In contrast to TimesNet, we exclusively consider scenarios characterized by limited compute and supervision resources. These scenarios mimic practical situations where training (or fine-tuning) a deep neural network is infeasible due to resource limitations or insufficiently characterized data. Accordingly, we assess MOMENT in zero-shot settings whenever feasible and through linear probing for a few epochs otherwise.
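
To make the linear-probing protocol concrete, here is a minimal PyTorch sketch: the pre-trained backbone is frozen and only a small head is trained for a few epochs. The `encoder` module and its output shape are placeholders, not MOMENT's actual API.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Freeze a pre-trained backbone and train only a linear head."""
    def __init__(self, encoder: nn.Module, embed_dim: int, out_dim: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                 # backbone stays frozen
        self.head = nn.Linear(embed_dim, out_dim)   # the only trained layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            z = self.encoder(x)   # (batch, embed_dim) representations
        return self.head(z)

# Only the head's parameters go to the optimizer, e.g.:
# probe = LinearProbe(pretrained_encoder, embed_dim=1024, out_dim=96)
# optimizer = torch.optim.AdamW(probe.head.parameters(), lr=1e-4)
```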

For classification, we consider the unsupervised representation learning problem, where the goal is to learn representations of time series that are useful for downstream classification, without access to labeled data. As is common in prior work (Yue et al., 2022; Franceschi et al., 2019), the quality of representations is measured using the accuracy of a Support Vector Machine trained on them (App. E.2). For short-horizon forecasting, we consider the zero-shot setting introduced by Oreshkin et al. (2021). In particular, we fine-tune MOMENT on a source dataset using a forecasting head, and evaluate its performance on a target dataset without any fine-tuning (App. E.1.2, Tab. 21).

Table 2. Long-term forecasting performance measured using Mean Squared Error (MSE) and Mean Absolute Error (MAE). PatchTST performs the best across most settings, closely followed by MOMENT. Complete results in Tab. 18.

Table 3. Zero-shot short-horizon forecasting performance on a subset of the M3 and M4 datasets measured using sMAPE. Statistical methods outperformed their deeper counterparts. However, on some datasets (in bold), MOMENT, GPT4TS and N-BEATS achieved a lower sMAPE than ARIMA.
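
Following the classification protocol above, the representation-quality check can be scripted in a few lines with scikit-learn. The embeddings and labels below are random placeholders, and the RBF kernel with a small grid search over C mirrors common practice in the cited prior work; the exact configuration in App. E.2 may differ.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Placeholder frozen embeddings and labels; in practice these come from
# the pre-trained model's encoder applied to train/test time series.
rng = np.random.default_rng(0)
train_repr, test_repr = rng.normal(size=(100, 64)), rng.normal(size=(40, 64))
y_train, y_test = rng.integers(0, 2, 100), rng.integers(0, 2, 40)

# RBF-kernel SVM with a small grid search over the regularization strength.
svm = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10, 100]}, cv=3)
svm.fit(train_repr, y_train)
print("accuracy:", svm.score(test_repr, y_test))
```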

Datasets. We use the same datasets as TimesNet for forecasting and imputation. However, for classification and anomaly detection, we conduct experiments on larger, systematically chosen subsets of datasets from the UCR classification archive (Dau et al., 2018) and the UCR anomaly archive (Wu & Keogh, 2023). Specifically, we run classification experiments on all 91 time series datasets in which each time series is shorter than 512 time steps (Tab. 23). For anomaly detection, we chose the subset of time series to prioritize coverage of the different domains and data sources represented in the UCR anomaly archive (Tab. 22). We also note that the UCR anomaly archive was proposed as an improvement over pre-existing anomaly detection datasets such as SMD (Su et al., 2019) and SMAP (Hundman et al., 2018), many of which are also used in TimesNet. Our proposed experimental setup is summarized in Tab. 1 and detailed in App. E.

Metrics. We evaluate each experiment using multiple metrics from task-specific benchmarks, such as MSE and MAE for long-horizon forecasting, and sMAPE for short-horizon forecasting. We also note that TimesNet and GPT4TS (Zhou et al., 2023) evaluate anomaly detection performance using the vanilla F1 score, which ignores the sequential nature of time series. Instead, we measure anomaly detection performance with the widely used adjusted best F1 score (Goswami et al., 2023a; Challu et al., 2022) and the recently proposed VUS-ROC (Paparrizos et al., 2022a).
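
Since the adjusted best F1 may be less familiar than the vanilla F1, here is a minimal sketch of the standard point-adjustment convention: if any point inside a true anomalous segment is flagged, the whole segment counts as detected, and the best F1 is taken over a sweep of detection thresholds. This illustrates the general technique, not the exact implementation used in the cited works.

```python
import numpy as np
from sklearn.metrics import f1_score

def point_adjust(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """If any point in a true anomalous segment is flagged, mark the
    entire segment as detected (standard point adjustment)."""
    y_adj = y_pred.copy()
    in_segment, start = False, 0
    for i, label in enumerate(y_true):
        if label == 1 and not in_segment:
            in_segment, start = True, i
        if in_segment and (label == 0 or i == len(y_true) - 1):
            end = i if label == 0 else i + 1
            if y_adj[start:end].any():
                y_adj[start:end] = 1
            in_segment = False
    return y_adj

def adjusted_best_f1(y_true: np.ndarray, scores: np.ndarray,
                     n_thresholds: int = 100) -> float:
    """Sweep thresholds over anomaly scores; keep the best adjusted F1."""
    best = 0.0
    for t in np.linspace(scores.min(), scores.max(), n_thresholds):
        pred = (scores >= t).astype(int)
        best = max(best, f1_score(y_true, point_adjust(y_true, pred)))
    return best
```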

Baselines. We compare MOMENT with state-of-the-art deep learning and statistical machine learning models across tasks (Tab. 35). This contrasts with TimesNet, which primarily compared with transformer-based approaches. These comparisons are crucial for assessing the practical utility of the proposed methods. We found that statistical and non-transformer-based approaches, such as ARIMA for short-horizon forecasting, N-BEATS for long-horizon forecasting, and k-nearest neighbors for anomaly detection, outperform many deep and transformer-based models.
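
As one example of such a non-deep baseline, distance-based anomaly detection can be implemented with a k-nearest-neighbor search over sliding windows: windows far from all other windows score as anomalous. The window length and k below are illustrative choices, not necessarily the benchmark's configuration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_anomaly_scores(series: np.ndarray, window: int = 64,
                       k: int = 5) -> np.ndarray:
    """Score each sliding window by the distance to its k-th nearest window."""
    windows = np.lib.stride_tricks.sliding_window_view(series, window)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(windows)  # +1 skips self-match
    dist, _ = nn.kneighbors(windows)
    return dist[:, -1]  # large distance = no similar window = likely anomaly
```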

Hyper-parameter tuning. We do not perform hyper-parameter tuning. In all experiments that follow, unless mentioned otherwise, we fine-tune MOMENT-Large with a batch size of 64 and a one-cycle learning rate schedule with a peak learning rate between 5e-5 and 1e-3 (Smith & Topin, 2019). For baseline methods, we use the recommended settings from their papers and public repositories. We report all hyper-parameter settings for MOMENT and the baselines in App. E.
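
For reference, the described schedule corresponds directly to PyTorch's built-in one-cycle implementation; the model and step count below are placeholders for the actual training run.

```python
import torch

# Illustrative setup mirroring the described recipe: batch size 64 and a
# one-cycle schedule with a peak learning rate in [5e-5, 1e-3].
model = torch.nn.Linear(512, 96)            # stand-in for MOMENT-Large
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-4,          # peak learning rate, chosen between 5e-5 and 1e-3
    total_steps=1_000,    # epochs * steps_per_epoch for the actual run
)
# Inside the training loop: optimizer.step(); scheduler.step()
```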

Research questions. Through the following experiments, we aim to answer three broad research questions.

RQ1: Effectiveness. Is MOMENT effective for multiple time series analysis tasks in limited supervision settings?

RQ2: Interpretability. What is MOMENT learning? Does it capture intuitive time series characteristics such as varying frequencies, trends, and amplitudes?

RQ3: Properties. What is the impact of scaling model size? Can MOMENT, akin to LLMs, be used for cross-modal transfer learning?

4.1. MOMENT can solve multiple time series modeling tasks in limited supervision settings

Long-horizon forecasting. Linearly probing MOMENT achieves near state-of-the-art performance on most datasets and horizons, second only to PatchTST, which generally achieves the lowest MSE (Tab. 2). On many datasets and horizons, forecasting models based on LLMs, TimeLLM and GPT4TS, perform worse than MOMENT. Notably, N-BEATS outperforms several recent methods, emphasizing the importance of comparing forecasting performance beyond transformer-based approaches.

Zero-shot short-horizon forecasting. Among all tasks, we found zero-shot short-horizon forecasting to have the largest scope for improvement (Tab. 3). Statistical methods such as Theta and ETS outperformed their deeper counterparts. However, on some datasets, MOMENT achieved lower sMAPE than ARIMA.
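
For completeness, sMAPE as used in the M3/M4 competitions can be computed as follows (reported on a 0-200 scale); the epsilon guard is an implementation detail added here to avoid division by zero.

```python
import numpy as np

def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Symmetric mean absolute percentage error, M3/M4 convention (0-200)."""
    denom = np.abs(y_true) + np.abs(y_pred)
    denom = np.maximum(denom, 1e-8)   # guard against zero denominators
    return float(200.0 * np.mean(np.abs(y_true - y_pred) / denom))
```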

Classification. Without any data-specific fine-tuning, MOMENT can learn distinct representations for different classes of data (Fig. 5), and an SVM trained on its representations performs better than all but 4 methods specifically built for time series classification and trained on each individual dataset. The recently proposed GPT4TS and TimesNet perform poorly despite being trained with labels on each individual dataset.

Anomaly detection. On 44 time series from the UCR anomaly detection archive, MOMENT consistently outperformed both TimesNet and GPT4TS, as well as 2 state-of-the-art deep learning models tailored for anomaly detection, in both zero-shot and linear probing configurations. However, k-nearest neighbors performed marginally better in terms of the VUS-ROC score but had a lower adjusted best F1 score.

Imputation. Tab. 6 reports the imputation performance of all models, averaged over 4 different masking rates. MOMENT with linear probing achieved the lowest reconstruction error on all ETT datasets. In the zero-shot setting, MOMENT consistently outperformed all statistical interpolation methods, with the exception of linear interpolation.
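
A compact sketch of this evaluation: hide a fraction of points, fill them with the linear-interpolation baseline, and score MSE on the hidden points only. The four masking rates below follow the common TimesNet setup and are an assumption here; the appendix may specify different values.

```python
import numpy as np

def masked_imputation_mse(x: np.ndarray, mask_rate: float, rng) -> float:
    """Hide a fraction of points, impute by linear interpolation,
    and measure reconstruction MSE on the hidden points only."""
    mask = rng.random(x.shape) < mask_rate            # True = hidden point
    idx = np.arange(len(x))
    filled = np.interp(idx, idx[~mask], x[~mask])     # linear interpolation
    return float(np.mean((filled[mask] - x[mask]) ** 2))

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 20, 512)) + 0.1 * rng.normal(size=512)
# Average over four masking rates (assumed here to match TimesNet).
rates = (0.125, 0.25, 0.375, 0.5)
print(np.mean([masked_imputation_mse(x, r, rng) for r in rates]))
```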

4.2. What is MOMENT Learning?

We found that MOMENT can capture changes in intuitive time series characteristics such as trend, amplitude, frequency, and phase. However, it cannot differentiate between vertically shifted time series, as it normalizes each signal prior to modeling (Fig. 4, 7). Furthermore, on many classification datasets, MOMENT learns distinct representations of different classes, even in a zero-shot setting without access to labels (Fig. 5, 8).
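
The vertical-shift blind spot follows directly from per-instance normalization, as this tiny numpy check illustrates:

```python
import numpy as np

x = np.sin(np.linspace(0, 10, 256))
shifted = x + 5.0                       # vertically shifted copy

def instance_norm(s: np.ndarray) -> np.ndarray:
    """Per-signal z-normalization applied before modeling."""
    return (s - s.mean()) / s.std()

# After normalization the two signals are identical, so a model that
# normalizes each input cannot tell them apart.
print(np.allclose(instance_norm(x), instance_norm(shifted)))  # True
```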

4.3. Properties of Large Time Series Models

Model scaling improves training loss. Like LLMs, we found that increasing the size of the model leads to lower training loss, even before the first epoch (Fig. 6, left). An immediate next step is to assess how effectively this phenomenon extends to time series modeling tasks under limited supervision.

MOMENT can solve cross-modal sequence learning tasks. Lu et al. (2022) first showed that large pre-trained language and vision transformers can solve general sequence learning tasks for modalities beyond text and images with minimal fine-tuning. Several recent studies have leveraged these properties to reprogram LLMs for time series tasks. We explore whether transformers pre-trained on time series can likewise solve sequence classification tasks on image, text, and binary data. Our results confirm that by freezing the self-attention and feed-forward layers, MOMENT can model these sequences on par with GPT-2 and Flan-T5 models of similar scale (Tab. 5).
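
In this setup, only the self-attention and feed-forward weights are frozen while embeddings, layer norms, and the task head remain trainable, following Lu et al. (2022). A generic PyTorch sketch of the freezing step (the parameter-name substrings are illustrative and depend on the architecture):

```python
import torch.nn as nn

def freeze_for_cross_modal_transfer(model: nn.Module) -> None:
    """Freeze self-attention and feed-forward parameters; leave embeddings,
    layer norms, and the output head trainable (Lu et al., 2022 recipe)."""
    for name, param in model.named_parameters():
        # Substrings below match common transformer naming conventions;
        # adapt them to the actual module names of the model in use.
        if any(key in name for key in ("attn", "attention", "mlp", "ffn")):
            param.requires_grad = False
```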

Table 4. Classification accuracy of methods across 91 UCR datasets. Methods with mean and median accuracy higher than MOMENT are in bold. MOMENT without fine-tuning on individual datasets demonstrates promising accuracy. Complete results in Tab. 23.

Table 5. Cross-modal transfer experiments. Accuracy measured on the test set, from the checkpoint with the lowest train loss. Even with frozen self-attention and feed-forward layers, MOMENT is able to model cross-modal sequences on par with GPT-2 and Flan-T5 models of similar scale.

MOMENT with randomly initialized weights converges to a lower training loss. Our observations suggest that with sufficient data, pre-training our model from scratch results in a lower training loss than continually pre-training a model of similar size initialized with language modeling weights (Fig. 6, 12). This also underscores that there is sufficient publicly accessible pre-training data available in the Time Series Pile to facilitate pre-training time series foundation models from scratch.

Authors:

(1) Mononito Goswami, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA ([email protected]);

(2) Konrad Szafer, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA, with equal contribution, order decided using a random generator;

(3) Arjun Choudhry, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA, with equal contribution, order decided using a random generator;

(4) Yifu Cai, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA;

(5) Shuo Li, University of Pennsylvania, Philadelphia, USA;

(6) Artur Dubrawski, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA.


[5] In this section, we use TimesNet to refer to the benchmark proposed by Wu et al. (2023) instead of their model.
