
DUET: Dual Clustering Enhanced Multivariate Time Series Forecasting

Xiangfei Qiu, Xingjian Wu, Yan Lin, Chenjuan Guo, Jilin Hu, Bin Yang
2025·Computer Science·27 citations
8 min read

Read the full paper via its DOI or on arXiv

TL;DR

DUET addresses two coupled difficulties in MTSF: temporal distribution shift (heterogeneous temporal patterns) and complex/noisy channel interdependencies.

Briefing

The paper addresses a core problem in multivariate time series forecasting (MTSF): how to achieve accurate predictions when (i) the temporal behavior of real-world series changes over time (temporal distribution shift, TDS), producing heterogeneous temporal patterns, and (ii) relationships among channels (variables) are complex, intertwined, and noisy, making it difficult to model channel interactions robustly. These two issues matter because many practical forecasting settings—traffic, energy, weather, finance, health, and web activity—are non-stationary and involve many correlated signals. If a model assumes a single stationary temporal pattern or uses a dense, indiscriminate channel interaction mechanism, it can underfit heterogeneity or amplify irrelevant/noisy channels.

To tackle both challenges simultaneously, the authors propose DUET (Dual Clustering Enhanced Multivariate Time Series Forecasting), a general framework that performs dual clustering along the temporal and channel dimensions. The Temporal Clustering Module (TCM) explicitly models heterogeneous temporal patterns by clustering time series windows into fine-grained latent distribution groups and routing each channel’s univariate series to a corresponding set of pattern extractors. The Channel Clustering Module (CCM) models cross-channel dependencies using a channel-soft-clustering strategy in the frequency domain: it learns a metric over Fourier-transformed channel representations, converts distances into probabilistic channel relationships, and then sparsifies these relationships into a learned, approximately binary channel mask via Gumbel-softmax reparameterization. Finally, a Fusion Module (FM) uses masked attention to combine temporal features from TCM with the sparse channel mask from CCM, followed by a linear predictor.

Methodologically, DUET is implemented as a neural architecture with the following high-level pipeline: (1) instance normalization to unify train/test distributions; (2) TCM to compute temporal features per channel using a distribution router and multiple linear-based pattern extractors; (3) CCM to compute a channel mask matrix that indicates which channels should attend to which others; (4) FM to fuse temporal features using masked attention, and (5) a linear projection to forecast the next steps.
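
Step (1) of this pipeline, instance normalization, is commonly implemented as reversible instance normalization: normalize each input window by its own statistics, then de-normalize the forecast. A minimal sketch under that assumption (function names are ours, not the paper's):

```python
import torch

def instance_norm(x, eps=1e-5):
    """Normalize each series instance over its look-back window.

    x: (batch, channels, length). Returns the normalized series plus
    the per-instance statistics needed to de-normalize the forecast.
    """
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True) + eps
    return (x - mean) / std, mean, std

def denorm(y, mean, std):
    # Map the model's (normalized-scale) forecast back to the original scale.
    return y * std + mean
```

Because the statistics travel with the sample rather than with the training set, this unifies train and test distributions under shift, which is the stated purpose of the step.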

TCM uses a distribution router inspired by variational autoencoders and noisy gating. For each channel’s univariate series, two MLP-like encoders produce the mean and variance parameters of the candidate latent distributions. A noisy gating mechanism (implemented with reparameterization-style sampling) projects these to distribution logits, selects the top distributions using a KeepTopK operator, and normalizes the kept logits with softmax. Each selected distribution routes the series to one of several linear-based pattern extractors. Each extractor decomposes the series into trend and seasonal components using a moving average (trend via average pooling; seasonal as the residual) and applies a separate linear transformation to each component, producing a feature vector. An aggregator then forms the channel’s final temporal feature as a weighted sum of the extractor outputs, weighted by the router’s gates.
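
A sketch of what one linear-based extractor and the noisy KeepTopK router might look like (the kernel size, dimensions, single-channel shapes, and all names are illustrative assumptions, not the paper's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearPatternExtractor(nn.Module):
    """One extractor: moving-average trend plus residual seasonal part,
    each mapped by its own linear layer."""

    def __init__(self, seq_len, d_model, kernel=25):
        super().__init__()
        self.kernel = kernel
        self.trend_proj = nn.Linear(seq_len, d_model)
        self.season_proj = nn.Linear(seq_len, d_model)

    def forward(self, x):  # x: (batch, seq_len), one univariate window
        # Trend via average pooling with edge replication, as in the
        # standard series-decomposition block; seasonal = residual.
        pad = (self.kernel - 1) // 2
        xp = F.pad(x.unsqueeze(1), (pad, self.kernel - 1 - pad), mode="replicate")
        trend = F.avg_pool1d(xp, self.kernel, stride=1).squeeze(1)
        season = x - trend
        return self.trend_proj(trend) + self.season_proj(season)

def noisy_top_k(logits, noise_std, k):
    """KeepTopK-style routing: perturb logits with noise, keep the k
    largest, softmax over the survivors (the rest get weight 0)."""
    noisy = logits + noise_std * torch.randn_like(logits)
    topv, topi = noisy.topk(k, dim=-1)
    masked = torch.full_like(noisy, float("-inf"))
    masked.scatter_(-1, topi, topv)
    return F.softmax(masked, dim=-1)
```

The aggregator described above would then compute `sum_i gate[i] * extractor_i(x)` using these gate weights.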

CCM first transforms each channel to the frequency domain using the real FFT, then normalizes the magnitude (amplitude) representation. It learns a Mahalanobis distance metric whose matrix is kept positive semidefinite (constructed as the product of a learnable matrix with its transpose). Distances are converted into relationship scores (with the diagonal zeroed) and then normalized into probabilities using a discount factor. The final channel mask is sampled from these probabilities via Gumbel-softmax reparameterization so that gradients can flow through the sampling step. This sparsification is intended to reduce the influence of irrelevant or noisy channels while retaining beneficial neighbors.
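
The CCM steps can be sketched as follows (the exponential score function, the self-loop handling, and all names are our assumptions; the paper's exact discounting differs in detail):

```python
import torch
import torch.nn.functional as F

def channel_mask(x, A, tau=1.0):
    """Sketch of CCM.

    x: (channels, length) raw series; A: (freq_dim, rank) learnable
    matrix, so M = A @ A.T is positive semidefinite by construction.
    """
    # Frequency-domain amplitude representation per channel.
    amp = torch.fft.rfft(x, dim=-1).abs()
    amp = amp / (amp.norm(dim=-1, keepdim=True) + 1e-8)
    # Mahalanobis distances d_ij = (a_i - a_j)^T M (a_i - a_j),
    # computed as squared norms of the A-projected differences.
    diff = amp.unsqueeze(1) - amp.unsqueeze(0)   # (C, C, freq_dim)
    proj = diff @ A                              # (C, C, rank)
    dist = (proj ** 2).sum(-1)                   # (C, C)
    # Closer channels get a larger keep-probability.
    p = torch.exp(-dist)
    p.fill_diagonal_(1.0)                        # always keep self
    # Differentiable Bernoulli sampling via Gumbel-softmax over the
    # two outcomes {keep, drop}.
    logits = torch.stack([p.log(), (1 - p + 1e-8).log()], dim=-1)
    sample = F.gumbel_softmax(logits, tau=tau, hard=True)
    mask = sample[..., 0].clone()                # near-binary (C, C) mask
    mask.fill_diagonal_(1.0)
    return mask
```

With `hard=True`, the forward pass yields near-binary mask entries while the backward pass uses the soft relaxation, which is what makes the sparsification trainable.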

The Fusion Module uses a masked attention mechanism: it computes queries, keys, and values from the temporal features and applies the channel mask to the attention scores, ensuring that only the selected channel-to-channel interactions contribute to the softmax. The paper reports that dual clustering reduces the need for dense attention over all channels, improving efficiency; the theoretical complexity analysis claims DUET’s dominant fusion cost per layer is comparable to that of other channel-attention approaches.
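
A minimal single-head sketch of masked channel attention (here queries, keys, and values all come from the same features, omitting the learned projections a real module would use):

```python
import torch
import torch.nn.functional as F

def masked_channel_attention(h, mask):
    """Fuse per-channel temporal features with attention restricted
    to the channel pairs kept by the CCM mask.

    h: (channels, d_model) temporal features;
    mask: (channels, channels) with 1 = attend, 0 = block.
    """
    d = h.shape[-1]
    scores = h @ h.T / d ** 0.5                       # (C, C) scores
    # Blocked pairs get -inf so they vanish after the softmax.
    scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ h
```

With an identity mask this degenerates to channel-independent modeling; with an all-ones mask it is ordinary dense channel attention, so the learned mask interpolates between the two regimes.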

For evaluation, the authors conduct extensive experiments on 25 real-world datasets spanning 10 domains, using the TFB benchmark codebase for unified evaluation. In the main text they report results on 10 datasets: ETTh1, ETTh2, ETTm1, ETTm2, Exchange, Weather, Electricity, ILI, Traffic, and Solar. They use mean squared error (MSE) and mean absolute error (MAE) as metrics. Forecasting horizons vary by dataset: shorter horizons for ILI and the other short datasets, and longer horizons for the rest. Look-back windows are tuned per dataset family (e.g., 36/104 for some shorter series; 96/336/512 for others). Training uses the L1 loss with the Adam optimizer in PyTorch on an NVIDIA Tesla A800 GPU.
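
For reference, the two reported metrics are the standard ones:

```python
import torch

def mse(pred, target):
    """Mean squared error over all elements."""
    return ((pred - target) ** 2).mean().item()

def mae(pred, target):
    """Mean absolute error over all elements."""
    return (pred - target).abs().mean().item()
```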

Key findings are reported in multiple ways. First, the paper claims DUET achieves state-of-the-art performance and provides a headline comparison: compared with the second-best baseline PDF, DUET reduces MSE by 7.1% and MAE by 6.5% (averaged across the reported evaluation setting). Second, the ablation study quantifies the contribution of each module. On ETTh2, removing TCM (w/o TCM) increases MSE from 0.334 to 0.344 and MAE from 0.383 to 0.391; removing CCM (w/o CCM) yields MSE 0.343 and MAE 0.391; replacing masked attention with full attention (Full Attention) also degrades to MSE 0.344 and MAE 0.389. On ETTm2, DUET’s MSE/MAE are 0.247/0.307, while w/o TCM gives 0.256/0.310 and w/o CCM gives 0.256/0.312. On Weather, DUET achieves 0.218/0.252, while w/o TCM is 0.225/0.255 and w/o CCM is 0.232/0.262. These results support that both temporal clustering and channel clustering are needed, and that masked attention is important for robustness.

Third, the paper provides domain- and horizon-level evidence from the main results table. For example, on ETTh2 at one reported horizon, DUET reports MSE 0.270 and MAE 0.336, outperforming PDF (0.276/0.341), iTransformer (0.297/0.348), and others; at a longer horizon, DUET is 0.382/0.425 versus PDF’s 0.398/0.433 and iTransformer’s 0.424/0.444. On Traffic, DUET is 0.360/0.238 versus PDF’s 0.368/0.252 and PatchTST’s 0.363/0.265; on Solar, DUET is 0.169/0.195 versus PDF’s 0.181/0.247 and iTransformer’s 0.190/0.244. Across these examples, DUET consistently yields lower errors.

Fourth, the authors emphasize improvements specifically for non-stationary settings. They report that compared to the state-of-the-art Non-stationary Transformer (Stationary) model, DUET reduces MSE by 32.4% and MAE by 21.7% on non-stationary time series modeling, attributing this to TCM’s explicit modeling of heterogeneous temporal patterns.

The paper also includes additional analyses that strengthen the interpretation. It compares different distance metrics for CCM: using Euclidean, cosine, DTW, or random masks worsens performance relative to the learnable Mahalanobis metric (Table 4). It also studies the number of temporal pattern extractors and finds that using too few is inferior, while the best number varies by domain (ETTh1/ETTh2, ILI, and Exchange each have different optima). Finally, it provides qualitative visualizations of distribution router weights and channel mask patterns, showing that samples with similar latent distribution characteristics receive similar routing weights and that CCM produces soft groups of channels based on frequency-domain similarity.

Limitations are not deeply formalized in the provided text, but several apparent constraints follow from the methodology. First, DUET introduces additional components (multiple pattern extractors, distribution router, metric learning, and stochastic mask sampling), which may increase tuning complexity and computational overhead relative to simpler baselines, even though the fusion attention is masked. Second, the paper’s clustering assumptions rely on latent distribution normality and on the usefulness of frequency-domain metric learning; performance may degrade if the frequency representation is less informative for certain domains or if the learned metric overfits. Third, the evaluation is benchmark-based and uses specific look-back and horizon settings; generalization to arbitrary forecasting setups (different sampling rates, missingness patterns, or extreme distribution shifts) is not directly quantified in the excerpt.

Practically, DUET is most relevant to practitioners and researchers who need robust multivariate forecasting under non-stationarity and noisy inter-variable relationships—e.g., forecasting traffic and energy systems with regime changes, weather/environmental prediction with complex sensor correlations, and finance/health/web analytics where temporal patterns drift and some channels may be weak or noisy. The dual clustering design suggests a reusable blueprint: explicitly route temporal segments to specialized extractors based on latent distribution characteristics, and learn sparse, task-relevant channel interactions rather than assuming either full connectivity or strict hard clustering. Researchers may also care because DUET provides a concrete mechanism (TCM+CCM+masked attention) that can be integrated into other forecasting backbones, and because the frequency-domain metric learning and soft sparsification offer an interpretable way to reason about which channels help each other.

Overall, DUET’s central contribution is the explicit dual-clustering framework that jointly addresses temporal heterogeneity from TDS and channel-interaction complexity via frequency-domain soft clustering, yielding consistent improvements across 25 datasets and particularly large gains on non-stationary benchmarks.

Cornell Notes

DUET is a dual-clustering framework for multivariate time series forecasting that explicitly models temporal distribution shift by routing each channel to distribution-specific linear pattern extractors (TCM), and models channel dependencies by learning a sparse frequency-domain channel mask via metric learning and soft clustering (CCM). Experiments on 25 real-world datasets show DUET achieves state-of-the-art forecasting accuracy, with ablations confirming that both temporal and channel clustering plus masked attention are essential.

What is the main research problem DUET targets?

Accurate multivariate time series forecasting under (1) temporal distribution shift causing heterogeneous temporal patterns and (2) complex, noisy cross-channel correlations that are hard to model flexibly.

What is DUET’s core idea?

Apply dual clustering: cluster time series into latent temporal distribution groups (TCM) and softly cluster channels in the frequency domain to produce a sparse channel mask (CCM), then fuse with masked attention (FM).

What study design and evaluation protocol does the paper use?

Benchmark-based empirical evaluation on 25 real-world datasets across 10 domains, using the TFB codebase for unified evaluation and reporting MSE/MAE across multiple forecasting horizons and tuned look-back windows.

How does TCM perform temporal clustering and routing?

For each channel’s univariate series, a distribution router encodes mean and variance parameters for the candidate latent distributions, uses noisy gating to select the top distributions, and routes the series to the corresponding linear-based pattern extractors; the outputs are aggregated using the gate weights.

How do the linear-based pattern extractors model heterogeneous temporal patterns?

Each extractor decomposes the series into trend (moving average) and seasonal (residual) components and applies separate linear transformations to each, producing a feature vector that is specialized to the routed latent distribution.

How does CCM model channel relationships?

It computes real-FFT representations per channel, learns a Mahalanobis distance metric in frequency space, converts distances to keep-probabilities, and samples a sparse channel mask via Bernoulli resampling with Gumbel-softmax for differentiability.

What role does the Fusion Module play?

FM uses masked attention: the learned channel mask from CCM is applied to the attention scores, so each channel attends only to its selected (sparsified) channels when combining temporal features.

What do the ablation results show about module importance?

Removing TCM or CCM degrades performance. For example, on ETTh2 DUET improves over w/o TCM (MSE 0.334 vs 0.344) and w/o CCM (MSE 0.334 vs 0.343), and replacing masked attention with full attention also worsens results (ETTh2 MSE 0.334 vs 0.344).

What headline performance gains does the paper report?

Compared with the second-best baseline PDF, DUET reduces MSE by 7.1% and MAE by 6.5%. For non-stationary modeling, it reports a 32.4% MSE and 21.7% MAE reduction versus the Non-stationary Transformer (Stationary) model.

Review Questions

  1. How does DUET’s TCM differ from implicit handling of non-stationarity in prior transformer-based methods?

  2. Explain how CCM’s learnable Mahalanobis metric in frequency space leads to a sparse mask and how that mask changes attention computation.

  3. Which ablation comparisons most directly support the claim that masked attention (CSC) is necessary, and what are the reported metric changes?

  4. What assumptions does DUET make about latent temporal distributions (e.g., normality) and about the usefulness of frequency-domain similarity for channel clustering?

  5. How would you expect DUET to behave if channel correlations are weak or if the data are nearly stationary? Use the paper’s ablation and discussion to justify your answer.

Key Points

  1. DUET addresses two coupled difficulties in MTSF: temporal distribution shift (heterogeneous temporal patterns) and complex/noisy channel interdependencies.

  2. TCM explicitly clusters each channel’s univariate series into latent distribution groups using a distribution router with noisy gating, then routes to multiple linear-based trend/seasonality extractors.

  3. CCM learns cross-channel relationships in the frequency domain via a learnable Mahalanobis distance metric, converts distances to probabilities, and sparsifies them into a learned channel mask using Gumbel-softmax.

  4. FM fuses temporal features using masked attention, so channel interactions are restricted to the sparsified, task-relevant neighbors from CCM.

  5. On ETTh2, DUET achieves MSE 0.270 and MAE 0.336 at one reported horizon, outperforming key baselines such as PDF (0.276/0.341) and iTransformer (0.297/0.348).

  6. Ablations confirm both modules matter: on ETTh2, removing TCM or CCM increases MSE from 0.334 to about 0.343–0.344; full attention (no masking) also degrades performance.

  7. DUET reports strong gains on non-stationary data: 32.4% MSE and 21.7% MAE reduction versus the Non-stationary Transformer (Stationary) model, attributed to TCM’s explicit heterogeneity modeling.

Highlights

“DUET achieves an impressive 7.1% reduction in MSE and a 6.5% reduction in MAE” compared with the second-best baseline PDF.
On ETTh2 (ablation), DUET’s MSE/MAE are 0.334/0.383, while w/o TCM is 0.344/0.391 and w/o CCM is 0.343/0.391.
On ETTh2 at one reported horizon, DUET reports MSE 0.270 and MAE 0.336 (best among compared models in the table excerpt).
“DUET achieves a significant reduction of 32.4% in MSE and 21.7% in MAE” versus the Non-stationary Transformer (Stationary) model on non-stationary time series modeling.
CCM’s channel mask is learned by sampling from the learned channel-relationship probabilities with Gumbel-softmax reparameterization, enabling sparse, differentiable channel interactions.

Topics

  • Multivariate time series forecasting
  • Non-stationary time series modeling
  • Temporal distribution shift (TDS)
  • Mixture-of-experts / routing mechanisms
  • Frequency-domain representation learning
  • Metric learning
  • Sparse attention and masked attention
  • Channel dependency modeling
  • Benchmarking and evaluation in time series forecasting

Mentioned

  • DUET (code repository: https://github.com/decisionintelligence/DUET)
  • TFB (Time Series Forecasting Benchmark)
  • PyTorch
  • NumPy (implied by FFT usage)
  • NVIDIA Tesla A800 GPU
  • Adam optimizer
  • Xiangfei Qiu
  • Xingjian Wu
  • Yan Lin
  • Chenjuan Guo
  • Jilin Hu
  • Bin Yang
  • MTSF - Multivariate Time Series Forecasting
  • TDS - Temporal Distribution Shift
  • TCM - Temporal Clustering Module
  • CCM - Channel Clustering Module
  • CSC - Channel-Soft-Clustering
  • CHC - Channel-Hard-Clustering
  • CI - Channel-Independent
  • CD - Channel-Dependent
  • FM - Fusion Module
  • FFT - Fast Fourier Transform
  • rFFT - Real Fast Fourier Transform
  • MAE - Mean Absolute Error
  • MSE - Mean Squared Error
  • TFB - Time Series Forecasting Benchmark
  • VAE - Variational Autoencoder
  • OOM - Out-Of-Memory
  • SOTA - State of the Art