
Microsoft Research Lab – Asia

TimeDP: Creating cross-domain synthetic time-series data


Time-series data—measurements collected over time like stock prices or heart rates—plays a vital role in AI forecasting systems across industries. As these systems advance, the need for time-series data is increasing, especially synthetic data, which offers numerous advantages over real-world data. In healthcare, synthetic data protects patient privacy; in finance, it enables risk-free testing of investment strategies.

However, generating high-quality synthetic time-series data that works across different domains is challenging for many AI models. Most models are trained on single-domain data and depend on labels or user-provided descriptions. These labels often oversimplify complex time-series patterns, like seasonality, volatility, and trend shifts, and tend to be too domain-specific to transfer to other fields. Additionally, time-series data is usually labeled manually, a process that is costly, time-consuming, and frequently constrained by privacy concerns, making traditional methods difficult to scale.

To address these limitations, researchers at Microsoft Research Asia developed TimeDP, a diffusion-based model that generates high-quality time-series data with strong cross-domain generalization. Unlike previous approaches, TimeDP operates without labeled data or predefined styles. Instead, it learns from just a few sample sequences, capturing the underlying patterns and generating synthetic data that matches the target domain.

Example-driven generation: A flexible new approach

TimeDP introduces a novel example-driven method for generating synthetic data that eliminates the need for explicit input descriptions. Users simply provide a few time-series examples from the target domain, and the model uses them to guide the generation process, with no manual labeling required.

At the heart of this approach is a Prototype Assignment Module (PAM), a method that extracts key characteristics from the examples and constructs domain prompts representing the style and structure of the target time series. These prompts guide the model during output generation, enabling it to produce domain-consistent data, even in zero-shot or few-shot scenarios.

This method offers several advantages:

  • No need for users to input detailed instructions about patterns
  • Ability to generate high-quality output in previously unseen domains
  • Reduced data collection and annotation costs

Time-series prototypes: Building blocks of cross-domain generalization

TimeDP’s innovation lies in its use of time-series prototypes—modular, reusable patterns that capture the fundamental characteristics of time-series data, like trends, fluctuations, or periodicity. These prototypes act like vocabulary words in a language model, representing the core “style” elements of various domains. This structural similarity between language components and time-series prototypes is illustrated in Figure 1.

Figure 1. Time-series prototypes form domain prompts that describe time-series styles (right), similar to how prompts guide outputs in language models (left).

TimeDP leverages these building blocks to create domain prompts, enabling it to generate data tailored to new domains without requiring labeled training data. The model architecture, shown in Figure 2, includes three core components:

  • Time-series prototypes: Capture core elements like seasonal trends or volatility, allowing the model to flexibly combine patterns to synthesize domain-specific data.
  • PAM: Assigns relevant prototypes to input samples, helping the model adapt to new domains during the training and generation phases.
  • Cross-domain prompts: Derived automatically from a few examples, these prompts guide generation without the need for manually provided labels.
Figure 2. The TimeDP model framework
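To make the roles of the three components more concrete, here is a minimal sketch in Python. It is not TimeDP's implementation. It assumes that prototypes are plain vectors, that assignment is a softmax over dot-product similarity, and that the domain prompt is the set of prototypes scaled by their average weight across the examples. The names `assign_prototypes` and `build_domain_prompt` are hypothetical, and the random arrays stand in for learned prototypes and real example sequences.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def assign_prototypes(examples, prototypes, temperature=1.0):
    """Hypothetical assignment step: one weight per prototype for each example.

    examples:   (n_examples, length) array of example sequences
    prototypes: (n_prototypes, length) array of prototype vectors
    Returns an (n_examples, n_prototypes) array whose rows sum to 1.
    """
    # Assumption: similarity is a plain dot product.
    scores = examples @ prototypes.T / temperature
    return softmax(scores, axis=-1)

def build_domain_prompt(examples, prototypes):
    """Hypothetical prompt construction from a few unlabeled examples."""
    weights = assign_prototypes(examples, prototypes)
    # Average over the examples to get one weight per prototype for the domain.
    domain_weights = weights.mean(axis=0)
    # Assumption: the prompt is each prototype scaled by its domain weight.
    prompt = domain_weights[:, None] * prototypes
    return prompt, domain_weights

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(8, 24))   # stand-in for learned prototypes
examples = rng.normal(size=(5, 24))     # stand-in for a few target-domain samples

prompt, domain_weights = build_domain_prompt(examples, prototypes)
print(prompt.shape)           # one scaled prototype per row
print(domain_weights.sum())   # the domain weights form a distribution
```

In a full system, a prompt like this would condition the diffusion model during sampling. That step is omitted here because the post does not describe it in enough detail to sketch.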

Evaluating TimeDP: Validating consistency across multiple domains

To test the model’s effectiveness, researchers evaluated TimeDP on 12 real-world datasets spanning four domains: energy, transportation, weather, and finance. Using evaluation metrics including Maximum Mean Discrepancy (MMD) and Kullback-Leibler (KL) divergence, which measure the similarity between synthetic and real data, the team compared TimeDP’s output to both real-world data and the outputs of other state-of-the-art models.
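For readers unfamiliar with these two metrics, the sketch below shows one common way to compute them with NumPy. The post does not say which kernel or density estimate the team used, so the choices here are assumptions: an RBF kernel with a hand-picked bandwidth for MMD, and histograms of the pooled values for KL divergence. The function names and the synthetic arrays are illustrative only. Lower values mean the synthetic data is closer to the real data.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    # Pairwise squared Euclidean distances between rows of a and rows of b.
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd_squared(real, synth, bandwidth):
    """Biased estimate of squared MMD between two sets of sequences."""
    k_rr = rbf_kernel(real, real, bandwidth).mean()
    k_ss = rbf_kernel(synth, synth, bandwidth).mean()
    k_rs = rbf_kernel(real, synth, bandwidth).mean()
    return float(k_rr + k_ss - 2.0 * k_rs)

def kl_divergence(real, synth, bins=20, eps=1e-8):
    """KL(real || synth) between histograms of all values in each set."""
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synth, bins=bins, range=(lo, hi))
    # Smooth the counts so empty bins do not produce log(0) or division by zero.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 24))    # stand-in for real sequences
close = rng.normal(0.0, 1.0, size=(200, 24))   # same distribution as "real"
far = rng.normal(3.0, 1.0, size=(200, 24))     # shifted distribution

bw = np.sqrt(real.shape[1])  # assumption: bandwidth scaled to sequence length
print(mmd_squared(real, close, bw), mmd_squared(real, far, bw))
print(kl_divergence(real, close), kl_divergence(real, far))
```

The bandwidth matters in practice. With a bandwidth that is too small for the sequence length, almost all kernel values collapse toward zero and MMD stops distinguishing good samples from bad ones.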

The results, shown in Table 1, are impressive. In intra-domain scenarios, where training and test data come from the same domain, TimeDP reduced MMD by an average of 25.9% and KL divergence by 53.0%, indicating a strong alignment between generated and real data and significantly outperforming baseline models.

Table 1. Intra-domain generation results

TimeDP also excels in unseen domains—those to which the model had no prior exposure during training. With just a few samples—and without any fine-tuning—TimeDP generated data that closely mirrored the statistical properties of real datasets. It outperformed fine-tuned baseline models, demonstrating robust generalization capabilities. The results are shown in Table 2.

Table 2. Generation results for unseen domains

TimeDP and the future of synthetic time-series data

As demand grows for high-quality time-series data across industries, synthetic data offers a practical solution to challenges like privacy protection and data scarcity. By generating artificial data that preserves statistical patterns, TimeDP protects sensitive information, especially in fields like healthcare. Its ability to learn from a small number of unlabeled examples reduces reliance on large, labeled datasets that are often costly or difficult to obtain, making it particularly valuable in low-resource or privacy-sensitive settings.

Future research will focus on expanding TimeDP’s capabilities by incorporating domain knowledge, responding to user input through natural language, and adapting to shifting data environments. As part of a broader move toward more general-purpose synthetic time-series generation tools, TimeDP marks a promising step in supporting AI development across diverse and dynamic domains.
