TimesFM-2.5 Leakage Analysis: A Detailed Comparison With TiReX

Alex Johnson

Introduction

First and foremost, a huge thank you for developing such a valuable benchmark. This article addresses a critical clarification regarding the leakage column for TimesFM-2.5, a topic that has sparked considerable discussion within the community. Understanding data leakage is paramount when evaluating the true performance of machine learning models, particularly in time series forecasting. This analysis dissects the discrepancy in reported leakage percentages between TimesFM-2.5 and TiReX, explores the datasets contributing to the difference, and aims to ensure fair evaluation and rankings.

Understanding the Leakage Discrepancy: TimesFM-2.5 vs. TiReX

The core of the discussion revolves around the reported leakage percentages for TimesFM-2.5 and TiReX. According to the leaderboard, TiReX exhibits a leakage of 1%, while TimesFM-2.5 shows a significantly higher leakage of 8%. This variance raises questions, especially given the datasets used to pretrain these models. Both models leverage the GiftEvalPretrain dataset, but TimesFM-2.5 also incorporates Wikimedia Pageviews (cutoff November 2023), Google Trends top queries (cutoff end of 2022), and synthetic and augmented data, whereas TiReX uses chronos_datasets (excluding Zero Shot Benchmark data), a subset of GiftEvalPretrain, and synthetic data. Given that GiftEvalPretrain is common to both, the gap in leakage percentages warrants closer examination. The primary question is: what accounts for the 7-percentage-point difference in leakage between these two models? Answering it matters because leakage can significantly shift the evaluation rankings, particularly if scores on the leakage-prone datasets are replaced by chronos-bolt numbers.
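To make the two leaderboard figures concrete, the leakage column can be read as the share of evaluation datasets flagged as overlapping a model's pretraining data. The sketch below illustrates that arithmetic only; the dataset names and counts are invented placeholders, not the actual GIFT-Eval composition.

```python
# Illustrative leakage-rate arithmetic: the percentage of evaluation
# datasets flagged as overlapping a model's pretraining corpus.
# All names and counts below are hypothetical placeholders.

def leakage_rate(eval_datasets, flagged_datasets):
    """Share of evaluation datasets flagged as leaked, as a percentage."""
    leaked = set(flagged_datasets) & set(eval_datasets)
    return 100.0 * len(leaked) / len(eval_datasets)

eval_sets = [f"ds_{i}" for i in range(100)]      # pretend 100 eval datasets
tirex_flagged = ["ds_0"]                         # matches the 1% figure
timesfm_flagged = [f"ds_{i}" for i in range(8)]  # matches the 8% figure

print(leakage_rate(eval_sets, tirex_flagged))    # 1.0
print(leakage_rate(eval_sets, timesfm_flagged))  # 8.0
```

Under this reading, the 7-point gap corresponds to seven additional flagged evaluation datasets, which is why identifying them individually is the natural next step.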

Datasets Used by TimesFM-2.5 and TiReX

To better understand the potential sources of leakage, let's dissect the datasets used by both models:

  1. TimesFM-2.5:

    • GiftEvalPretrain: A dataset designed for evaluating time series forecasting models.
    • Wikimedia Pageviews: Data on page views from Wikipedia, cutoff November 2023.
    • Google Trends: Top search queries from Google Trends, cutoff end of year 2022.
    • Synthetic and augmented data: Artificially generated data to enhance model training.
  2. TiReX:

    • chronos_datasets: A collection of diverse time series datasets (excluding Zero Shot Benchmark data for training).
    • GiftEvalPretrain (subset): A portion of the GiftEvalPretrain dataset, rather than the full corpus.
    • Synthetic Data: Artificially generated data.

The key difference lies in the inclusion of Wikimedia Pageviews and Google Trends in TimesFM-2.5's training data, which are not present in TiReX. These datasets, while valuable for capturing real-world trends and patterns, also have the potential to introduce leakage if not handled carefully. Leakage occurs when information from the future inadvertently influences the model's training, leading to artificially inflated performance on the validation and test sets. In the context of time series forecasting, this often happens when the training data contains information that would not be available at the time of prediction. For instance, using future page views or search trends to predict past values would constitute leakage.
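The temporal-overlap condition described above can be sketched as a simple date comparison: a pretraining source can leak into evaluation if its coverage extends past the start of the evaluation window. The cutoff dates below follow the ones stated in this article, but the evaluation window is a hypothetical example, not the benchmark's actual split.

```python
from datetime import date

# Minimal sketch of a temporal-leakage check: training data leaks if its
# coverage reaches into the evaluation window. The evaluation start date
# here is an assumed example, not the benchmark's real split point.

def leaks(train_end: date, eval_start: date) -> bool:
    """True if training coverage extends into the evaluation window."""
    return train_end >= eval_start

eval_start = date(2023, 7, 1)  # hypothetical evaluation window start

# Wikimedia Pageviews cutoff (November 2023): overlap, potential leakage.
print(leaks(date(2023, 11, 30), eval_start))  # True

# Google Trends cutoff (end of 2022): no overlap with this window.
print(leaks(date(2022, 12, 31), eval_start))  # False
```

The real audit is of course more involved (per-series timestamps, not a single corpus-level cutoff), but this is the core predicate being debated.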

Identifying the Seven Datasets in TimesFM-2.5's Pretraining Corpus Not Shared With TiReX

The central question remains: which seven datasets present in TimesFM-2.5's pretraining corpus are absent in TiReX, and how do these datasets contribute to the 8% leakage? Pinpointing these specific datasets is essential for a transparent evaluation process. It's possible that certain datasets within Wikimedia Pageviews or Google Trends contain information that overlaps with the evaluation period, thereby causing leakage. For example, if the evaluation set includes data points within the time frame covered by the Google Trends data (cutoff EoY 2022), the model could be leveraging future information to make predictions. Further investigation into the specific subsets of Wikimedia Pageviews and Google Trends used in TimesFM-2.5's pretraining is necessary. Additionally, the synthetic and augmented data used in both models needs scrutiny. If the synthetic data generation process incorporates future information, it could also contribute to leakage. A clear understanding of these datasets and their potential for leakage will ensure a more accurate and fair comparison between TimesFM-2.5 and TiReX.
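The first step of that investigation is mechanical: take the set difference of the two pretraining manifests. The sketch below shows that operation with coarse placeholder labels; the actual manifests are far more granular, and the real answer requires the maintainers' dataset-level lists.

```python
# Hypothetical sketch of isolating pretraining sources unique to
# TimesFM-2.5 via set difference. The entries are coarse placeholder
# labels, not the actual dataset manifests of either model.

timesfm_corpus = {
    "GiftEvalPretrain",
    "WikimediaPageviews",
    "GoogleTrends",
    "synthetic_augmented",
}
tirex_corpus = {
    "chronos_datasets",
    "GiftEvalPretrain",
    "synthetic",
}

unique_to_timesfm = timesfm_corpus - tirex_corpus
print(sorted(unique_to_timesfm))
```

At this coarse granularity the difference is just the Wikimedia and Google Trends sources plus the augmentation pipeline; the seven flagged datasets would presumably be specific series within those sources.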

Impact on Evaluation Rankings

The presence of data leakage has a significant impact on the evaluation rankings of forecasting models. Models that inadvertently leverage future information during training may exhibit inflated performance metrics, leading to an overestimation of their true predictive capabilities. In the context of this benchmark, the 8% leakage reported for TimesFM-2.5 raises concerns about the validity of its current ranking. If a substantial portion of the model's performance is attributable to leakage, its true performance on unseen data may be lower than what the leaderboard suggests. This is where the potential replacement of leakage-prone datasets with chronos-bolt numbers becomes crucial. By excluding datasets that contribute to leakage, the evaluation process can focus on assessing the models' ability to generalize to new data, providing a more accurate reflection of their real-world performance. The decision to replace these datasets is not merely about adjusting rankings; it's about ensuring that the benchmark accurately reflects the capabilities of different forecasting techniques. A fair evaluation process encourages the development of robust models that can genuinely learn from historical data without relying on future information.
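The proposed adjustment can be sketched concretely: on datasets flagged as leaked for a model, substitute the reference model's score (here labelled chronos-bolt) before aggregating. The scores, dataset names, and simple-mean aggregation below are invented for illustration; the benchmark's real aggregation may differ.

```python
# Sketch of the proposed ranking adjustment: replace a model's scores on
# leaked datasets with a reference model's scores, then re-aggregate.
# Scores, names, and the simple-mean aggregation are illustrative only.

def adjusted_mean(scores, reference_scores, leaked):
    """Mean per-dataset score, with leaked entries patched from the
    reference model (lower is better for error-style metrics)."""
    patched = {
        ds: (reference_scores[ds] if ds in leaked else s)
        for ds, s in scores.items()
    }
    return sum(patched.values()) / len(patched)

timesfm_scores = {"ds_a": 0.70, "ds_b": 0.55, "ds_c": 0.60}
chronos_bolt_scores = {"ds_a": 0.80, "ds_b": 0.65, "ds_c": 0.72}
leaked = {"ds_b"}  # the suspiciously strong, possibly leaked dataset

print(adjusted_mean(timesfm_scores, chronos_bolt_scores, leaked))
```

Note how the patched aggregate rises (worsens, for an error metric) relative to the raw mean: this is exactly the mechanism by which removing leakage-prone scores can reshuffle the leaderboard.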

Inference Speed Improvements

In addition to the discussion on data leakage, it is worth highlighting the improvements made to TimesFM-2.5's inference speed. Addressing the inference speed issues previously raised, the team has implemented optimizations that yield a 7x speedup. This directly translates to faster prediction times and more efficient model deployment. Inference speed is a critical factor in many real-world applications, particularly those requiring real-time forecasting or handling large volumes of data, and a model that is both accurate and fast can deliver timely, actionable insights. These optimizations make TimesFM-2.5 more practical and more competitive across a range of forecasting tasks, reflecting a commitment not only to theoretical performance but also to real-world usability.
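For readers who want to verify a speedup figure like this themselves, the standard approach is to wall-clock the same workload under the old and optimized paths and take the ratio. The two "predict" functions below are cheap stand-ins, not TimesFM code, so the printed ratio is only a demonstration of the measurement method.

```python
import time

# Toy illustration of measuring an inference speedup: time the same
# batch workload under a baseline and an optimized path, best of a few
# repetitions. Both predict functions are stand-ins, not TimesFM code.

def time_inference(predict, batches, reps=3):
    """Best-of-reps wall-clock seconds to run predict over all batches."""
    best = float("inf")
    for _ in range(reps):
        start = time.perf_counter()
        for batch in batches:
            predict(batch)
        best = min(best, time.perf_counter() - start)
    return best

def slow_predict(batch):
    return sum(x * x for x in batch)  # deliberately heavier baseline

def fast_predict(batch):
    return None  # stand-in for an optimized path

batches = [list(range(1000))] * 50
t_slow = time_inference(slow_predict, batches)
t_fast = max(time_inference(fast_predict, batches), 1e-9)  # guard div-by-zero
speedup = t_slow / t_fast
print(f"speedup: {speedup:.1f}x")
```

Best-of-N timing with `time.perf_counter` reduces noise from scheduler jitter; for real model benchmarks one would also fix batch sizes, hardware, and warm-up runs before trusting a headline multiplier.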

Implications of Faster Inference Speed

The 7x improvement in inference speed has significant implications for TimesFM-2.5's usability in real-world scenarios. Faster inference directly reduces computational cost, making the model more accessible for resource-constrained environments and large-scale deployments. In industries such as finance, supply chain management, and energy forecasting, where timely predictions are critical, this speed boost can be decisive: a financial institution running frequent forecasts can act on them sooner, and a supply chain manager can respond more rapidly to changes in demand, optimizing inventory levels and reducing costs. Faster inference also enables quicker experimentation and iteration during model development, since researchers and practitioners can test different configurations and hyperparameters in less time; this is particularly important in a field where new techniques and algorithms are constantly emerging. By addressing the inference-speed bottleneck, the TimesFM-2.5 team has substantially increased the model's practical value and its potential for adoption across industries.

Conclusion

In summary, the discussion surrounding data leakage in TimesFM-2.5 highlights the complexities of evaluating time series forecasting models. Identifying the specific datasets contributing to the 8% leakage and understanding their impact on model performance is crucial for ensuring fair and accurate comparisons. The clarifications sought regarding the 7 datasets not shared with TiReX will significantly contribute to this effort. Moreover, the remarkable 7x improvement in inference speed demonstrates a commitment to practical usability, making TimesFM-2.5 a more competitive and valuable tool. By addressing both accuracy and efficiency, the developers have made significant strides in advancing the field of time series forecasting. Further investigation into the leakage issue and the continued pursuit of performance enhancements will undoubtedly contribute to the development of even more robust and reliable forecasting models in the future. This ongoing dialogue and collaborative effort within the community are essential for advancing the state-of-the-art in machine learning. By addressing these challenges head-on, we can build more trustworthy and effective forecasting systems that benefit a wide range of applications.

For further information, consider exploring introductory resources on data leakage in machine learning.
