Description
Is your feature request related to a current problem? Please describe.
Currently the `min_train_series_length` of regression models is defined as:

```python
max(3, -self.lags["target"][0] + self.output_chunk_length)
```
However, the minimum required length of the series is not necessarily the same for every model inheriting from `RegressionModel`. In particular, I believe that for LightGBM and CatBoost the `min_train_series_length` should be:

```python
max(3, -self.lags["target"][0] + self.output_chunk_length + 1)
```
The reason is that without the `+ 1`, a series of exactly `min_train_series_length` points produces only a single training sample, whereas LightGBM and CatBoost require at least two samples when calling `fit()`. Below is an illustration of how this plays out for LightGBM, using the following example series:
| date    | quantity |
|---------|----------|
| 2021-01 | 10       |
| 2021-02 | 8        |
| 2021-03 | 14       |
| 2021-04 | 7        |
| 2021-05 | 6        |
| 2021-06 | 5        |
With `self.lags["target"][0] = -4` and `self.output_chunk_length = 2`, this results in a `min_train_series_length` of `max(3, 4 + 2) = 6`.
`df_X_y`, created in the function `_create_lagged_data`, then contains only 1 sample, as all rows with NaN values (introduced by the lagging) are removed.
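The sample-count arithmetic above can be sketched with a small helper. The function name is hypothetical and only illustrates how many complete lagged windows fit into a series of a given length:

```python
def n_training_samples(series_length: int, min_lag: int, output_chunk_length: int) -> int:
    """Number of complete (features, target) windows in a series.

    Each training sample consumes `-min_lag` past points for the lagged
    features plus `output_chunk_length` future points for the targets.
    """
    window = -min_lag + output_chunk_length  # points consumed per sample
    return max(0, series_length - window + 1)

# With lags down to -4 and output_chunk_length = 2, a series of length 6
# (the current min_train_series_length) yields exactly one sample:
print(n_training_samples(6, -4, 2))  # 1
# One extra point would yield the two samples LightGBM/CatBoost need:
print(n_training_samples(7, -4, 2))  # 2
```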
LightGBM's scikit-learn-style input check, `_LGBMCheckXY(X, y, accept_sparse=True, force_all_finite=False, ensure_min_samples=2)`, will throw an error for this example.
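That check can be reproduced in isolation. I believe `_LGBMCheckXY` is an alias for scikit-learn's `check_X_y`, so the sketch below calls `check_X_y` directly with the same `ensure_min_samples=2` constraint on a single-sample input (the feature values are taken from the example series above):

```python
import numpy as np
from sklearn.utils import check_X_y

# One training sample: four lag features and one target value.
X = np.array([[10.0, 8.0, 14.0, 7.0]])
y = np.array([6.0])

try:
    check_X_y(X, y, ensure_min_samples=2)
except ValueError as e:
    # scikit-learn rejects the input because only 1 sample is present
    print(e)
```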
For random forest the required minimum number of samples is the default of 1, but for CatBoost it is effectively also 2: although no explicit check is performed, fitting CatBoost on a single training sample fails because (I believe) it cannot handle the situation in which all features are constant, which is necessarily the case when only a single training sample is passed.
Describe proposed solution
Override `min_train_series_length` in `gradient_boosted_model` and `catboost_model` as:

```python
max(3, -self.lags["target"][0] + self.output_chunk_length + 1)
```
Describe potential alternatives
Additional context