Prerequisites
It is not required that the reader knows about time series analysis or forecasting. However, it is assumed that he or she has experience developing machine learning models (at any level) and handling basic statistical concepts.
From Machine Learning to Time Series Forecasting
Moving from machine learning to time-series forecasting is a radical change — at least it was for me. As a data scientist for SAP Digital Interconnect, I worked for almost a year developing machine learning models. It was a challenging, yet enriching, experience that gave me a better understanding of how machine learning can be applied to business problems.
Soon after, an opportunity to apply predictive modeling to financial forecasting fell in my lap. Without any prior experience, I had to adapt quickly in order to learn how to solve the problems presented to me. Looking back on that experience made me want to share some tips that can help you make the same transition.
Introduction to Time Series
The objective of a predictive model is to estimate the value of an unknown variable. A time series has time (t) as an independent variable (in any unit you can think of) and a target dependent variable (y). The output of the model is the predicted value for y at time t.
In most cases, a prediction is a specific value, e.g., the kind of object in a picture, the value of a house, whether an email is spam or not, etc. A forecast, however, is a prediction (representing the median or mean) that includes a confidence interval expressing the level of certainty. Usually, both the 80% and 95% confidence levels are provided.
Whenever data is recorded at regular intervals of time, it is called a time series. You can think of this type of variable in two ways:
The data is univariate, but it has an index (time) that creates an implicit order; or
The dataset has two dimensions: the time (independent variable) and the variable itself as dependent variable.
If you have experience working in machine learning, you must make some adjustments when working with time series. Below are seven key differences to keep in mind when making the transition.
1. Features should be handled with care.
As a machine learning practitioner, you may already be used to creating features, either manually (feature engineering) or automatically (feature learning). Either way, creating features is one of the most important and time-consuming tasks in applied machine learning.
However, in time series forecasting, you don’t create features — at least not in the traditional sense. This is especially true when you want to forecast several steps ahead, and not just the following value.
This does not mean that features are completely off limits. Instead, they should be used with care because of the following reasons:
It is not clear what the future real values will be for those features.
If the features are predictable, i.e., they have some patterns, you can build a forecast model for each of them. However, keep in mind that using predicted values as features will propagate the error to the target variable, which may cause higher errors or produce biased forecasts.
A pure time series model may have similar or even better performance than one using features.
Besides, some forecasting models are only based on historical values of the variable, like Exponential Smoothing (ETS) and Autoregressive Integrated Moving Average (ARIMA) models.
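To make the idea of a model that relies only on historical values concrete, here is a minimal sketch of simple exponential smoothing, the most basic member of the ETS family. The function name, the choice to initialise the level with the first observation, and the sample values are all assumptions made for illustration; a real project would normally use a forecasting library that also estimates the smoothing parameter.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecast from simple exponential smoothing.

    The level is initialised with the first observation (an assumption;
    other initialisations are possible).
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for y in series[1:]:
        # New level: weighted average of the latest value and the old level.
        level = alpha * y + (1 - alpha) * level
    return level


# Hypothetical monthly values, used only for illustration.
history = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
forecast = simple_exponential_smoothing(history, alpha=0.5)
print(f"Next-period forecast: {forecast:.2f}")
```

Note that no external feature is involved: the forecast is a weighted average of past observations, with weights that decay as the observations get older.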
It is also possible to combine time series with feature engineering using time series components and time-based features. The former refers to the properties (components) of a time series, and the latter refers to time-related features, which have definite patterns and can be calculated in a deterministic way. You can add them to any time-series model that can handle predictors. Some examples are:
Time Series Components
Trend: A trend exists when a series increases, decreases, or remains at a constant level with respect to time. Therefore, the time is taken as a feature.
Seasonality: This refers to the property of a time series that displays periodical patterns that repeat at a constant frequency (m). In the following example, you can observe a seasonal component with m = 12, which means that the periodical pattern repeats every twelve months. (Usually, to handle seasonality, time series models include seasonal variables as dummy features, using m - 1 binary variables to avoid perfectly collinear features.)
Cycles: Cycles are seasons that do not occur at a fixed rate. For example, in the time series below, the annual Canadian Lynx trappings display seasonal and cyclic patterns. These do not repeat at regular time intervals and may occur even if the frequency is 1 (m = 1).
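The trend and seasonality components described above can be turned into deterministic features. The sketch below builds a time index for the trend and m - 1 seasonal dummies. The function name is hypothetical, and it assumes the series starts at the first season of a cycle and that the last season serves as the baseline.

```python
def trend_and_seasonal_features(n_obs, m):
    """Build one feature row per time step: [trend, d_1, ..., d_(m-1)].

    The last season is the baseline and gets no dummy, so the dummies are
    never perfectly collinear with an intercept term.
    """
    rows = []
    for t in range(n_obs):
        season = t % m  # position within the seasonal cycle, 0..m-1
        dummies = [1 if season == k else 0 for k in range(m - 1)]
        rows.append([t + 1] + dummies)  # the time index acts as the trend
    return rows


# Two years of monthly data (m = 12): 1 trend column + 11 dummy columns.
features = trend_and_seasonal_features(n_obs=24, m=12)
print(len(features[0]))
print(features[11])  # baseline season: all dummies are zero
```

Because both the time index and the season are known in advance, these features can be computed for any future period without forecasting them first.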
Time Series Predictors
Dummy variables: Similar to how seasonality can be added as a binary feature, other features can be added in binary format to the model. You can add holidays, special events, marketing campaigns, whether a value is an outlier or not, etc. However, you should remember that these variables need to have definite patterns.
Number of days: These can be easily calculated even for future months/quarters and may affect forecasts, especially for financial data. Here you can include:
- Number of days
- Number of trading days
- Number of weekend days
- …and so on
Lagged values: You can include lagged values of the variable as predictors. Some models like ARIMA, Vector Autoregression (VAR), or Autoregressive Neural Networks (NNAR) work this way.
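Two of the predictors above, day counts and lagged values, are simple enough to sketch with the Python standard library. The function names and sample values are hypothetical. Counting actual trading days would also require an exchange holiday calendar, which is not modelled here, so the sketch reports plain weekdays instead.

```python
import calendar
from datetime import date


def day_counts(year, month):
    """Count calendar days, weekend days, and weekdays in a given month."""
    n_days = calendar.monthrange(year, month)[1]
    weekend_days = sum(
        1 for day in range(1, n_days + 1)
        if date(year, month, day).weekday() >= 5  # 5 = Saturday, 6 = Sunday
    )
    return {
        "days": n_days,
        "weekend_days": weekend_days,
        # Weekdays only; subtract market holidays to get trading days.
        "weekdays": n_days - weekend_days,
    }


def make_lagged(series, lags):
    """Pair each value with the earlier values at the requested lags."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        predictors = [series[t - k] for k in lags]
        rows.append((predictors, series[t]))
    return rows


print(day_counts(2024, 2))
print(make_lagged([10, 12, 15, 14, 18], lags=[1, 2]))
```

Day counts can be computed for any future month, whereas lagged values are only known one step at a time, which is why multi-step forecasts with lagged predictors feed earlier forecasts back in as inputs.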
Time series components are essential for analyzing the variable of interest: they help you understand its behavior, identify its patterns, and choose and fit an appropriate time-series model. Time series predictors, on the other hand, may help some models to recognize additional patterns and improve the quality of forecasts. Both time series components and features are key to interpreting the behavior of the time series, analyzing its properties, identifying possible causes, and more.
2. There may be smaller datasets.
You may be used to feeding thousands, millions, or billions of data points into a machine learning model, but this is not always the case with time series. In fact, you may be working with small- to medium-sized time series, depending on the frequency and type of variable.
At first glance, you might think that this is a drawback. But in reality, there are some benefits to having small- to medium-sized time series:
The datasets will fit in the memory of your computer.
In some cases, you can analyze the entire dataset, and not just a sample.
The length of the time series is convenient for making plots that can be graphically analyzed. This is a very important point, because we rely heavily on plot analyses in the time-series analysis step.
This does not mean that you will not be working with huge time series, but you must be prepared and able to handle smaller time series as well.
Any dataset that includes a time-related field can benefit from time-series analysis and forecasting. However, if you have a bigger dataset, a Time Series Database (TSDB) may be more appropriate. Some of these datasets come from events recorded with a timestamp, system logs, financial data, data obtained from sensors (IoT), etc. Since a TSDB works natively with time series, it is a great opportunity to apply time series techniques to large-scale datasets.
3. A different algorithmic approach is required.
One of the most important properties an algorithm needs in order to be considered a time-series algorithm is the ability to extrapolate patterns outside of the domain of training data. Many machine learning algorithms do not have this capability, as they tend to be restricted to a domain that is defined by training data. Therefore, they are not suited for time series, as the objective of time series is to project into the future.
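A small experiment illustrates the extrapolation problem. In the sketch below, a one-nearest-neighbor predictor stands in for machine learning models that can only return values seen during training (tree-based models behave similarly), while a least-squares line stands in for a model that can project a trend. Both are illustrative choices of mine, and the data is a made-up noiseless trend.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def nearest_neighbor_predict(xs, ys, x_new):
    """Return the target of the training point closest to x_new."""
    idx = min(range(len(xs)), key=lambda i: abs(xs[i] - x_new))
    return ys[idx]


train_t = list(range(1, 11))
train_y = [2.0 * t + 1.0 for t in train_t]  # hypothetical upward trend
slope, intercept = fit_line(train_t, train_y)

for t in (11, 15, 20):
    linear = slope * t + intercept
    neighbor = nearest_neighbor_predict(train_t, train_y, t)
    print(f"t={t}: linear={linear:.1f}, nearest neighbor={neighbor:.1f}")
```

For every future time step, the nearest-neighbor prediction stays at the last training value, while the line keeps following the trend.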
Another important property of a time series algorithm is the ability to derive confidence intervals. While this is a default property of time series models, most machine learning models do not have this ability because many of them are not based on statistical distributions. Confidence intervals can be estimated, but they may not be as accurate. This will be expanded on further in Section 6.
You may think that only simple statistical models are used for time-series forecasting. That is not true at all. There are many complex models or approaches that may be very useful in some cases. Generalized Autoregressive Conditional Heteroskedasticity (GARCH), Bayesian-based models, and VAR are only a few. There are also neural network models that can be applied to time series which use lagged predictors and can handle features, such as Neural Networks Autoregression (NNAR). There are even time-series models borrowed from deep learning, specifically in the RNN (Recurrent Neural Network) family, like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) networks.
However, many of these models lack interpretability, which is crucial to business leaders who want to make data-driven decisions. The desired properties of the model must be aligned with business objectives for the project to be successful.
These are some of the common algorithms used for time series forecasting:
4. Both evaluation metrics and residuals diagnostics are used.
The most common evaluation metrics for forecasting are RMSE, which you may have used on regression problems; MAPE, as it is scale-independent and represents the ratio of error to actual values as a percent; and MASE, which indicates how well the forecast performs compared to a naïve forecast by scaling its mean absolute error by that of the naïve method on the training data.
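The three metrics can be written in a few lines each. The sketch below uses my own function names and toy numbers. MASE follows the usual definition, scaling the forecast's mean absolute error by the in-sample mean absolute error of a naïve forecast that repeats the value from m periods earlier (m = 1 for non-seasonal data).

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the units of the series."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mape(actual, predicted):
    """Mean absolute percentage error; undefined if any actual value is 0."""
    return 100 * sum(
        abs((a - p) / a) for a, p in zip(actual, predicted)
    ) / len(actual)


def mase(actual, predicted, training, m=1):
    """Forecast MAE scaled by the in-sample MAE of the naive method."""
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    naive_mae = sum(
        abs(training[t] - training[t - m]) for t in range(m, len(training))
    ) / (len(training) - m)
    return mae / naive_mae


# Toy numbers, for illustration only.
training = [100, 104, 101, 107, 110]
actual = [112, 115]
predicted = [111, 118]
print(rmse(actual, predicted), mape(actual, predicted))
print(mase(actual, predicted, training))
```

A MASE below 1 means the forecast beats the naïve method on average, and a value above 1 means it does worse.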
After a forecasting model has been fit, it is important to assess how well it is able to capture patterns. While evaluation metrics help determine how close the fitted values are to the actual ones, they do not evaluate whether the model properly fits the time series. Instead, the residuals are a good way to evaluate this. As you are trying to capture the patterns of a time series, you would expect the errors to behave as white noise, as they represent what cannot be captured by the model. White noise must have the following properties:
The residuals are uncorrelated (ACF = 0)
The residuals follow a normal distribution, with zero mean (unbiased) and constant variance
If either of the two properties is not present, it means that there is room for improvement in the model.
The zero-mean property can easily be verified with a t-test for the mean. Normality and constant variance properties can be visually checked with a histogram of the residuals, or with an appropriate univariate normality test. The first property can be verified in two ways:
Apply a portmanteau test to check the hypothesis that residuals are uncorrelated.
Plot the autocorrelation function (ACF) and check that at least 95% of the spikes lie within the interval $\pm 2/\sqrt{T}$, where T is the size of the time series.
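The ACF check can also be done numerically rather than visually. The following sketch computes the sample autocorrelations and the share of spikes that fall inside the bounds. It assumes the commonly used bounds of ±2/√T, and the function names and simulated residuals are hypothetical. It does not replace a formal portmanteau test.

```python
import math
import random


def acf(residuals, max_lag):
    """Sample autocorrelations for lags 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((r - mean) ** 2 for r in residuals)
    return [
        sum(
            (residuals[t] - mean) * (residuals[t - k] - mean)
            for t in range(k, n)
        ) / denom
        for k in range(1, max_lag + 1)
    ]


def share_within_bounds(residuals, max_lag):
    """Fraction of ACF spikes inside the assumed +/- 2/sqrt(T) bounds."""
    bound = 2 / math.sqrt(len(residuals))
    values = acf(residuals, max_lag)
    inside = sum(1 for v in values if abs(v) <= bound)
    return inside / len(values)


rng = random.Random(42)
white_noise = [rng.gauss(0, 1) for _ in range(200)]  # simulated residuals
trending = [0.1 * t for t in range(200)]  # residuals with leftover trend

print(share_within_bounds(white_noise, max_lag=20))
print(share_within_bounds(trending, max_lag=20))
```

Residuals that still contain a trend produce large positive autocorrelations at every lag, so their share inside the bounds drops to zero, which signals that the model missed a pattern.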
The following is an example of residuals that behave as white noise. The residuals have zero mean and constant variance, and seem to be normally distributed. All the spikes of the ACF lie within the desired interval.
5. The right resolution must be chosen.
While working with time series, you must have a clear understanding of the objective of your analysis. Assume that the business objective is to forecast at a yearly level. There are two technical ways you can approach this:
Use the yearly totals and fit a model to forecast the required number of years.
In the case that you have the values available at a quarterly or monthly level, build a time series model to forecast the required months or quarters, and aggregate to find the total per year.
Aim for the most granular level possible. When using aggregates, the model is learning patterns at a macro level. This is not a bad choice, but there may be some patterns at the granular level that the model is not paying attention to. In our example, using monthly or quarterly data may yield better results than a yearly forecast.
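There is another benefit from doing this as well. You may think that after adding the forecasts the error may propagate to the total. However, the opposite is the case. If the model you built is unbiased, the mean of the residuals will be zero or close to zero, and therefore the sum of the residuals will be close to zero:
$\sum_{t=1}^{n} e_t = \sum_{t=1}^{n} (y_t - \hat{y}_t) \approx 0$, where $e_t$ is the residual, $y_t$ the actual value, and $\hat{y}_t$ the forecast at time t, and n is the number of periods being aggregated.
Therefore, we have:
$\sum_{t=1}^{n} y_t \approx \sum_{t=1}^{n} \hat{y}_t$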
This means that if residuals behave as white noise, you could get a very low error on the aggregated total.
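A quick simulation shows the cancellation effect. In this sketch the monthly forecasts and the zero-mean errors are made up, and the function names are my own. It compares the error of the yearly total with the sum of the absolute monthly errors.

```python
import random


def aggregate_error(actual, forecast):
    """Error of the aggregated total: positive and negative errors cancel."""
    return sum(actual) - sum(forecast)


def total_absolute_error(actual, forecast):
    """Sum of the absolute errors of the individual periods."""
    return sum(abs(a - f) for a, f in zip(actual, forecast))


rng = random.Random(0)
# Hypothetical monthly forecasts and unbiased (zero-mean) errors.
monthly_forecast = [100 + 2 * month for month in range(12)]
monthly_actual = [f + rng.gauss(0, 5) for f in monthly_forecast]

print("Error of the yearly total:",
      round(aggregate_error(monthly_actual, monthly_forecast), 2))
print("Sum of absolute monthly errors:",
      round(total_absolute_error(monthly_actual, monthly_forecast), 2))
```

The error of the total can never exceed the sum of the absolute monthly errors, and with unbiased residuals it is typically much smaller because overestimates and underestimates offset each other.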
Also keep in mind that working at a level that is too granular may present noisy data that is difficult to model. In our example where we forecasted at a yearly level, using quarterly, monthly, or even a weekly level may be appropriate. But a daily, hourly, or a lower level may be too granular and noisy for the problem. Therefore, try to work at an appropriate level of resolution.
6. Provide confidence intervals on top of predictions.
As previously stated, forecasts are predictions that include confidence intervals, usually at the 80% and 95% levels. When a model does not provide them directly, you can use the standard deviation of the residuals as the sample standard deviation and calculate the confidence intervals using an appropriate distribution, like the normal or exponential.
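As a minimal sketch of the residual-based approach, the code below builds 80% and 95% intervals around a point forecast, assuming normally distributed residuals. The function name, the point forecast, and the residuals are hypothetical. The interval applies to a one-step-ahead forecast, because uncertainty usually grows with the horizon.

```python
from statistics import NormalDist, stdev


def normal_interval(point_forecast, residuals, level):
    """Symmetric interval around a forecast, assuming normal residuals."""
    sigma = stdev(residuals)  # sample standard deviation of the residuals
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. about 1.96 for 95%
    return point_forecast - z * sigma, point_forecast + z * sigma


# Hypothetical in-sample residuals and one-step-ahead point forecast.
residuals = [-3.1, 1.4, 0.2, -0.8, 2.5, -1.9, 0.7, 1.0]
point_forecast = 120.0

for level in (0.80, 0.95):
    lower, upper = normal_interval(point_forecast, residuals, level)
    print(f"{level:.0%} interval: [{lower:.2f}, {upper:.2f}]")
```

The 95% interval is always wider than the 80% one, since covering more of the distribution requires a larger multiple of the standard deviation.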
For some models, e.g., neural networks, which are not based on probability distributions, you can run simulations of the forecasts and calculate confidence intervals from the distribution of the simulations.
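The simulation approach can be sketched as follows. A random-walk forecaster stands in for the real model (a neural network would take its place in practice), and future paths are generated by resampling past residuals. Both choices, along with the function names and sample values, are assumptions for illustration.

```python
import random


def simulate_paths(last_value, residuals, horizon, n_paths, seed=0):
    """Simulate future paths by adding resampled residuals step by step."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        value = last_value
        path = []
        for _ in range(horizon):
            # Stand-in model: next value = current value + resampled error.
            value = value + rng.choice(residuals)
            path.append(value)
        paths.append(path)
    return paths


def empirical_interval(paths, step_index, level):
    """Read the interval off the sorted simulated values at one step."""
    values = sorted(path[step_index] for path in paths)
    lower_idx = int((1 - level) / 2 * (len(values) - 1))
    upper_idx = int((1 + level) / 2 * (len(values) - 1))
    return values[lower_idx], values[upper_idx]


# Hypothetical residuals from a fitted model and the last observed value.
residuals = [-3.1, 1.4, 0.2, -0.8, 2.5, -1.9, 0.7, 1.0]
paths = simulate_paths(last_value=120.0, residuals=residuals,
                       horizon=6, n_paths=1000)

for level in (0.80, 0.95):
    lower, upper = empirical_interval(paths, step_index=5, level=level)
    print(f"{level:.0%} interval at step 6: [{lower:.2f}, {upper:.2f}]")
```

Because the intervals come from the empirical distribution of the simulated paths, no distributional assumption about the residuals is needed beyond treating them as exchangeable.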
7. Some models will have either high accuracy or high error.
The performance of time-series forecasts can vary more than that of other predictive models. Remember that you are assuming that past patterns are indicators of what may occur in the future, and therefore they get replicated or projected. This means that if patterns continue the way they are, your forecasts will be highly accurate.
However, if patterns change, either gradually or abruptly, the forecasts may deviate substantially from actual results. There is a chance that “black swan” or “gray swan” events may occur. According to Investopedia:
Black swan: An event or occurrence that deviates beyond what is normally expected of a situation and is extremely difficult to predict.
Gray swan: An event that can be anticipated to a certain degree, but is considered unlikely to occur and may have a sizable impact if it does occur.
Such events occur frequently in economic time series. When one does, it is preferable to first evaluate the impact, and then, if required, update the forecasts using recent data after the event has passed.
Conclusion
I hope this guide will help you to have an easier and less painful transition from machine learning to time series forecasting. As you may have observed, there are many concepts that overlap, while others are completely different or need to be adapted.
References
Dunning, T., & Friedman, E. (2015). Time Series Databases (1st ed.). California: O’Reilly Media. Retrieved from http://shop.oreilly.com/product/0636920035435.do
Hyndman, R., & Athanasopoulos, G. (2017). Forecasting: Principles and Practice (2nd ed.). Retrieved from https://www.otexts.org/fpp2/
Hyndman, R., & Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 27(3), 1-22. Retrieved from http://www.jstatsoft.org/article/view/v027i03