Comprehensive Time Series Exploratory Evaluation

Autocorrelation

Once our data is stationary, we can investigate two other key time series attributes: partial autocorrelation and autocorrelation. In formal terms:

The autocorrelation function (ACF) measures the linear relationship between lagged values of a time series. In other words, it measures the correlation of the time series with itself. [2]

The partial autocorrelation function (PACF) measures the correlation between lagged values in a time series once we remove the influence of the correlated lagged values in between. Those intermediate lags act as confounding variables. [3]

Both metrics can be visualized with statistical plots known as correlograms. But first, it is important to develop a better understanding of them.

Since this article is focused on exploratory analysis and these concepts are fundamental to statistical forecasting models, I'll keep the explanation brief, but keep in mind that these are highly important ideas to build a solid intuition upon when working with time series. For a comprehensive read, I recommend the great kernel “Time Series: Interpreting ACF and PACF” by the Kaggle Notebooks Grandmaster Leonie Monigatti.

As noted above, autocorrelation measures how the time series correlates with itself over the previous q lags. You can think of it as a measurement of the linear relationship between a subset of your data and a copy of itself shifted back by q periods. Autocorrelation, or ACF, is an important metric for determining the order q of Moving Average (MA) models.
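To make the "shifted copy" idea concrete, here is a minimal sketch that correlates a series with a lagged copy of itself. The data is synthetic and the helper name lagged_correlation is my own, not part of any library. Note that statsmodels normalizes the ACF with the full-sample mean and variance, so its values can differ slightly from this pairwise correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical series: a noisy wave with a 24-step period (not the article's data).
rng = np.random.default_rng(42)
t = np.arange(240)
series = pd.Series(np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.3, t.size))


def lagged_correlation(s: pd.Series, lag: int) -> float:
    """Pearson correlation between s and a copy of itself shifted back by `lag` periods."""
    return s.corr(s.shift(lag))


for lag in (1, 12, 24):
    print(f"lag {lag:>2}: {lagged_correlation(series, lag):+.3f}")
```

With a 24-step cycle, I would expect a strong positive value at lag 24 and a strong negative one at lag 12, where the wave is in the opposite phase. pandas also offers Series.autocorr(lag), which computes the same quantity.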

Partial autocorrelation, on the other hand, is the correlation of the time series with its p-lagged version, but now considering only its direct effects. For example, if I want to check the partial autocorrelation of the t-3 to t-1 time period with my current t0 value, I won't care about how t-3 influences t-2 and t-1, or how t-2 influences t-1. I'll focus exclusively on the direct effects of t-3, t-2, and t-1 on my current time stamp, t0. Partial autocorrelation, or PACF, is an important metric for determining the order p of Autoregressive (AR) models.
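One way to see what "direct effect" means is to regress the current value on all lags up to k and read the coefficient of lag k. The sketch below does that with NumPy on a simulated AR(1) process. The data, the coefficient 0.7, and the helper name pacf_by_regression are assumptions for illustration. This is similar in spirit to the regression-based estimators that statsmodels offers, not a copy of them.

```python
import numpy as np

# Hypothetical AR(1) process x_t = 0.7 * x_{t-1} + noise (not the article's data).
rng = np.random.default_rng(0)
n = 2000
x = np.zeros(n)
for i in range(1, n):
    x[i] = 0.7 * x[i - 1] + rng.normal()


def pacf_by_regression(values: np.ndarray, k: int) -> float:
    """Coefficient of lag k when regressing x_t on lags 1..k plus an intercept."""
    target = values[k:]
    # Column j holds the series shifted back by j periods, for j = 1..k.
    lags = np.column_stack(
        [values[k - j : len(values) - j] for j in range(1, k + 1)]
    )
    design = np.column_stack([np.ones(len(target)), lags])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    # The last coefficient is the effect of lag k with lags 1..k-1 held fixed.
    return float(coefs[-1])


for k in (1, 2, 3):
    print(f"lag {k}: {pacf_by_regression(x, k):+.3f}")
```

For an AR(1) process, only lag 1 should stand out: lags 2 and 3 influence the present only through lag 1, so their direct effects should be close to zero.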

With these concepts clarified, we can now return to our data. Because the two metrics are often analyzed together, our last function will combine the PACF and ACF plots in a grid plot that returns correlograms for multiple variables. It makes use of the statsmodels plot_pacf() and plot_acf() functions, and maps them to a Matplotlib subplots() grid.

Notice how both statsmodels functions use the same arguments, apart from the method parameter, which is exclusive to plot_pacf().

Now you can experiment with different aggregations of your data, but remember that when resampling the time series, each lag will represent a different jump back in time. For illustrative purposes, let's analyze the PACF and ACF for all 4 stations in the month of January 2016, with a 6-hour aggregated dataset.
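As a rough sketch of that preparation step, the snippet below aggregates hourly data to 6-hour means and keeps January 2016. The DataFrame, its station_a column, and the choice of the mean as the aggregation are assumptions, since the article's dataset and aggregation function are not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings for one station (not the article's dataset).
index = pd.date_range("2015-12-01", "2016-02-29 23:00", freq="h")
rng = np.random.default_rng(1)
hourly = pd.DataFrame({"station_a": rng.normal(50, 10, index.size)}, index=index)

# Aggregate to 6-hour means, then keep January 2016 only.
six_hourly = hourly.resample("6h").mean()
january = six_hourly.loc["2016-01"]

# 31 days x 4 observations per day; lag k now means k * 6 hours back.
print(len(january))
```

After this step, lag 1 corresponds to t-6h and lag 4 to t-24h, which is how the lags are read in the rest of this section.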

Figure 19. PACF and ACF Correlograms for Jan 2016. Image by the author.

Correlograms return the correlation coefficients, ranging from -1.0 to 1.0, and a shaded area indicating the significance threshold. Any value that extends beyond it should be considered statistically significant.

From the results above, we can finally conclude that, on a 6-hour aggregation:

  • Lags 1, 2, 3 (t-6h, t-12h, and t-18h) and sometimes 4 (t-24h) have significant PACF.
  • Lags 1 and 4 (t-6h and t-24h) show significant ACF for many cases.

Finally, keep in mind some good practices:

  • Plotting correlograms for long periods of a time series with high granularity (for example, a whole-year correlogram for a dataset with hourly measurements) should be avoided, as the significance threshold narrows toward zero as the sample size grows.
  • I added an x_label parameter to our function to make it easy to annotate the X-axis with the time period represented by each lag. It is common to see correlograms without that information, but having easy access to it can prevent misinterpretation of the results.
  • By default, statsmodels plot_acf() and plot_pacf() include the 0-lag correlation coefficient in the plot. Because the correlation of a series with itself is always one, I have set our plots to start from the first lag with the parameter zero=False. This also improves the scale of the Y-axis, making the lags we actually need to analyze more readable.
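To get a feel for how quickly the threshold shrinks, here is a small sketch based on the common white-noise approximation of the 95% band, ±1.96/√N. The helper name approx_threshold is hypothetical, and the band statsmodels actually draws may differ, since by default it relies on Bartlett's formula for the ACF.

```python
import math


def approx_threshold(n_obs: int, z: float = 1.96) -> float:
    """Approximate 95% significance band for white noise: +/- z / sqrt(N)."""
    return z / math.sqrt(n_obs)


month_6h = 31 * 4        # January at 6-hour aggregation
year_hourly = 366 * 24   # all of 2016 (a leap year) at hourly granularity

print(f"{month_6h:>5} observations: +/-{approx_threshold(month_6h):.3f}")
print(f"{year_hourly:>5} observations: +/-{approx_threshold(year_hourly):.3f}")
```

With thousands of observations the band becomes so narrow that almost any lag crosses it, which makes the correlogram much less informative.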
