Delving into one of the most common nightmares for data scientists

Introduction
One of the biggest problems in linear regression is autocorrelated residuals. In this context, this article revisits linear regression, delves into the Cochrane–Orcutt procedure as a way to solve this problem, and explores a real-world application in fMRI brain activation analysis.
Linear regression is probably one of the most important tools for any data scientist. However, it is common to see many misconceptions, especially in the context of time series. Therefore, let's invest some time revisiting the concept. The main goal of a GLM in time series analysis is to model the relationship between variables over a sequence of time points:

Y_t = A + B·X_t + Ɛ_t

where Y is the target data, X is the feature data, A and B are the coefficients to estimate, and Ɛ is the Gaussian error.
The index t refers to the time evolution of the data. Writing it in a more compact (matrix) form:

Y = XB + Ɛ

where the design matrix X now carries a column of ones for the intercept.
The estimation of the parameters is done through ordinary least squares (OLS), which assumes that the errors, or residuals, between the observed values and the values predicted by the model are independent and identically distributed (i.i.d.).
This means that the residuals must be non-autocorrelated to ensure the correct estimation of the coefficients, the validity of the model, and the accuracy of predictions.
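As a minimal illustration (a synthetic sketch of my own, not part of the original article), the OLS solution can be computed directly from the normal equations:

import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])  #constant + one feature
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(size=n)                 #i.i.d. Gaussian errors

#OLS estimate: beta_hat = (X'X)^(-1) X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  #close to [1.0, 2.0]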
Autocorrelation refers to the correlation between observations within a time series. We can understand it as how each data point is related to lagged data points in a sequence.
Autocorrelation functions (ACF) are used to detect autocorrelation. These methods measure the correlation between a data point and its lagged values (t = 1, 2, …, 40), revealing whether data points are related to preceding or following values. ACF plots (Figure 1) display the correlation coefficients at different lags, indicating the strength of autocorrelation, with statistical significance shown by the shaded region.
If the coefficients for certain lags significantly differ from zero, it suggests the presence of autocorrelation.
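To make this concrete, here is a minimal sketch (my own synthetic example) that simulates an AR(1) series and draws its ACF with statsmodels:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

#Simulate an AR(1) series with coefficient 0.7
rng = np.random.default_rng(0)
n = 200
series = np.zeros(n)
for t in range(1, n):
    series[t] = 0.7 * series[t - 1] + rng.normal()

#Bars outside the shaded region are significant at that lag
plot_acf(series, lags=40)
plt.show()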
Autocorrelation in the residuals suggests that there is a relationship or dependency between current and past errors in the time series. This correlation pattern indicates that the errors are not random and may be influenced by factors not accounted for in the model. Concretely, autocorrelation can lead to biased parameter estimates, especially of the variance, affecting the understanding of the relationships between variables. This results in invalid inferences drawn from the model, leading to misleading conclusions about relationships between variables. Furthermore, it results in inefficient predictions, which means the model is not capturing the correct information.
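To see the variance bias in action, here is a small Monte Carlo sketch (my own illustration, not from the article): with AR(1) errors and a smooth, autocorrelated regressor, the OLS-reported standard error of the slope understates its true sampling variability.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, n_sims, rho = 200, 500, 0.8
x_feat = np.sin(np.linspace(0, 6 * np.pi, n))  #smooth, autocorrelated regressor
X = sm.add_constant(x_feat)

slopes, reported_se = [], []
for _ in range(n_sims):
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = rho * e[t - 1] + rng.normal()  #AR(1) errors
    fit = sm.OLS(2.0 * x_feat + e, X).fit()   #true slope is 2.0
    slopes.append(fit.params[1])
    reported_se.append(fit.bse[1])

print(np.std(slopes))        #true sampling variability of the slope
print(np.mean(reported_se))  #OLS-reported SE, considerably smaller here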
The Cochrane–Orcutt procedure is a method well known in econometrics and in a variety of areas for addressing problems of autocorrelation in a time series through a linear model for serial correlation in the error term [1,2]. We already know that this violates one of the assumptions of ordinary least squares (OLS) regression, which assumes that the errors (residuals) are uncorrelated [1]. Later in the article, we will use the procedure to remove autocorrelation and check how biased the coefficients are.
The Cochrane–Orcutt procedure goes as follows:
- 1. Initial OLS Regression: Start with an initial regression analysis using ordinary least squares (OLS) to estimate the model parameters.
- 2. Residual Calculation: Calculate the residuals from the initial regression.
- 3. Test for Autocorrelation: Examine the residuals for the presence of autocorrelation using ACF plots or tests such as the Durbin-Watson test. If the autocorrelation is not significant, there is no need to follow the procedure.
- 4. Transformation: The estimated model is transformed by differencing the dependent and independent variables to remove autocorrelation. The idea here is to make the residuals closer to being uncorrelated.
- 5. Regress the Transformed Model: Perform a new regression analysis with the transformed model and compute new residuals.
- 6. Check for Autocorrelation: Test the new residuals for autocorrelation again. If autocorrelation remains, go back to step 4 and transform the model further until the residuals show no significant autocorrelation.
- 7. Final Model Estimation: Once the residuals exhibit no significant autocorrelation, use the final model and the coefficients derived from the Cochrane-Orcutt procedure for making inferences and drawing conclusions! (A code sketch of the full loop follows this list.)
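Here is a compact sketch of the full iterative loop (a minimal implementation of my own, assuming the design matrix's first column is the constant; not the article's original code):

import numpy as np
import statsmodels.api as sm

def cochrane_orcutt(y, X, tol=1e-4, max_iter=20):
    #Step 1: initial OLS fit
    beta = sm.OLS(y, X).fit().params
    rho_old = np.inf
    results = None
    for _ in range(max_iter):
        #Step 2: residuals of the original-scale model
        resid = y - X @ beta
        #Step 3: lag-1 autocorrelation of the residuals
        rho = np.sum(resid[1:] * resid[:-1]) / np.sum(resid[:-1] ** 2)
        #Step 4: quasi-difference y and X (the constant column is transformed
        #too, so the refitted coefficients stay on the original scale)
        y_star = y[1:] - rho * y[:-1]
        X_star = X[1:] - rho * X[:-1]
        #Step 5: refit OLS on the transformed data
        results = sm.OLS(y_star, X_star).fit()
        beta = results.params
        #Step 6: stop once rho stabilises
        if abs(rho - rho_old) < tol:
            break
        rho_old = rho
    return results, rho

For a library route, statsmodels ships essentially the same iteration as GLSAR, e.g. sm.GLSAR(y, X, rho=1).iterative_fit(maxiter=10).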
A brief introduction to fMRI
Functional Magnetic Resonance Imaging (fMRI) is a neuroimaging technique that measures and maps brain activity by detecting changes in blood flow. It relies on the principle that neural activity is associated with increased blood flow and oxygenation. In fMRI, when a brain region becomes active, it triggers a hemodynamic response, leading to changes in blood oxygen level-dependent (BOLD) signals. fMRI data typically consists of 3D images representing brain activation at different time points, so each volume element (voxel) of the brain has its own time series (Figure 2).
The General Linear Model (GLM)
The GLM assumes that the measured fMRI signal is a linear combination of different factors (features), such as task information combined with the expected response of neural activity, known as the Hemodynamic Response Function (HRF). For simplicity, we will ignore the nature of the HRF and just assume that it is an important feature.
To understand the impact of the tasks on the resulting BOLD signal y (dependent variable), we can use a GLM of the form

y = B0 + B1·X1 + B2·X2 + Ɛ

This translates to checking the effect through statistically significant coefficients associated with the task information. Hence, X1 and X2 (independent variables) are information about the task executed by the participant during data collection, convolved with the HRF (Figure 3).
Application to real data
In order to check this real-world application, we will use data collected by Prof. João Sato at the Federal University of ABC, which is available on GitHub. The dependent variable fmri_data contains data from one voxel (a single time series), but we could do this for every voxel in the brain. The independent variables that contain the task information are cong and incong. The explanation of these variables is beyond the scope of this article.
#Reading data
import numpy as np
import nibabel as nib
#NOTE: the original article does not show where `glover` comes from;
#an HRF helper such as nilearn's glover_hrf is assumed here
fmri_img = nib.load('/Users/rodrigo/Medium/GLM_Orcutt/Stroop.nii')
cong = np.loadtxt('/Users/rodrigo/Medium/GLM_Orcutt/congruent.txt')
incong = np.loadtxt('/Users/rodrigo/Medium/GLM_Orcutt/incongruent.txt')

#Get the series from each voxel
fmri_data = fmri_img.get_fdata()

#HRF function
HRF = glover(.5)

#Convolution of task data with HRF
conv_cong = np.convolve(cong.ravel(), HRF.ravel(), mode='same')
conv_incong = np.convolve(incong.ravel(), HRF.ravel(), mode='same')
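As a quick sanity check (the exact shape depends on the dataset), the loaded data should be 4-D, three spatial axes plus time, so that indexing three spatial coordinates yields one voxel's time series:

#Data is 4-D: (x, y, z, time), one time series per voxel
print(fmri_data.shape)
print(fmri_data[20, 30, 30].shape)  #a single voxel's time series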
Visualising the task information variables (features).
Fitting GLM
Using ordinary least squares to fit the model and estimate the model parameters, we get:
import statsmodels.api as sm

#Selecting one voxel (time series)
y = fmri_data[20,30,30]
x = np.array([conv_incong, conv_cong]).T

#add constant to predictor variables
x = sm.add_constant(x)

#fit linear regression model
model = sm.OLS(y,x).fit()

#view model summary
print(model.summary())
params = model.params
It is possible to see that the coefficient X1 is statistically significant, since its P>|t| value is lower than 0.05. That would mean the task indeed impacts the BOLD signal. But before using these parameters for inference, it is essential to check that the residuals, meaning y minus the prediction, are not autocorrelated at any lag. Otherwise, our estimate is biased.
Checking residual autocorrelation
As already discussed, the ACF plot is a great way to check for autocorrelation in a series.
From the ACF plot, it is possible to detect a high autocorrelation at lag 1. Therefore, this linear model is biased, and it is necessary to fix this problem.
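To reproduce this check, here is a short sketch reusing the model fitted above (plot_acf and durbin_watson are standard statsmodels helpers):

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.stats.stattools import durbin_watson

#ACF of the residuals: bars outside the shaded band flag autocorrelation
plot_acf(model.resid, lags=20)
plt.show()

#Durbin-Watson: ~2 means no lag-1 autocorrelation;
#values well below 2 point to positive autocorrelation
print(durbin_watson(model.resid))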
Cochrane-Orcutt to solve autocorrelation in residuals
The Cochrane-Orcutt procedure is widely used in fMRI data analysis to solve this kind of problem [2]. In this specific case, the lag-1 autocorrelation in the residuals is significant; therefore, we can use the Cochrane–Orcutt transformation for an autoregressive term AR(1):

Y2 = Y_t − ρ·Y_(t−1),  X2 = X_t − ρ·X_(t−1)

where ρ is the lag-1 autocorrelation coefficient of the residuals.
# LAG 0
yt = y[2:180]
# LAG 1
yt1 = y[1:179]

# calculate the lag-1 correlation coefficient of the residuals
# (Cochrane-Orcutt estimates rho from the residuals, not from y itself)
res = model.resid
rho = np.corrcoef(res[2:180], res[1:179])[0,1]

# Cochrane-Orcutt equation: quasi-difference y and the features
Y2 = yt - rho*yt1
X2 = x[2:180,1:] - rho*x[1:179,1:]
Fitting the transformed model
Fitting the model again, but now after the Cochrane-Orcutt correction.
import statsmodels.api as sm

#add constant to predictor variables
X2 = sm.add_constant(X2)

#fit linear regression model
model = sm.OLS(Y2,X2).fit()

#view model summary
print(model.summary())
params = model.params
Now the coefficient X1 is no longer statistically significant, discarding the hypothesis that the task impacts the BOLD signal. The parameters' standard error estimates changed significantly, which indicates the strong impact of autocorrelation in the residuals on the estimation.
Checking for autocorrelation again
This makes sense, since it is possible to show that the variance estimate is always biased when there is autocorrelation [1].
Now the autocorrelation in the residuals has been removed and the estimate is no longer biased. If we had ignored the autocorrelation in the residuals, we could have considered the coefficient significant. However, after removing the autocorrelation, it turns out that the parameter is not significant, avoiding a spurious inference that the task is indeed related to the signal.
Autocorrelation in the residuals of a General Linear Model can lead to biased estimates, inefficient predictions, and invalid inferences. The application of the Cochrane–Orcutt procedure to real-world fMRI data demonstrates its effectiveness in removing autocorrelation from residuals and avoiding false conclusions, ensuring the reliability of the model parameters and the accuracy of the inferences drawn from the analysis.
Remarks
Cochrane-Orcutt is just one method to solve autocorrelation in the residuals. However, there are others to address this problem, such as the Hildreth-Lu procedure and the first differences procedure [1]. A quick sketch of first differences appears below.
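For context, the first differences procedure is the special case ρ = 1. A minimal sketch of my own, reusing y and x from the GLM code above:

import statsmodels.api as sm

#First differences: quasi-differencing with rho fixed at 1
Y_diff = y[1:] - y[:-1]
X_diff = x[1:, 1:] - x[:-1, 1:]  #drop the constant column before differencing

model_fd = sm.OLS(Y_diff, sm.add_constant(X_diff)).fit()
print(model_fd.summary())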