My Life Stats: I Tracked My Habits for a Year, and This Is What I Learned
Why? Just why would I do that?
And why would this matter to you?
The strategy — what did I do and how did I do it?
Initial exploration of the data
Correlation study
Time Series studies — ARIMA models
FFT — Fast Fourier Transform
Conclusions
References

I first looked at the individual time series for four variables: Sleep, Studying, Socializing and Mood. I used Microsoft Excel to quickly draw some plots. They represent the daily number of hours spent (blue) and the 5-day moving average¹, MA(5) (red), which I considered to be a good measure for my situation. The mood variable was rated from 10 (the best!) to 0 (awful!).

Regarding the information contained in the footnote of each plot: the total is the sum of the values of the series, the mean is the arithmetic mean of the series, the STD is the standard deviation, and the relative deviation is the STD divided by the mean.
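As a quick sketch of how these figures can be reproduced in pandas (assuming the daily log lives in a CSV with one column per variable, like the final_stats.csv file used later, and a hypothetical "Sleep" column), both the MA(5) and the footnote statistics are one-liners:

import pandas as pd

#hypothetical file and column names; adjust to your own log
df = pd.read_csv("final_stats.csv", sep=";")
sleep = df["Sleep"]

ma5 = sleep.rolling(window=5).mean()  #5-day moving average, MA(5)

total = sleep.sum()                   #total hours
mean = sleep.mean()                   #arithmetic mean
std = sleep.std()                     #standard deviation
rel_dev = std / mean                  #relative deviation

print(f"Total: {total:.0f}h. Mean: {mean:.1f}h. "
      f"STD: {std:.1f}h. Relative deviation: {rel_dev:.1%}")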

Total: 2361h. Mean: 7.1h. STD: 1.1h. Relative deviation: 15.5% (image by author).

All things accounted for, I did well enough with sleep. I had rough days, like everyone else, but I believe the trend is pretty stable. In fact, it's one of the least-varying variables of my study.

Total: 589.1h. Mean: 1.8h. STD: 2.2h. Relative deviation: 122% (image by author).

These are the hours I dedicated to my academic career. It fluctuates a lot (finding balance between work and studying often means having to cram projects into the weekends), but still, I consider myself satisfied with it.

Total: 1440.9h. Mean: 4.3h. STD: 4.7h. Relative deviation: 107% (image by author).

Regarding this table, all I can say is that I'm surprised. The grand total is bigger than I expected, given that I'm an introvert. In fact, hours with my colleagues at school also count. In terms of variability, the STD is really high, which makes sense given the difficulty of keeping an established routine for socializing.

Mean: 8.0. STD: 0.9. Relative deviation: 11.3% (image by author).

This is the least variable series: the relative deviation is the lowest among my studied variables. A priori, I'm satisfied with the observed trend. I believe it's positive to maintain a reasonably stable mood, and even better if it's a good one.

After looking at the trends for the main variables, I decided to dive deeper and study the potential correlations² between them. Since my goal was being able to mathematically model and predict (or at least explain) "Mood", correlations were a crucial metric to consider. From them, I could extract relationships like the following: "the days I study the most are the ones I sleep the least", "I often study languages and music together", etc.

Before we do anything else, let's open up a Python file and import some key libraries for series analysis. I normally use aliases for them, as it is a standard practice and makes the actual code less verbose.

import pandas as pd                #1.4.4
import numpy as np                 #1.22.4
import seaborn as sns              #0.12.0
import matplotlib.pyplot as plt    #3.5.2
from pmdarima import arima         #2.0.4

We are going to make two different studies regarding correlation. We are going to look into the Pearson Correlation Coefficient³ (for linear relationships between variables) and the Spearman Correlation Coefficient⁴ (which studies monotonic relationships between variables). We will be using their implementation⁵ in pandas.

Pearson Correlation matrix

The Pearson Correlation Coefficient between two variables X and Y is computed as follows:

ρ(X, Y) = cov(X, Y) / (σ_X · σ_Y), where cov is the covariance, σ_X is std(X) and σ_Y is std(Y).
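For reference, here is a minimal sketch of that formula applied to two toy NumPy arrays (the values are made up for illustration); it matches NumPy's built-in np.corrcoef:

import numpy as np

x = np.array([7.5, 6.0, 8.0, 7.0, 6.5])  #e.g. hours of sleep
y = np.array([7.0, 5.0, 9.0, 8.0, 6.0])  #e.g. mood rating

#Pearson coefficient: covariance divided by the product of the STDs
pearson = np.cov(x, y, bias=True)[0, 1] / (np.std(x) * np.std(y))

#NumPy's built-in helper gives the same value
assert np.isclose(pearson, np.corrcoef(x, y)[0, 1])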

We will quickly calculate a correlation matrix, where every possible pairwise correlation is computed.

#read the data and select the numeric columns
raw = pd.read_csv("final_stats.csv", sep=";")
numerics = raw.select_dtypes('number')

#compute the correlation matrix
corr = numerics.corr(method='pearson')

#generate the heatmap
sns.heatmap(corr, annot=True)

#draw the plot
plt.show()

This is the raw Pearson Correlation matrix obtained from my data.

Pearson Correlation matrix for my variables (image by author).

And these are the significant values⁶: those which are, with 95% confidence, different from zero. We perform a t-test⁷ with the following formula. For each correlation value ρ, we discard it if:

|ρ| < 2/√n, where n is the sample size (with n = 332 samples, that threshold is roughly 0.11). We can recycle the code from before and add in this filter.

#constants
N = 332               #number of samples
STEST = 2/np.sqrt(N)  #significance threshold, 2/sqrt(n)

def significance_pearson(val):
    #True means the value is not significant and should be masked
    if np.abs(val) < STEST:
        return True
    return False

#read data
raw = pd.read_csv("final_stats.csv", sep=";")
numerics = raw.select_dtypes('number')

#calculate correlation
corr = numerics.corr(method='pearson')

#prepare masks
mask = corr.copy().applymap(significance_pearson)  #True = not significant
mask2 = np.triu(np.ones_like(corr, dtype=bool))    #remove upper triangle
mask_comb = np.logical_or(mask, mask2)

#plot the results
c = sns.heatmap(corr, annot=True, mask=mask_comb)
c.set_xticklabels(c.get_xticklabels(), rotation=-45)
plt.show()

The values that were discarded could just be noise, wrongfully suggesting trends or relationships. In any case, it's better to assume a real relationship is meaningless than to consider meaningful one that isn't (that is, we favor type II errors over type I errors). This is especially true in a study with relatively subjective measurements.

Filtered Pearson Correlation matrix. Non-significant values (and the upper triangular half) have been filtered out (image by author).

Spearman’s rank correlation coefficient

The Spearman correlation coefficient can be calculated as follows:

ρ = cov(R(X), R(Y)) / (σ_R(X) · σ_R(Y)), where R indicates the rank variable⁸; the rest of the variables are the same ones described for the Pearson coefficient.
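In other words, Spearman is just Pearson applied to the ranks of the data. A minimal sketch on toy pandas Series (made-up values) makes this explicit:

import pandas as pd

x = pd.Series([7.5, 6.0, 8.0, 7.0, 6.5])
y = pd.Series([7.0, 5.0, 9.0, 8.0, 6.0])

#rank the data, then apply the Pearson formula to the ranks
spearman_manual = x.rank().corr(y.rank(), method="pearson")

#pandas' built-in Spearman method gives the same value
assert abs(spearman_manual - x.corr(y, method="spearman")) < 1e-12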

As we did before, we can quickly compute the correlation matrix:

#read the data and select the numeric columns
raw = pd.read_csv("final_stats.csv", sep=";")
numerics = raw.select_dtypes('number')

#compute the correlation matrix
corr = numerics.corr(method='spearman') #note the change of method!

#generate the heatmap
sns.heatmap(corr, annot=True)

#draw the plot
plt.show()

This is the raw Spearman's Rank Correlation matrix obtained from my data:

Spearman Correlation matrix for my variables (image by author).

Let's see which values are actually significant. The formula to check for significance is the following:

t = r · √((n − 2) / (1 − r²)), where r is Spearman's coefficient. Here, t follows a Student's t-distribution with n − 2 degrees of freedom.

Here, we'll filter out all t-values lower (in absolute value) than 1.96. Again, the reason they've been discarded is that we are not sure whether they are noise (random chance) or an actual trend. Let's code it up:

#constants
N = 332       #number of samples
TTEST = 1.96  #critical t-value for a 95% confidence level

def significance_spearman(val):
    #True means the value is not significant and should be masked
    if val == 1:
        return True  #diagonal: avoid division by zero
    t = val * np.sqrt((N-2)/(1-val*val))
    if np.abs(t) < TTEST:
        return True
    return False

#read data
raw = pd.read_csv("final_stats.csv", sep=";")
numerics = raw.select_dtypes('number')

#calculate correlation
corr = numerics.corr(method='spearman')

#prepare masks
mask = corr.copy().applymap(significance_spearman)  #True = not significant
mask2 = np.triu(np.ones_like(corr, dtype=bool))     #remove upper triangle
mask_comb = np.logical_or(mask, mask2)

#plot the results
c = sns.heatmap(corr, annot=True, mask=mask_comb)
c.set_xticklabels(c.get_xticklabels(), rotation=-45)
plt.show()

These are the significant values.

Correlation matrix with significant values (image by author).

I believe this chart better explains the apparent relationships between the variables, as its criterion is more "natural": it considers monotonic⁹, and not only linear, functions and relationships. It's also not as impacted by outliers as the other one (a couple of very bad days related to a certain variable won't impact the overall correlation coefficient).

Still, I'll leave both charts for the reader to judge and extract their own conclusions.
