In this series of posts I will discuss how to assess the similarity between a pair of time series using frequentist methods, i.e. methods that in the end provide a p-value and a similarity score (e.g. a correlation), using parametric and/or non-parametric approaches. Most explanations below are intuitive; you can read the details in books on signal processing and random processes. I will mostly show examples using Pearson's correlation because: 1) Pearson's r is equal to the beta coefficient of the linear regression between two signals with zero mean and unit variance (i.e. z-scored signals); 2) if you think Pearson's is not a good choice because you have a few extreme spikes (= noise) in your data, then I would rather remove the spikes (e.g. with a sliding median filter or with linear regression) and work with Pearson. If, however, the spikes are actually part of your signal of interest, then you need to work with Spearman's correlation, and it is not difficult to adapt what follows to the Spearman case.
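To convince yourself of claim 1), here is a minimal sketch (the toy data and variable names are just for illustration): after z-scoring both signals, the least-squares slope coincides with Pearson's r.

rng(1)
a = randn(500,1);
b = 0.4*a + randn(500,1);   % toy data, correlated by construction
za = zscore(a);
zb = zscore(b);
beta = za \ zb;             % regression slope of zb on za (no intercept needed after z-scoring)
r    = corr(a,b);           % Pearson's correlation
[beta r]                    % the two values coincide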
1. Similarity between two “white” signals
If the time series we are dealing with are “basically white” (i.e. the time points come from a Gaussian process with no relationship between time points), then a simple correlation and the parametric p-value it returns are enough to claim statistical similarity.
rng(0)
T = 1000; % number of time points
x = randn(T,1);
y = randn(T,1);
[r p] = corr(x,y)
Which, if you copy-paste it into Matlab, will give:
r = -0.0093
p = 0.7695
How do you assess that the signal you are dealing with is white? There are tests to check whether the points are well fit by a Gaussian distribution (Enrico to add link to examples here). A white signal means that the frequency power spectrum (i.e. the power spectral density, PSD) is roughly a flat line (i.e. all frequencies have approximately the same power). The inverse Fourier transform of the PSD is the autocorrelation of the time series; since the PSD is a flat line, the autocorrelation must be (approximately) a Dirac delta function, i.e. a “stick” function that is different from zero at the origin and “almost zero” at all other lags.
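A quick way to see this link in practice (a minimal sketch, assuming nothing beyond base Matlab): take white noise, compute a raw PSD estimate, and inverse-transform it; the result is essentially a delta at lag zero.

rng(0)
w = randn(1024,1);
P = abs(fft(w)).^2 / length(w);  % raw PSD estimate (periodogram), roughly flat for white noise
ac = real(ifft(P));              % circular autocorrelation of w (Wiener-Khinchin relation)
ac = ac / ac(1);                 % normalise so that lag 0 equals 1
stem(0:20, ac(1:21))             % ~1 at lag 0, ~0 everywhere else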
Autocorrelation
Put intuitively, every real signal is an autocorrelated signal, and autocorrelation means information. Signals measured from nature tend to have a 1/f power spectrum: what happens at a specific time point is very correlated with what happens at the next (= immediately following) time point. Some signals are faster than others, so how fine that “immediately following” step is depends on your sampling frequency, and you want to make sure it is fast enough to keep most of the frequencies of the signal within your sampling range (go read about Shannon’s sampling theorem if you haven’t heard of it before).
Long story short, you can estimate the autocorrelation of a time series by computing the cross-correlation between the signal and a lagged copy of itself. Repeat for multiple lags and you get the autocorrelation plot. With Matlab:
subplot(2,1,1)
autocorr(x)
subplot(2,1,2)
autocorr(y)
And indeed we see that the two time series have an autocorrelation that basically looks like a Dirac delta, i.e. maximum correlation at lag = 0 (an obvious result) and very tiny correlation at lags different from 0.
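If you prefer to see it done by hand, here is a minimal sketch of the same idea (a slightly different estimator than autocorr, just for intuition): correlate the signal with a lagged copy of itself, one lag at a time.

maxlag = 20;
ac_manual = zeros(maxlag+1,1);
for lag = 0:maxlag
    ac_manual(lag+1) = corr(x(1:end-lag), x(1+lag:end)); % correlation of x with its lagged copy
end
stem(0:maxlag, ac_manual) % should resemble the autocorr(x) plot above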
2. Relationship between r, p, t-distribution, autocorrelation, and degrees of freedom
Now, to understand how the p-value is computed from an r value parametrically, one has to look at the distribution used. Matlab uses this kind of mapping:
Code: https://version.aalto.fi/gitlab/BML/bramila/blob/master/external/sandbox/pvalPearson.m
function p = pvalPearson(tail, rho, n)
%PVALPEARSON Tail probability for Pearson's linear correlation.
%   p = pvalPearson(tail, rho, n)
%   tail = 'b' 'r' 'l'
t = sign(rho) .* Inf;
k = (abs(rho) < 1);
t(k) = rho(k).*sqrt((n-2)./(1-rho(k).^2));
switch tail
    case 'b' % 'both' or 'ne'
        p = 2*tcdf(-abs(t),n-2);
    case 'r' % 'right' or 'gt'
        p = tcdf(-t,n-2);
    case 'l' % 'left' or 'lt'
        p = tcdf(t,n-2);
end
Where n is the total number of time points, which corresponds to n-2 degrees of freedom. So if your signals are white, the degrees of freedom in your data are indeed very close to the number of time points, since each time point is independent from the other time points. But real signals are autocorrelated, which intuitively means that even if you downsampled the signal (e.g. keeping 1 time point every 10) you would still retain most of the information.
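As a quick sanity check (a sketch, assuming you have saved the pvalPearson function above somewhere on your path), plugging the r from the white-noise example and the full n = T into this mapping reproduces the p-value that corr reported, so for white signals the nominal degrees of freedom are fine:

p_check = pvalPearson('b', r, T)   % gives ~0.7695, the same p-value as corr above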
But if your signals are not white (i.e. all the signals you sample in real life), then the p-value returned by the corr function is wrong, because it overestimates the degrees of freedom used in the t-distribution. How to solve this problem? One solution is to estimate the actual degrees of freedom of an autocorrelated signal.
Prof. Petri Toiviainen spent some time digging through the literature for a paper we were co-authoring years ago, and found this gem in the Canadian Journal of Fisheries and Aquatic Sciences, “Comparison of methods to account for autocorrelation in correlation analyses of fish data”, and summed it up in Appendix B of our paper.
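The core idea of that correction is to shrink the nominal sample size by the lagged autocorrelations of the two series. Here is a rough sketch of that style of effective-degrees-of-freedom estimate (my paraphrase of the formula; the lag cutoff and the exact autocorrelation estimator may well differ from what bramila_autocorr actually implements):

function df = effective_df_sketch(x, y)
% EFFECTIVE_DF_SKETCH Rough effective degrees of freedom for corr(x,y) (sketch only).
N     = length(x);
nlags = round(N/5);                 % assumed truncation of the lag sum
acfx  = zeros(nlags,1);
acfy  = zeros(nlags,1);
for lag = 1:nlags                   % sample autocorrelations at lags 1..nlags
    acfx(lag) = corr(x(1:end-lag), x(1+lag:end));
    acfy(lag) = corr(y(1:end-lag), y(1+lag:end));
end
% effective sample size: 1/Nstar = 1/N + (2/N) * sum_j acfx(j)*acfy(j)
Nstar = 1 / ( 1/N + (2/N)*sum(acfx.*acfy) );
df    = Nstar - 2;                  % degrees of freedom for the t-distribution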
You can test the formula yourself with the code I wrote here: https://version.aalto.fi/gitlab/BML/bramila/blob/master/bramila_autocorr.m
using our white signals above:
bramila_autocorr(x,y)

ans = 999.3657
Indeed, the cross-correlation between the two time series has a number of degrees of freedom almost equal to the total number of time points T.
Example with two autocorrelated signals
Here is a final example using two autocorrelated time series, showing how to compute the correct p-value parametrically by plugging the estimated degrees of freedom into the t-distribution function.
T = 1000;
rng(0)
x = zscore(cumsum(randn(T,1)));
y = zscore(cumsum(randn(T,1)));
[r p] = corr(x,y)
df = bramila_autocorr(x,y)
p_actual = pvalPearson('b', r, df+2)
which produces
r = -0.5026
p = 3.8770e-65
df = 8.0402
p_actual = 0.1376
As we can see, the WRONG p-value obtained by blindly applying Matlab’s corr function would tell us that the two time series are strongly related, with a minuscule p-value of 10^-65. In reality the degrees of freedom between the time series are just 8, which makes the p-value non-significant (as it should be, since the data are just random noise).
In the next parts I will show how to obtain p-values without resorting to the parametric distributions I used here. Specifically, one can generate surrogate signals that have properties similar to the original signals (e.g. the same autocorrelation) but in which the temporal order of events is destroyed. Examples are: resampling in frequency by randomising the phase of the Fourier-transformed signal, circular shifting in time, and block resampling in time.
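As a small teaser of the first of those methods, here is a minimal phase-randomisation sketch (assuming a real-valued signal of even length; the next posts will cover the details and the other methods):

Xf     = fft(x);
idx    = 2:length(x)/2;                        % positive frequencies (skip DC and Nyquist)
rndph  = 2*pi*rand(length(idx),1);             % random phases
phases = angle(Xf);
phases(idx)       = rndph;
phases(end-idx+2) = -rndph;                    % mirror the phases so the surrogate stays real
x_surr = real(ifft(abs(Xf).*exp(1i*phases)));  % same amplitude spectrum (and autocorrelation), scrambled timing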