Spectral analysis part II¶

So far we have looked at spectra for a single time series of a scalar or a vector. Often, however, we want to look at the relationship between two different time series. In the time domain, this can be done with a lagged covariance or correlation; in the frequency domain, these map to the the cross spectrum and the coherence.

%matplotlib notebook

import numpy as np
import matplotlib.pyplot as plt

import scipy.stats as ss
from scipy import signal

from pycurrents.num import spectra

plt.rcParams['figure.dpi'] = 80

We can re-use the datafaker from the previous notebook.

def datafaker(nt, dt=1, freqs=None, color='w',
              amp=1, 
              complex=True,
              repeatable=True):
    """
    Generate fake data with optional sinusoids (all the
    same amplitude) and with red, white, or blue noise
    of arbitrary amplitude.
    
    *nt* : number of points
    *dt* : time increment in arbitrary time units
    *freqs* : None, or a sequence of frequencies in
        cycles per unit time. 
    *color* : 'r', 'w', 'b'
    *amp* : amplitude of red, white, or blue noise
    *complex* : True, False
    *repeatable* : True, False

    Returns t, x
    """
    if repeatable:
        np.random.seed(1)    
    noise = np.random.randn(nt + 1) + 1j * np.random.randn(nt + 1)
    
    if color == 'r':
        noise = np.cumsum(noise) / 10 
        noise -= noise.mean()
    elif color == 'b':
        noise = np.diff(noise)
    noise = noise[:nt]
    x = amp * noise

    t = np.arange(nt, dtype=float) * dt
    
    for f in freqs:
        sinusoid = np.exp(2 * np.pi * 1j * f * t)
        x += sinusoid
    if not complex:
        x = np.real(x)
        
    return t, x

Let's simulate a pair of related time series. They will have a random component (white noise) and a tide. We will chop n_shift off the start of the first replicate and off the end of the second, so the second will lag the first. To visualize this, let the time of each point in the series prior to chopping be an integer, 0-9. Chopping n_shift =2 points off the start of the first and off the end of the second leaves:

    2 3 4 5 6 7 8 9
    0 1 2 3 4 5 6 7

Now, going from left to right (forward in time) we encounter a given number first in the first series, and two intervals later in the second; the second lags the first by 2 points.

To help distinguish between them, we will give the second curve a larger amplitude in the function below.

def two_series(n_shift, nt, dt=1, freqs=None,
              amp=1, 
              complex=True,
              repeatable=True):
    if repeatable:
        np.random.seed(100)
    # shared random part    
    noise = 1.5 * amp * np.random.randn(nt + n_shift)
    # shared harmonic behavior
    t0, s0 = datafaker(nt + n_shift, dt=dt, freqs=freqs,
                       amp=0, complex=complex)
    t = t0[:-n_shift]
    s1 = s0[n_shift:] + noise[n_shift:] + amp * np.random.randn(nt)
    s2 = s0[:-n_shift] + noise[:-n_shift] + amp * np.random.randn(nt)

    s1 -= s1.mean()
    s2 -= s2.mean()
    return t, s1, s2 * 1.5

First, consider a pair of series with no noise and a single frequency.

t, x1, x2 = two_series(8, 100, dt=1/4, freqs=[1/12], amp=0, 
                       complex=False)
fig, ax = plt.subplots()
ax.plot(t, x1, label='series 1')
ax.plot(t, x2, label='series 2')
ax.set_xlabel('hours')
ax.set_ylabel('height (meters)')
ax.set_title('Series 2 lags series 1')
ax.legend()

<matplotlib.legend.Legend at 0x1123639e8>

Each extremum in 2 occurs 2 hours after the similar extremum in 1.

Now let's make a longer series including noise and two semidiurnal tidal constituents.

t, x1, x2 = two_series(3, 600, freqs=[1/12.42, 1/12],
                       amp=0.5,
                       complex=False,
                       repeatable=False)

We will want to plot the covariance of two zero-mean series, $x$ and $y$, as a function of lag, $u$. It can be defined as $$ C_{xy}(u) = \frac{1}{N}\sum_m x(m)\, y(m+u) = \frac{1}{N}\sum_m x(m-u)\, y(m) $$ where we understand m as being taken over the negative to positive range for which the shifted arrays overlap. $N$ is the number of points in $x$ and $y$. Notice that this is the estimator with a built-in taper to zero at large lag (small overlap).

Looking at the first form above we see that if $u$ is positive, $x$ at a given time, $m$, is being multiplied by $y$ at a later time, $m+u$, so $m$ is the amount by which $y$ lags $x$.

The convolution operation, indicated by the asterisk, is similar, but one of the series is reversed. Numpy convolve(x, y) is defined as $$ (x * y)[u] = \sum_m x(m)\, y(u - m) $$ so to use it to calculate the covariance we need to swap the order of the arguments, reverse the second one, and divide by $N$; in other words, with mixed notation, $$C_{xy}(u) = \frac{1}{N}(y * x[::-1]).$$

crosscov12 = signal.fftconvolve(x2, x1[::-1], mode='full') / len(t)
lags = np.arange(-len(t)+1, len(t))

fig, axs = plt.subplots(2)
fig.subplots_adjust(hspace=0.3)
axs[0].plot(t, x1, label='x1')
axs[0].plot(t, x2, label='x2')
axs[0].set_xlabel('hours')
axs[0].legend(loc='upper center', ncol=2)

axs[1].plot(lags, crosscov12)
axs[1].set_xlabel('lag of x2 relative to x1');

In the top plot above, the blue curve is the first series, and it's peaks occur 3 hours before the orange peaks, so it leads in time, as intended. The peak covariance is at a positive lag of 3 hours, consistent with x2 lagging x1.

Now switch to the frequency domain. We will look at the spectrum of each time series individually, and then at the squared coherence and phase to see the relationship between the series as a function of frequency.

The cross spectrum (cross-spectral density; product per unit frequency) is $$S_{xy}(f) = \frac{\Delta t}{N}X^*(f)Y(f) = L_{xy}(f) - iQ{xy}(f)$$ where $X(f)$, $Y(f)$ are the Fourier transforms of $x$ and $y$, and the asterisk indicates the complex conjugate. The real and imaginary parts, $L$ and $Q$, are the co-spectrum and quadrature spectrum. The cross spectrum can be expressed more usefully in polar form. $$S_{xy}(f) = A_{xy}(f) \exp(i\phi_{xy}(f))$$ The squared coherency is then a normalized amplitude, with $\phi$ the coherence phase. $$\gamma_{xy}^2(f) = \frac{A_{xy}^2(f)}{A_{xx}(f)A_{yy}(f)}$$

Note the similarity between squared coherency in the frequency domain and squared correlation coefficient in the time domain.

A smoothing or segment-averaging step is essential for the calculation of coherence. If the cross-spectrum has not been smoothed or ensemble-averaged, the coherence will be unity. Experiment with the nsmooth parameter in the following to see how the various spectral estimates depend on it. Try odd-numbered values from 1 through 13.

nsmooth=7
spec = spectra.spectrum(x1, x2, dt=1/24, nfft=None, 
                        smooth=nsmooth, window='quadratic')
print(spec.keys())

dict_keys(['smooth', 'cohsq', 'overlap', 'nfft', 'psd_x', 'seg_starts', 'phase', 'window', 'axis', 'detrend', 'freq_boundaries', 'freqs', 'psd_xy', 'psd_y'])

To calculate the 95% confidence level for squared coherence we may use the formula below, based on (134) in Mark's notes, which is taken from Emery and Thompson (2004). It is using the fact that nsmooth is twice the raw number of degrees of freedom. If we were using segment-averaging, the number of degrees of freedom would be twice the number of effectively independent segments, taking into account the overlap.

# squared coherence confidence level
if nsmooth <= 1:
    c95 = np.nan
else:    
    c95 = 1 - 0.05 ** (1 / (nsmooth - 1))

fig, axs = plt.subplots(nrows=3, sharex=True)
ax = axs[0]
ax.loglog(spec.freqs, spec.psd_x, 'b',
          spec.freqs, spec.psd_y, 'r')
ax.set_ylabel('PSD')
ax = axs[1]
ax.semilogx(spec.freqs, spec.cohsq)
ax.set_ylabel('coh$^2$')
ax.set_ylim(0, 1)
ax.axhline(c95, color='pink')
ax = axs[2]
ax.semilogx(spec.freqs, np.rad2deg(spec.phase), 'o')
ax.set_ylim(-180, 180)
ax.set_ylabel('phase')
ax.set_xlim(spec.freqs[nsmooth], spec.freqs[-nsmooth])
ax.set_yticks([-90, 0, 90])
ax.set_xlabel('CPD')

Text(0.5,0,'CPD')

Here we see the strong coherence at the tidal peak, with a phase of 90 degrees, consistent with the 3-hour lag. Coherence is noisy elsewhere, but we can discern a pattern of phase increasing with frequency. This is consistent with a constant time lag; the higher the frequency, the greater the phase shift for a given time lag. To see this phase shift more clearly, use a linear frequency scale.

fig, axs = plt.subplots(nrows=3, sharex=True)
ax = axs[0]
ax.semilogy(spec.freqs, spec.psd_x, 'b',
          spec.freqs, spec.psd_y, 'r')
ax.set_ylabel('PSD')
ax = axs[1]
ax.plot(spec.freqs, spec.cohsq)
ax.set_ylabel('coh$^2$')
ax.set_ylim(0, 1)
ax.axhline(c95, color='pink')
ax = axs[2]
ax.plot(spec.freqs, np.rad2deg(spec.phase), 'o')
ax.set_ylim(-180, 180)
ax.set_ylabel('phase')
ax.set_xlim(spec.freqs[nsmooth], spec.freqs[-nsmooth])
ax.set_yticks([-90, 0, 90])
ax.set_xlabel('CPD')

Text(0.5,0,'CPD')

Above, we see that the phase shift is linear in the frequency. This linear relationship results from the time shift between the two series, and it extends over the entire frequency range because the white noise signal that we added to both series spans the entire frequency range.

Note that the coherence phase gives the lag of the first series relative to the second, which is the opposite of the sign convention we used for the lagged correlation.