Surface wave statistics

This is a head start for Homework 1, items 2 and 3, due February 3.

Two time series of sea-surface height (SSH) sampled at 1 Hz are provided in a Matlab binary file. The assignment requires first looking at the distribution of SSH itself, and then looking at wave heights, which are peak-to-trough distances. It is suggested that the SSH distribution might be approximately Gaussian, while the heights, which are of course positive-definite, might fit a Rayleigh distribution.

In [1]:
%matplotlib inline

import os
from six.moves import urllib # six is for Python 2/3 compatibility

import numpy as np
import matplotlib.pyplot as plt

# Direct access to Matlab binary file contents
from scipy.io import loadmat

# My more convenient access to Matlab files
from pycurrents.file import matfile
In [2]:
urlbase =  "http://currents.soest.hawaii.edu/ocn_data_analysis_files/data/"
fn = "wave_kahului.mat"

# We will only download a file if we don't already have it.
if not os.path.exists(fn):
    urllib.request.urlretrieve(urlbase + fn, fn)
    print("Retrieved: ", fn)
else:
    print("Found: ", fn)
Found:  wave_kahului.mat

The core reader in Python for Matlab binary files is scipy.io.loadmat. For very simple files it works fine as-is, but for files with more complex internal structure it is awkward. Therefore I wrote a wrapper, loadmatbunch--a function that uses scipy.io.loadmat and then manipulates the data structures to make them easier to use. Let's see what we get first with plain loadmat.

In [3]:
waves = loadmat(fn)
print(waves.keys())
eta1 = waves['eta1']
eta2 = waves['eta2']
print('eta1 is a', type(eta1))
print(eta1.shape, eta1.dtype)
print('eta2 is a', type(eta2))
print(eta2.shape, eta2.dtype)
dict_keys(['eta2', '__version__', '__header__', '__globals__', 'eta1'])
eta1 is a <class 'numpy.ndarray'>
(8192, 1) >f8
eta2 is a <class 'numpy.ndarray'>
(8192, 1) >f8

We see that loadmat has returned a dictionary with two variables, each of which is a 2-D array with a single column. Matlab has stored the 1-D data as 2-D because it doesn't know any better--everything is a matrix. If you want loadmat to to fix this, so that it returns the 1-D array one would expect in Python, you can use the squeeze_me=True kwarg. There are many kwargs that you can use to modify the way that loadmat translates from Matlab to Python. It can be confusing. Therefore I wrote loadmatbunch, which generally does what I want by default. It returns a data container class called a Bunch; if the matfile contains structured data, then the result will be a nested set of Bunches. Don't worry about that for now; all we need is the simplest behavior, which will be made clear by example.

First, let's use showmatbunch to display the contents of the file. It returns a string, so we need to print its output:

In [4]:
print(matfile.showmatbunch(fn))
eta1  : ndarray, shape (8192,), dtype >f8
eta2  : ndarray, shape (8192,), dtype >f8


Now we know what to expect. Let's load the file.

In [5]:
dat = matfile.loadmatbunch(fn)
# A Bunch acts like a dictionary:
print(dat.keys())
print(dat['eta1'].shape, dat['eta1'].dtype)

# but items can also be accessed using dot notation:
print(dat.eta1[:5])
print(dat.eta1.shape, dat.eta1.dtype)
dict_keys(['eta2', 'eta1'])
(8192,) >f8
[-0.07410645 -0.07670645 -0.08030645 -0.06090645 -0.01410645]
(8192,) >f8

This is convenient. The 1-D arrays really are 1-D now, and we can access them either by key (dictionary style) or as attributes (class style).

At this point you should plot the data. I will leave that to you. I suggest plotting a short chunk so that you can resolve the individual waves.

In [6]:
# put your plotting code here...

Now you have the two SSH time series at your fingertips, you know what they look like, and you know from an earlier lesson how to estimate and plot the PDFs. You can proceed with item 2 in the homework.

Wave heights

To calculate the wave heights, you have to divide the time series up into "waves". One way to do this is to define a single wave as the interval between crossings from less than zero to greater than zero. This only makes sense, of course, if the data are first high-pass filtered to remove tides and lower frequency SSH variations. I think the data supplied here are already sufficiently filtered, so you don't have to do your own filtering; about all you could do anyway is to subtract the record mean and trend, since the record is too short to show the tides.

How can we divide the record up into waves based on zero-crossings? Suppose we generate a time series that is 0 where SSH < 0 and 1 elsewhere. Then a transition from 0 to 1 is an upward zero-crossing, which is what we want. We find those transitions by taking the difference of consecutive elements. Let's illustrate with a simple sinusoidal signal.

In [7]:
x = np.linspace(0, 50, 500)
y = np.sin(x)
# Notice that in the following we have to convert
# the Boolean array to an integer; any kind would do.
is_positive = (y >= 0).astype(np.int8)
up = np.diff(is_positive) == 1  # Boolean array
# Above this point, all arrays have the same dimensions.
# Below we find the indices of the True values in up.
i_up = np.nonzero(up)[0]
print(i_up)
[ 62 125 188 250 313 376 438]

I think most of the above should be clear with a little thought. The exception is the call to np.nonzero; that line is doing the equivalent to the Matlab find function. Refer to the documentation for np.nonzero. It is not needed very often; most of the time, it is more convenient to work with indexing directly with Boolean arrays.

Now check to see that our selection of zero-crossings is reasonable:

In [8]:
fig, ax = plt.subplots()
ax.plot(x, y)
ax.plot(x[i_up], y[i_up], 'gx', ms=20)
Out[8]:
[<matplotlib.lines.Line2D at 0x108567cf8>]

We can see that the boundary points are biased one point to the left. We could correct this by adding 1 to each boundary, and then checking to make sure the last boundary is still within the domain, but so long as the waves are reasonably well resolved, it will not have much effect on the calculation of heights.

Now we have a set of start and end points for the "waves". Let's cycle through them, and use the np.ptp (peak-to-peak) function to find the heights.

In [9]:
heights = []
for i0, i1 in zip(i_up[:-1], i_up[1:]):
    heights.append(np.ptp(y[i0:i1]))
print(heights)    
[1.9989144099118685, 1.9989796315828405, 1.9996790070842905, 1.9986453654421283, 1.9988147474998406, 1.9996053386059245]

Looks good. Do you understand why all of these values are a little bit less than 2?

In [9]: