HOT cruise mean profiles

The HOT program has maintained nominally monthly cruises to Station ALOHA since October 1988. Each cruise includes a 36-hour time series of 1000-dbar CTD casts, plus one or two casts to the bottom. The full data set is available, but for now let's look at a reduced set: cruise-averaged profiles.

http://www.soest.hawaii.edu/HOT_WOCE/ftp.html
ftp://mananui.soest.hawaii.edu/pub/hot/ctd/aloha_mean/Readme.format

In [1]:
%matplotlib inline

import os
import time
from ftplib import FTP

import numpy as np
import matplotlib.pyplot as plt

import six
from six.moves import urllib # six is for Python 2/3 compatibility
from io import open  # more 2/3 compatibility

from pycurrents.codas import to_day
In [2]:
host = "mananui.soest.hawaii.edu"
directory = "/pub/hot/ctd/aloha_mean/"
urlbase = "ftp://" + host + directory

The data are in ascii files, one per cruise. We will download each file, read it into an array, and save the list of arrays. Then for subsequent access we can just load that list of arrays. It will actually be saved and loaded as an ndarray with the object dtype.
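The save-and-reload pattern described above can be sketched in miniature. This is a hedged illustration with made-up arrays (the filename `demo_arrays.npy` is invented for the example): a list of unequal-length arrays is stored as a single `.npy` file with object dtype, then loaded back.

```python
import numpy as np

# Two arrays of different lengths, standing in for the per-cruise arrays.
arrays = [np.arange(5, dtype=np.float32),
          np.arange(8, dtype=np.float32)]

# Saving a ragged list forces object dtype.
np.save("demo_arrays.npy", np.array(arrays, dtype=object))

# Recent numpy versions require allow_pickle=True to load object arrays.
loaded = np.load("demo_arrays.npy", allow_pickle=True)
print(loaded.dtype)   # object
print(len(loaded[1])) # 8
```

The round trip preserves each array unchanged; only the container becomes an ndarray of object dtype.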

In [3]:
arrayfile = "hot_arrays.npy"

if not os.path.exists(arrayfile):
    # If you know you will run only on Python 3.3 or higher, you
    # don't need the first block here.  The "if" block would also
    # work on Python 3, but the context-manager syntax in the
    # "else" block is nicer, so I am leaving both in place.
    if six.PY2:
        ftp = FTP(host, "anonymous")
        tmp = ftp.cwd(directory)
        print(tmp)
        files = ftp.nlst()
        ftp.quit()
    else:    
        with FTP(host, "anonymous") as ftp:
            tmp = ftp.cwd(directory)
            print(tmp)
            files = ftp.nlst()

    files = [fn for fn in files if fn.endswith('.mn')]
    
    def num_from_filename(fn):
        digits = fn[3:].split('.')[0]
        return int(digits)

    files = sorted(files, key=num_from_filename)

    tempfn = "hot.tmp"
    arrays = []
    for fn in files:      
        urllib.request.urlretrieve(urlbase + fn, tempfn)
        arrays.append(np.loadtxt(tempfn, dtype=np.float32))
    np.save(arrayfile, arrays)

arrays = np.load(arrayfile, allow_pickle=True)  # object dtype requires allow_pickle

The first line of each file, and therefore the first row of each array, is a header. Its first 5 entries, from the Readme.format file referenced earlier, are:

         1-3      i3    HOT Cruise number
         4-5      i2    Station number
         6-7      i2    Year (last two digits)
         8-9      i2    Month
        10-12     i3    Day

Notice that this format was designed before people were routinely thinking about the Y2K problem.
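The two-digit-year fixup applied below can be sketched on its own. A minimal example, with made-up year values; the pivot of 70 is safe because HOT began in 1988:

```python
import numpy as np

yy = np.array([88, 99, 0, 15])   # two-digit years from the headers

# Order matters: after the first line the 1900s years are all > 70,
# so the second mask only catches the untouched 2000s years.
yy[yy > 70] += 1900
yy[yy < 70] += 2000
print(yy)  # [1988 1999 2000 2015]
```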

Now gather the headers and unpack them into separate arrays for the useful variables. Calculate a single time-in-days variable, and also time in years, for convenience in plotting the multi-year time series. (We could also use the calendar date handling facilities in matplotlib, but for now it seems simpler to stick with floating point arrays.)

In [4]:
ncr = len(arrays)    
headers = np.zeros((ncr, 5), dtype=int)
for i, arr in enumerate(arrays):
    headers[i] = arr[0, :5]

cruisenumber = headers[:, 0]
station = headers[:, 1]   # This is always 2, Station ALOHA          
years = headers[:, 2]
# Order matters: after the first line the 1900s years exceed 70,
# so the second mask catches only the untouched 2000s years.
years[years > 70] += 1900
years[years < 70] += 2000
months = headers[:, 3]   # 1-based calendar month
days_in_month = headers[:, 4]  # calendar day in month
dday2000 = to_day(2000, years, months, days_in_month)
yeartime = 2000 + dday2000 / 365.25

Let's see how the data are distributed in time, and also check for missing cruises.

In [5]:
fig, axs = plt.subplots(nrows=2, sharex=True)
ax = axs[0]
ax.plot(yeartime, months, '+')
ax.set_ylabel('month')
ax.set_ylim([0.5, 12.5])

ax = axs[1]
ax.plot(yeartime[:-1], np.diff(cruisenumber), 'o')
ax.set_ylabel('cruise number diff')
Out[5]:
<matplotlib.text.Text at 0x107b8cc18>
In [6]:
fig, ax = plt.subplots()
bins = np.arange(0.5, 13, 1)
ax.hist(months, bins)
ax.set_ylabel('cruises per month')
ax.set_xlabel('month')
ax.set_ylim(0, 26)
ax.set_xlim(0, 13)
Out[6]:
(0, 13)

Enough about the distribution of cruises; time to start looking at the actual profiles.

The number of data fields is variable, but the first few are:

   Data Record Format:

        Column  Format  Item
         1-8    f8.1    Pressure (Decibars)
         9-18   f10.4   Temperature (Degrees Celsius,
                        International Temperature Scale of 1990)
        19-28   f10.4   Salinity (1978 International Practical Salinity Scale)
        29-36   f8.2    Oxygen (micromoles per kilogram)
        37-43   f7.3    Transmission (% transmission)
        44-51   f8.4    Chloropigments (microgram/liter)
        52-57   i6      Number of CTD casts averaged at this pressure level

Now we want to form a single 2-D array for each variable of interest, starting with temperature, salinity, and oxygen. The gridding is already uniform, 2-dbar intervals starting at 0, so we don't need to interpolate now. Later we might want to interpolate to fill some gaps.
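The gap-filling interpolation mentioned above can be sketched with `np.interp`. This is a hedged example on a made-up short profile (the grid and values are invented; the 2-dbar spacing mimics the HOT arrays):

```python
import numpy as np

pgrid = np.arange(0, 20, 2, dtype=float)
# A toy profile with gaps, masked the way the HOT arrays are.
prof = np.ma.masked_invalid(
    np.array([25.0, 24.8, np.nan, 24.2, 24.0,
              23.5, np.nan, np.nan, 22.0, 21.5]))

# Interpolate from the valid points back onto the full grid.
good = ~prof.mask
filled = np.interp(pgrid, pgrid[good], prof.compressed())
print(filled[2])  # 24.5, halfway between the neighbors 24.8 and 24.2
```

Linear interpolation is fine for small gaps like this; wider gaps would deserve more care, or masking.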

In [7]:
pgrid = np.arange(0, 4900, 2)
nd = len(pgrid)

temp = np.ma.masked_all((ncr, nd), dtype=np.float32)
sal = np.ma.masked_all((ncr, nd), dtype=np.float32)
oxy = np.ma.masked_all((ncr, nd), dtype=np.float32)
navg = np.zeros((ncr, nd))
for i, arr in enumerate(arrays):
    n = arr.shape[0] - 1  # number of data rows, excluding the header
    # The bad value is -9; but good values are all positive.
    temp[i, :n] = np.ma.masked_less(arr[1:, 1], 0)
    sal[i, :n] = np.ma.masked_less(arr[1:, 2], 0)
    oxy[i, :n] = np.ma.masked_less(arr[1:, 3], 0)
    if arr.shape[1] >= 7:
        navg[i, :n] = arr[1:, 6]
    else:
        # The hot 5 file is an oddball, as of 2015/02/02.
        navg[i, :n] = arr[1:, 4]
    

The first thing to do is always to look at the data. Let's make some simple plots of the top 1000 dbar.

In [8]:
dsl = slice(501)
p_top = pgrid[dsl]
t_top = temp[:, dsl]
s_top = sal[:, dsl]
o_top = oxy[:, dsl]

print(t_top.min(), t_top.max())
print(s_top.min(), s_top.max())
print(o_top.min(), o_top.max())
3.7485 27.4506
34.0139 35.5254
4.87 238.85

In [9]:
fig, axs = plt.subplots(nrows=3, sharex=True, sharey=True,
                        figsize=(8, 7))
tlevs = np.arange(4, 27.001, 1)
slevs = np.arange(34.1, 35.501, 0.1)
olevs = np.arange(0, 241, 20)

ax = axs[0]
cs = ax.contourf(yeartime, p_top, t_top.T, levels=tlevs, 
                 extend='both', cmap=plt.cm.Reds)
fig.colorbar(cs, ax=ax).set_label('Temperature')

ax = axs[1]
cs = ax.contourf(yeartime, p_top, s_top.T, levels=slevs, 
                 extend='both', cmap=plt.cm.Oranges)
fig.colorbar(cs, ax=ax).set_label('Salinity')

ax = axs[2]
cs = ax.contourf(yeartime, p_top, o_top.T, levels=olevs,
                 extend='both', cmap=plt.cm.Greens)
fig.colorbar(cs, ax=ax).set_label('Oxygen')

ax.invert_yaxis()

Another standard summary view is via a T-S plot. We will show it two ways, first cycling the color with the cruises, and second with the color based on oxygen.

In [10]:
fig, ax = plt.subplots()
ax.plot(s_top.T, t_top.T, marker='.', mec='none')
ax.set_xlabel('Salinity')
ax.set_ylabel('Temperature')
Out[10]:
<matplotlib.text.Text at 0x10c958358>
In [11]:
fig, ax = plt.subplots()
sc = ax.scatter(s_top, t_top, c=o_top, marker='.', cmap=plt.cm.Greens,
                linewidths=(0,))
ax.set_xlabel('Salinity')
ax.set_ylabel('Temperature')
cbar = fig.colorbar(sc, ax=ax, shrink=0.8, extend='max')
cbar.set_label(r'Oxygen ($\mu$mol/kg)')  # raw string: \m is not an escape
ax.patch.set_color('0.8')  # gray background for contrast with low oxygen
ax.set_xlim(34.0, 35.6)
Out[11]:
(34.0, 35.6)

We can also look at variation with time at a given depth. Here is how we did it in class:

In [12]:
ip = 150
print(p_top[ip])

fig, axs = plt.subplots(nrows=3, sharex=True)
ax = axs[0]
ax.plot(yeartime, t_top[:, ip])

ax = axs[1]
ax.plot(yeartime, s_top[:, ip])

ax = axs[2]
ax.plot(yeartime, o_top[:, ip])
300

Out[12]:
[<matplotlib.lines.Line2D at 0x10774da20>]

Statistics for a given month

We will illustrate the technique of compiling and plotting statistics for a given month using January as an example.

In [13]:
month = 1
monthmask = months == month
month_t = t_top[monthmask]
month_s = s_top[monthmask]
month_o = o_top[monthmask]

Now we have much smaller 2-D arrays, with only the rows for the Januaries. Let's look at statistics for temperature. Notice that we are calculating the statistics along axis 0, the time axis--the first dimension of the array--so we get a 1-D array of values as a function of pressure.
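The axis-0 reduction behaves the same on a tiny made-up (cruises x depths) array, which also shows how masked entries are excluded automatically. A minimal sketch with invented numbers:

```python
import numpy as np

# 3 "cruises" x 3 "depths"; -9 is the bad value, masked out.
a = np.ma.masked_less(np.array([[10.0, 20.0, -9.0],
                                [12.0, 22.0, 30.0],
                                [14.0, 24.0, 32.0]]), 0)

print(a.mean(axis=0))   # [12.0 22.0 31.0] -- one value per depth
print(a.count(axis=0))  # [3 3 2] -- the masked -9 is not counted
```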

In [14]:
fig, ax = plt.subplots()
ax.plot(month_t.mean(axis=0), p_top)
ax.invert_yaxis()

A nice way of plotting this together with the spread is to use a shaded band of +- one standard deviation.

In [15]:
mn = month_t.mean(axis=0)
std = month_t.std(axis=0)

fig, ax = plt.subplots()
ax.plot(mn, p_top)
ax.fill_betweenx(p_top, mn - std, mn + std, color='b', alpha=0.2)
ax.invert_yaxis()

Notice the use of alpha, which is opacity on a scale of 0 (transparent) to 1 (opaque). It is lightening the color because the background is showing through. Alternatively, we could choose a lighter shade of blue. Alpha is especially useful when plotting data sets one on top of another.

A limitation of this method of showing the std is that it can mislead in the thermocline: the eye is drawn to the thickness of the light blue band, but all of the spread here is in temperature, not pressure. Horizontal errorbars at a few depths would help to clarify this. The Axes errorbar() method can do it, with slice indexing in depth to select points every 100 or 200 dbar. It would probably also be a good idea to plot the standard deviation itself as a profile.
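A sketch of those horizontal errorbars, using made-up stand-ins for the mean and std profiles (the exponential shapes are invented for illustration, not HOT data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented profiles on the same 2-dbar grid as the top 1000 dbar.
p_top = np.arange(0, 1002, 2)
mn = 25 - 20 * (1 - np.exp(-p_top / 300.0))
std = 0.5 + 1.5 * np.exp(-p_top / 200.0)

fig, ax = plt.subplots()
ax.plot(mn, p_top)
sl = slice(None, None, 100)   # every 100th point = every 200 dbar
ax.errorbar(mn[sl], p_top[sl], xerr=std[sl], fmt='o', capsize=3)
ax.invert_yaxis()
```

Because the bars are explicitly horizontal, the spread is unambiguously in temperature rather than pressure.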

To calculate the standard error of the mean (usually abbreviated SEM), you will need to know the number of data points, and this will not always be uniform with depth. Remember, some points are invalid and have been masked out. There won't be many for temperature, but in any case, here is how to calculate (and plot) the number of valid points as a function of depth when using masked arrays:

In [16]:
npts = month_t.count(axis=0)  # masked array "count()" method

fig, ax = plt.subplots()
ax.plot(npts, p_top)
ax.invert_yaxis()

That's a boring case; at least for temperature in January, there were no profiles with gaps in the top 1000 dbars. When you do your calculation of the SEM, though, just use the npts array calculated this way for each month.
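The SEM calculation itself (SEM = std / sqrt(n)) can be sketched with a tiny made-up masked array, including one gap so that n varies with depth. This is a hedged example with invented numbers, not HOT data:

```python
import numpy as np

# 3 "cruises" x 2 "depths"; one bad value (-9) at the second depth.
month_t = np.ma.masked_less(np.array([[20.0, 10.0],
                                      [22.0, -9.0],
                                      [24.0, 14.0]]), 0)

npts = month_t.count(axis=0)               # valid points per depth
sem = month_t.std(axis=0) / np.sqrt(npts)  # standard error of the mean
print(npts)  # [3 2]
```

Dividing by the per-depth `npts` array, rather than a single number, keeps the SEM correct where profiles have gaps.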

The code in this section is intended only to give you hints about how to use some tools; now it is up to you to apply them to Homework 2.
