Missing data: masked arrays

When working with real oceanographic data sets, there are often gaps. One way of handling them is to fill them with NaN, and then write functions that treat NaNs however we desire. For example, we might want calculate the mean of all values that are not NaN. In some such cases, there is already a numpy function to do this type of calculation: numpy now includes nanmean, nanmax, nanmin, nanargmax, nanargmin, nanstd, nanvar, and nansum. The use of NaN as a bad value flag is typical in Matlab code.

Numpy, however, provides an alternative way to handle missing data: the numpy.ma module, with its MaskedArray subclass of the fundamental numpy.ndarray class. There are a few rough edges in numpy.ma, but it has some substantial advantages over relying on NaN, so I use it extensively. It is well supported in Matplotlib, and is used by default in the netCDF4 package.

Advantages of masked arrays include:

  • They work with any type of data, not just with floating point.
  • They can lead to simpler, more concise code.

Regardless of the degree to which you end up using masked arrays in your own code, you will encounter them, so you need to know at least a few things about them. This notebook barely scratches the surface. For a thorough introduction, see the numpy reference docs for ma. And, of course, check the docstrings directly as you are reading or writing code.

The bare minimum: conversion to ndarray

Suppose you are using a library that reads a file (e.g., netCDF) and returns the results as a masked array. Or, perhaps it might return either an ndarray or a masked array. Further, suppose you want to do your work with NaN-filled ndarrays. Here is an example of how you can ensure you end up with a floating point array in which np.nan replaces originally masked values, if any:

In [1]:
import numpy as np

# Later we will use random number sequences.
rng = np.random.default_rng(2020)

# make an example of an 8-bit integer masked array:
x = np.ma.array([1, 100, 2, 3], mask=[False, True, False, False],
                dtype=np.int8)
print("The input integer masked array is:")
print(x)

xnan = np.ma.filled(x.astype(float), np.nan)
print("Converted to double precision, with nan:")
print(xnan)
The input integer masked array is:
[1 -- 2 3]
Converted to double precision, with nan:
[ 1. nan  2.  3.]

We first used the astype(float) method call to generate a double precision array. (In this case we could also use single precision with no loss of accuracy.) This method is available to ndarrays and to masked arrays, so it would work even if x were an ndarray. Next, this floating point array is used as the first argument to the np.ma.filled function, which returns an ndarray of the same dtype, but with its second argument used to replace the masked values. If its first argument is already an ndarray, np.ma.filled returns that argument unchanged.

There are other ways of accomplishing this nan-filled conversion, sometimes more efficiently (that is, without copying the data unnecessarily), but the method above is adequate for now.

Taking advantage of masked arrays

Now let's see how we can take advantage of masked arrays instead of immediately converting them into ndarrays.

If we have input that might be a masked array or an ndarray, possibly with NaN and/or inf values, we can start by converting, if necessary, to a masked array:

In [2]:
# sample input ndarray:
x = np.array([1.0, 2.5, np.nan, 1.3, np.inf, 7.2])
print("input array with bad values:")
print(x)

xm = np.ma.masked_invalid(x)
print("masked version:")
print(xm)
input array with bad values:
[1.  2.5 nan 1.3 inf 7.2]
masked version:
[1.0 2.5 -- 1.3 -- 7.2]

The masked array has nearly all of the methods that an ndarray has, and a few special ones of its own. For example, to find out how many unmasked values it contains, there is the count method:

In [3]:
print("xm has", xm.count(), "unmasked values")
xm has 4 unmasked values

To extract an ndarray containing only the unmasked values, use the compressed method:

In [4]:
print("unmasked values are", xm.compressed())
unmasked values are [1.  2.5 1.3 7.2]

For both ndarrays and masked arrays, there are often functions that correspond to methods, and vice versa. An advantage of using methods is that they inherently "do the right thing"--the method of a masked array includes functionality to deal with the mask. With both methods and functions, it is not always obvious when the returned object will be an ndarray and when it will be a masked array, so when it matters it is wise to check, either with a test or by reading the documentation.

Sometimes it is useful to extract the mask, perhaps to use for masking another variable. Use the np.ma.getmaskarray function to get a full-size boolean mask corresponding to a given array, masked or not:

In [5]:
x = np.arange(12).reshape(3, 4)
print("sample ndarray, x:")
print(x)
print("\nnp.ma.getmaskarray(x):")
print(np.ma.getmaskarray(x))
sample ndarray, x:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

np.ma.getmaskarray(x):
[[False False False False]
 [False False False False]
 [False False False False]]

When a masked array is created with no masked values, it's mask attribute does not contain a full Boolean array; this is to save space and time, in case it turns out that nothing ever needs to be masked:

In [6]:
xm = np.ma.arange(6).reshape(2, 3)
print("fresh masked array, xm:")
print(xm)
print("\nxm.mask is actually np.ma.nomask, but prints as:")
print(xm.mask)
fresh masked array, xm:
[[0 1 2]
 [3 4 5]]

xm.mask is actually np.ma.nomask, but prints as:
False

Masking values in a masked array

We have already seen one way of ending up with masked values: using np.ma.masked_invalid. There are many more, e.g.:

In [7]:
x = np.arange(10)
xm = np.ma.masked_greater(x, 5)
print(xm)
[0 1 2 3 4 5 -- -- -- --]

Mask a set of values in an existing masked array using indexing:

In [8]:
xm = np.ma.arange(5)
xm[[1, 3]] = np.ma.masked
print(xm)
[0 -- 2 -- 4]

Math

For operations like addition and multiplication, a masked value acts like a NaN: the output is masked. Division is more interesting: division by zero yields a masked value, not an error:

In [9]:
x = np.ma.array([1, 2, 3], mask=[False, False, True])
y = np.ma.array([1, 0, 1])
print(x * y)
print(x/y)
[1 0 --]
[1.0 -- --]

Similarly, evaluating a function with arguments outside the domain of that function simply yields masked values:

In [10]:
print(np.ma.arcsin([0.8, 1, 1.5]))
[0.9272952180016123 1.5707963267948966 --]

Compare the NaN alternative

In the examples above, a mathematical operation silently masks the locations of invalid results. What happens without masked arrays?

In [11]:
np.arcsin([0.8, 1, 1.5])
/Users/efiring/miniconda3/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in arcsin
  """Entry point for launching an IPython kernel.
Out[11]:
array([0.92729522, 1.57079633,        nan])

We get a warning along with the desired result. We can suppress the warning like this:

In [12]:
with np.errstate(invalid='ignore'):
    np.arcsin([0.8, 1, 1.5])

but having to include such explicit handling of warnings is not always desirable.

Speed

In cases where everything is done using floating point, so missing values could be handled with Nan, masked arrays incur a speed penalty. Depending on the application for which the code is written, the convenience of masked arrays might or might not outweigh this penalty. We will show a couple of quick tests based on the example above to get an idea of how large the penalty might be.

In [13]:
x = rng.random((1000000,))
x[x > 1] = np.nan
y = x.copy()
y[y < 0] = 0
%timeit x / y
1.05 ms ± 58.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [14]:
xm = np.ma.masked_invalid(x)
%timeit xm / y
30.4 ms ± 861 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

A factor of 30 for simple division: that's pretty bad. I suspect it could be greatly reduced by taking advantage of numpy capabilities that did not exist when the masked array sub-package was originally written, and possibly including a slight change in behavior, but for now we are stuck with it.

In [15]:
%timeit x * y
1.73 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]:
%timeit xm * y
1.96 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Multiplication: very little penalty.

In [17]:
x = rng.random((1000000,)) * 1.5
%timeit np.arcsin(x)
/Users/efiring/miniconda3/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in arcsin
  """Entry point for launching an IPython kernel.
12.2 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [18]:
%timeit np.ma.arcsin(x)
19.1 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Less than a factor of 2 for the function call: not so bad.

Methods and functions

Let's compare approaches with simple statistics, like mean and standard deviation. Make 3 arrays, one with no bad values, a second with some nans, and a third as a masked array version of the second.

In [19]:
x = rng.standard_normal((1000000,))
xnan = x.copy()
xnan[x < 0] = np.nan
xma = np.ma.masked_invalid(xnan)

The .mean() and .std() methods work on the ndarray with no bad values and on the masked array with or without bad values, assuming that what we want is the mean of the good values, ignoring the bad.

In [20]:
print(x.mean(), xnan.mean(), xma.mean())
0.0014130246813656977 nan 0.7965261413544447

To get the mean of the non-nan elements we need a function.

In [21]:
print(np.nanmean(xnan))
0.7965261413544447

Compare the speeds:

In [22]:
%timeit xma.mean()
7.96 ms ± 798 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [23]:
%timeit np.nanmean(xnan)
8.1 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [24]:
%timeit x.mean()
322 µs ± 68.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Here we see that handling missing values is costly regardless of whether it is done via the masked array method or the nanmean function.

What about the equivalent ways of extracting mask arrays?

In [25]:
%timeit np.isnan(xnan)
304 µs ± 53.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [26]:
%timeit np.ma.getmaskarray(xma)
296 ns ± 7.97 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Very similar.

The special case where all values are nan or masked (or all values along an axis of a multidimensional array when using the axis kwarg) is another instance where using nans generates a warning that is probably just a nuisance.

In [27]:
np.nanmean([np.nan, np.nan])
/Users/efiring/miniconda3/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:1: RuntimeWarning: Mean of empty slice
  """Entry point for launching an IPython kernel.
Out[27]:
nan
In [28]:
np.ma.masked_invalid([np.nan, np.nan]).mean()
Out[28]:
masked