When working with real oceanographic data sets, there are often gaps. One way of handling them is to fill them with NaN, and then write functions that treat NaNs however we desire. For example, we might want to calculate the mean of all values that are not NaN. In some such cases there is already a numpy function to do this type of calculation: numpy now includes nanmean, nanmax, nanmin, nanargmax, nanargmin, nanstd, nanvar, and nansum. The use of NaN as a bad value flag is typical in Matlab code.
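For example, np.nanmean ignores the NaNs that np.mean would propagate; here is a minimal, self-contained illustration:
import numpy as np
a = np.array([1.0, np.nan, 3.0])
print(np.mean(a))     # nan: the NaN propagates through the ordinary mean
print(np.nanmean(a))  # 2.0: the mean of the non-NaN values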
Numpy, however, provides an alternative way to handle missing data: the numpy.ma module, with its MaskedArray subclass of the fundamental numpy.ndarray class. There are a few rough edges in numpy.ma, but it has some substantial advantages over relying on NaN, so I use it extensively. It is well supported in Matplotlib, and is used by default in the netCDF4 package.
Advantages of masked arrays include:
- They work with any dtype, including integers, whereas NaN can be used only with floating point arrays.
- Operations that would produce invalid results, such as division by zero or evaluating a function outside its domain, silently yield masked values instead of warnings or errors.
- Methods such as mean() and std() skip masked values automatically, so no special nan-aware functions are needed.
Each of these points appears in the examples below.
Regardless of the degree to which you end up using masked arrays in your own code, you will encounter them, so you need to know at least a few things about them. This notebook barely scratches the surface. For a thorough introduction, see the numpy reference docs for ma. And, of course, check the docstrings directly as you are reading or writing code.
Suppose you are using a library that reads a file (e.g., netCDF) and returns the results as a masked array. Or, perhaps it might return either an ndarray or a masked array. Further, suppose you want to do your work with NaN-filled ndarrays. Here is an example of how you can ensure you end up with a floating point array in which np.nan replaces originally masked values, if any:
import numpy as np
# Later we will use random number sequences.
rng = np.random.default_rng(2020)
# make an example of an 8-bit integer masked array:
x = np.ma.array([1, 100, 2, 3], mask=[False, True, False, False],
                dtype=np.int8)
print("The input integer masked array is:")
print(x)
xnan = np.ma.filled(x.astype(float), np.nan)
print("Converted to double precision, with nan:")
print(xnan)
We first used the astype(float) method call to generate a double precision array. (In this case we could also use single precision with no loss of accuracy.) This method is available to ndarrays and to masked arrays, so it would work even if x were an ndarray. Next, this floating point array is used as the first argument to the np.ma.filled function, which returns an ndarray of the same dtype, but with its second argument used to replace the masked values. If its first argument is already an ndarray, np.ma.filled returns that argument unchanged.
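As a quick check of that last point (a minimal illustration, not part of the file-reading example):
a = np.array([1.0, 2.0, 3.0])
# np.ma.filled returns an unmasked ndarray unchanged, so this prints True:
print(np.ma.filled(a, np.nan) is a)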
There are other ways of accomplishing this nan-filled conversion, sometimes more efficiently (that is, without copying the data unnecessarily), but the method above is adequate for now.
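For instance, one could skip the astype copy when the input is already an unmasked float ndarray; the helper below is only a sketch, and the name to_nan_filled is invented for this illustration:
def to_nan_filled(a):
    """Sketch: return a float ndarray with NaN in place of any masked values."""
    if not np.ma.isMaskedArray(a):
        # No mask to fill; asarray avoids a copy if a is already a float ndarray.
        return np.asarray(a, dtype=float)
    # Masked array: convert to float and fill the masked slots with NaN.
    return np.ma.filled(a.astype(float), np.nan)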
Now let's see how we can take advantage of masked arrays instead of immediately converting them into ndarrays.
If we have input that might be a masked array or an ndarray, possibly with NaN and/or inf values, we can start by converting, if necessary, to a masked array:
# sample input ndarray:
x = np.array([1.0, 2.5, np.nan, 1.3, np.inf, 7.2])
print("input array with bad values:")
print(x)
xm = np.ma.masked_invalid(x)
print("masked version:")
print(xm)
The masked array has nearly all of the methods that an ndarray has, and a few special ones of its own. For example, to find out how many unmasked values it contains, there is the count method:
print("xm has", xm.count(), "unmasked values")
To extract an ndarray containing only the unmasked values, use the compressed method:
print("unmasked values are", xm.compressed())
For both ndarrays and masked arrays, there are often functions that correspond to methods, and vice versa. An advantage of using methods is that they inherently "do the right thing"--the method of a masked array includes functionality to deal with the mask. With both methods and functions, it is not always obvious when the returned object will be an ndarray and when it will be a masked array, so when it matters it is wise to check, either with a test or by reading the documentation.
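For example, np.ma.isMaskedArray provides such a test; here is a small illustration:
result = np.ma.masked_invalid(np.array([1.0, np.nan]))
print(np.ma.isMaskedArray(result))     # True
print(isinstance(result, np.ndarray))  # also True; MaskedArray subclasses ndarray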
Sometimes it is useful to extract the mask, perhaps to use for masking another variable. Use the np.ma.getmaskarray function to get a full-size boolean mask corresponding to a given array, masked or not:
x = np.arange(12).reshape(3, 4)
print("sample ndarray, x:")
print(x)
print("\nnp.ma.getmaskarray(x):")
print(np.ma.getmaskarray(x))
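To illustrate using one array's mask on another variable, here is a made-up example; the arrays zm and y are invented for this sketch:
zm = np.ma.masked_greater(np.arange(6), 3)
y = np.linspace(10.0, 15.0, 6)
# Apply zm's mask to y, masking the same locations:
ym = np.ma.array(y, mask=np.ma.getmaskarray(zm))
print(ym)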
When a masked array is created with no masked values, its mask attribute does not contain a full Boolean array; this is to save space and time, in case it turns out that nothing ever needs to be masked:
xm = np.ma.arange(6).reshape(2, 3)
print("fresh masked array, xm:")
print(xm)
print("\nxm.mask is actually np.ma.nomask, but prints as:")
print(xm.mask)
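We can verify that the attribute really is the shared nomask sentinel:
print(xm.mask is np.ma.nomask)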
We have already seen one way of ending up with masked values: using np.ma.masked_invalid. There are many more, e.g.:
x = np.arange(10)
xm = np.ma.masked_greater(x, 5)
print(xm)
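Two more functions from the same family are np.ma.masked_where, which takes an explicit boolean condition, and np.ma.masked_outside, which masks values outside an interval:
print(np.ma.masked_where(x % 2 == 0, x))
print(np.ma.masked_outside(x, 2, 7))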
Mask a set of values in an existing masked array using indexing:
xm = np.ma.arange(5)
xm[[1, 3]] = np.ma.masked
print(xm)
For operations like addition and multiplication, a masked value acts like a NaN: the output is masked. Division is more interesting: division by zero yields a masked value, not an error:
x = np.ma.array([1, 2, 3], mask=[False, False, True])
y = np.ma.array([1, 0, 1])
print(x * y)
print(x/y)
Similarly, evaluating a function with arguments outside the domain of that function simply yields masked values:
print(np.ma.arcsin([0.8, 1, 1.5]))
In the examples above, a mathematical operation silently masks the locations of invalid results. What happens without masked arrays?
np.arcsin([0.8, 1, 1.5])
We get a warning along with the desired result. We can suppress the warning like this:
with np.errstate(invalid='ignore'):
    np.arcsin([0.8, 1, 1.5])
but having to include such explicit handling of warnings is not always desirable.
In cases where everything is done using floating point, so missing values could be handled with NaN, masked arrays incur a speed penalty. Depending on the application for which the code is written, the convenience of masked arrays might or might not outweigh this penalty. We will show a couple of quick tests based on the example above to get an idea of how large the penalty might be.
x = rng.random((1000000,))
x[x > 1] = np.nan
y = x.copy()
y[y < 0] = 0
%timeit x / y
xm = np.ma.masked_invalid(x)
%timeit xm / y
A factor of 30 for simple division: that's pretty bad. I suspect it could be greatly reduced by taking advantage of numpy capabilities that did not exist when the masked array sub-package was originally written, and possibly including a slight change in behavior, but for now we are stuck with it.
%timeit x * y
%timeit xm * y
Multiplication: very little penalty.
x = rng.random((1000000,)) * 1.5
%timeit np.arcsin(x)
%timeit np.ma.arcsin(x)
Less than a factor of 2 for the function call: not so bad.
Let's compare approaches with simple statistics, like mean and standard deviation. Make 3 arrays, one with no bad values, a second with some nans, and a third as a masked array version of the second.
x = rng.standard_normal((1000000,))
xnan = x.copy()
xnan[x < 0] = np.nan
xma = np.ma.masked_invalid(xnan)
The .mean() and .std() methods work on the ndarray with no bad values and on the masked array with or without bad values, assuming that what we want is the mean of the good values, ignoring the bad.
print(x.mean(), xnan.mean(), xma.mean())
To get the mean of the non-NaN elements we need a function:
print(np.nanmean(xnan))
Compare the speeds:
%timeit xma.mean()
%timeit np.nanmean(xnan)
%timeit x.mean()
Here we see that handling missing values is costly regardless of whether it is done via the masked array method or the nanmean function.
What about the equivalent ways of extracting a boolean array marking the bad values?
%timeit np.isnan(xnan)
%timeit np.ma.getmaskarray(xma)
Very similar.
The special case where all values are NaN or masked (or all values along an axis of a multidimensional array when using the axis kwarg) is another instance where using NaNs generates a warning that is probably just a nuisance.
np.nanmean([np.nan, np.nan])
np.ma.masked_invalid([np.nan, np.nan]).mean()