When working with real oceanographic data sets, there are often gaps. One way of handling them is to fill them with NaN, and then write functions that treat NaNs however we desire. For example, we might want to calculate the mean of all values that are not NaN. In many such cases there is already a numpy function for the job: numpy includes `nanmean`, `nanmax`, `nanmin`, `nanargmax`, `nanargmin`, `nanstd`, `nanvar`, and `nansum`. The use of NaN as a bad-value flag is typical in Matlab code.
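
As a quick illustration (a sketch with made-up numbers), the nan-aware functions simply ignore the NaN entries:

```python
import numpy as np

# A short record with one gap filled by NaN.
temps = np.array([10.2, 11.5, np.nan, 9.8])

print(np.nanmean(temps))  # mean of the three valid values
print(np.nanmax(temps))   # largest valid value
print(np.nansum(temps))   # sum over the valid values
```

The ordinary `temps.mean()` would return `nan` here, because any arithmetic involving NaN propagates it.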

Numpy, however, provides an alternative way to handle missing data: the `numpy.ma` module, with its `MaskedArray` subclass of the fundamental `numpy.ndarray` class. There are a few rough edges in `numpy.ma`, but it has some substantial advantages over relying on NaN, so I use it extensively. It is well supported in Matplotlib, and is used by default in the netCDF4 package.

Advantages of masked arrays include:

- They work with any type of data, not just with floating point.
- They can lead to simpler, more concise code.

Regardless of the degree to which you end up using masked arrays in your own code, you will encounter them, so you need to know at least a few things about them. This notebook barely scratches the surface. For a thorough introduction, see the numpy reference docs for ma. And, of course, check the docstrings directly as you are reading or writing code.

Suppose you are using a library that reads a file (e.g., netCDF) and returns the results as a masked array, or perhaps returns either an ndarray or a masked array. Further, suppose you want to do your work with NaN-filled ndarrays. Here is an example of how you can ensure you end up with a floating point array in which `np.nan` replaces originally masked values, if any:

In [1]:

```
import numpy as np
# Later we will use random number sequences.
rng = np.random.default_rng(2020)
# make an example of an 8-bit integer masked array:
x = np.ma.array([1, 100, 2, 3], mask=[False, True, False, False],
                dtype=np.int8)
print("The input integer masked array is:")
print(x)
xnan = np.ma.filled(x.astype(float), np.nan)
print("Converted to double precision, with nan:")
print(xnan)
```

We first used the `astype(float)` method call to generate a double precision array. (In this case we could also use single precision with no loss of accuracy.) This method is available to ndarrays and to masked arrays, so it would work even if x were an ndarray. Next, this floating point array is used as the first argument to the `np.ma.filled` function, which returns an ndarray of the same dtype, but with its second argument used to replace the masked values. If its first argument is already an ndarray, `np.ma.filled` returns that argument unchanged.

There are other ways of accomplishing this nan-filled conversion, sometimes more efficiently (that is, without copying the data unnecessarily), but the method above is adequate for now.
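
As one sketch of a more economical approach (the helper name `to_nan_filled` is our own, not a numpy function), we can fill only when the input is actually a masked array, so an already-float ndarray passes through without a copy:

```python
import numpy as np

def to_nan_filled(a):
    """Return a float ndarray with NaN where `a` was masked.

    Illustrative sketch: avoids an unnecessary copy when `a`
    is already an unmasked float64 ndarray.
    """
    if np.ma.isMaskedArray(a):
        return np.ma.filled(a.astype(float), np.nan)
    return np.asarray(a, dtype=float)

x = np.ma.array([1, 2, 3], mask=[False, True, False])
print(to_nan_filled(x))  # masked entry becomes nan
```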

Now let's see how we can *take advantage* of masked arrays instead of immediately converting them into ndarrays.

If we have input that might be a masked array or an ndarray, possibly with NaN and/or inf values, we can start by converting, if necessary, to a masked array:

In [2]:

```
# sample input ndarray:
x = np.array([1.0, 2.5, np.nan, 1.3, np.inf, 7.2])
print("input array with bad values:")
print(x)
xm = np.ma.masked_invalid(x)
print("masked version:")
print(xm)
```

The masked array has nearly all of the methods that an ndarray has, and a few special ones of its own. For example, to find out how many unmasked values it contains, there is the `count` method:

In [3]:

```
print("xm has", xm.count(), "unmasked values")
```

To extract an ndarray containing only the unmasked values, use the `compressed` method:

In [4]:

```
print("unmasked values are", xm.compressed())
```

For both ndarrays and masked arrays, there are often functions that correspond to methods, and vice versa. An advantage of using methods is that they inherently "do the right thing": the method of a masked array includes functionality to deal with the mask. With both methods and functions, it is not always obvious when the returned object will be an ndarray and when it will be a masked array, so when it matters it is wise to check, either with a test or by reading the documentation.
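
Such a runtime test can be as simple as `np.ma.isMaskedArray` (a minimal sketch):

```python
import numpy as np

xm = np.ma.masked_invalid([1.0, np.nan, 3.0])

# Did the operation hand back a masked array or a plain ndarray?
result = xm * 2
if np.ma.isMaskedArray(result):
    print("masked array; mask is", np.ma.getmaskarray(result))
else:
    print("plain ndarray")
```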

Sometimes it is useful to extract the mask, perhaps to use for masking another variable. Use the `np.ma.getmaskarray` function to get a full-size boolean mask corresponding to a given array, masked or not:

In [5]:

```
x = np.arange(12).reshape(3, 4)
print("sample ndarray, x:")
print(x)
print("\nnp.ma.getmaskarray(x):")
print(np.ma.getmaskarray(x))
```

When a masked array is created with no masked values, its `mask` attribute does not contain a full Boolean array; this is to save space and time, in case it turns out that nothing ever needs to be masked:

In [6]:

```
xm = np.ma.arange(6).reshape(2, 3)
print("fresh masked array, xm:")
print(xm)
print("\nxm.mask is actually np.ma.nomask, but prints as:")
print(xm.mask)
```

We have already seen one way of ending up with masked values: using `np.ma.masked_invalid`. There are many more, e.g.:

In [7]:

```
x = np.arange(10)
xm = np.ma.masked_greater(x, 5)
print(xm)
```

Mask a set of values in an existing masked array using indexing:

In [8]:

```
xm = np.ma.arange(5)
xm[[1, 3]] = np.ma.masked
print(xm)
```
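
The reverse also works (a side note worth knowing): assigning an ordinary value to a masked element clears the mask at that position, because the default mask is "soft"; see `harden_mask` if you want to prevent this.

```python
import numpy as np

xm = np.ma.arange(5)
xm[[1, 3]] = np.ma.masked
# Assigning a real value unmasks that position again:
xm[1] = 99
print(xm)  # element 1 is unmasked; element 3 is still masked
```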

For operations like addition and multiplication, a masked value acts like a NaN: the output is masked. Division is more interesting: division by zero yields a masked value, not an error:

In [9]:

```
x = np.ma.array([1, 2, 3], mask=[False, False, True])
y = np.ma.array([1, 0, 1])
print(x * y)
print(x/y)
```

Similarly, evaluating a function with arguments outside the domain of that function simply yields masked values:

In [10]:

```
print(np.ma.arcsin([0.8, 1, 1.5]))
```

In the examples above, a mathematical operation silently masks the locations of invalid results. What happens without masked arrays?

In [11]:

```
np.arcsin([0.8, 1, 1.5])
```


We get a warning along with the desired result. We can suppress the warning like this:

In [12]:

```
with np.errstate(invalid='ignore'):
    np.arcsin([0.8, 1, 1.5])
```

but having to include such explicit handling of warnings is not always desirable.

In cases where everything is done using floating point, so missing values could be handled with NaN, masked arrays incur a speed penalty. Depending on the application, the convenience of masked arrays might or might not outweigh this penalty. We will run a couple of quick tests based on the example above to get an idea of how large the penalty can be.

In [13]:

```
x = rng.standard_normal((1000000,))
y = x.copy()
y[y < 0] = 0       # zeros to divide by
x[x > 1] = np.nan  # bad values in the numerator
%timeit x / y
```

In [14]:

```
xm = np.ma.masked_invalid(x)
%timeit xm / y
```

A factor of 30 for simple division: that's pretty bad. I suspect it could be greatly reduced by taking advantage of numpy capabilities that did not exist when the masked array sub-package was originally written, and possibly including a slight change in behavior, but for now we are stuck with it.
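
One such shortcut is sketched below: do the division on the raw ndarrays with floating point warnings suppressed, then build the mask from the invalid results afterwards. This is our own illustration of the idea, not how `numpy.ma` is actually implemented:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = rng.standard_normal(1_000_000)
y[y < 0] = 0.0   # introduce zeros to divide by

# Divide as plain ndarrays (fast), silencing the warnings, then
# mask wherever the quotient came out inf or nan.
with np.errstate(divide='ignore', invalid='ignore'):
    z = np.ma.masked_invalid(x / y)
print(z.count(), "valid quotients out of", z.size)
```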

In [15]:

```
%timeit x * y
```

In [16]:

```
%timeit xm * y
```

Multiplication: very little penalty.

In [17]:

```
x = rng.random((1000000,)) * 1.5
%timeit np.arcsin(x)
```

In [18]:

```
%timeit np.ma.arcsin(x)
```

Less than a factor of 2 for the function call: not so bad.

Let's compare approaches with simple statistics, like mean and standard deviation. Make 3 arrays, one with no bad values, a second with some nans, and a third as a masked array version of the second.

In [19]:

```
x = rng.standard_normal((1000000,))
xnan = x.copy()
xnan[x < 0] = np.nan
xma = np.ma.masked_invalid(xnan)
```

The `.mean()` and `.std()` methods work on the ndarray with no bad values and on the masked array with or without bad values, assuming that what we want is the mean of the good values, ignoring the bad.

In [20]:

```
print(x.mean(), xnan.mean(), xma.mean())
```

To get the mean of the non-nan elements we need a function.

In [21]:

```
print(np.nanmean(xnan))
```

Compare the speeds:

In [22]:

```
%timeit xma.mean()
```

In [23]:

```
%timeit np.nanmean(xnan)
```

In [24]:

```
%timeit x.mean()
```

Here we see that handling missing values is costly regardless of whether it is done via the masked array method or the `nanmean` function.

What about the equivalent ways of extracting a boolean mask?

In [25]:

```
%timeit np.isnan(xnan)
```

In [26]:

```
%timeit np.ma.getmaskarray(xma)
```

Very similar.

The special case where all values are nan or masked (or all values along an axis of a multidimensional array when using the `axis` kwarg) is another instance where using nans generates a warning that is probably just a nuisance.

In [27]:

```
np.nanmean([np.nan, np.nan])
```


In [28]:

```
np.ma.masked_invalid([np.nan, np.nan]).mean()
```
