The ndarray

Any time we are working with numerical data, we will need the numpy package; it is central to scientific computing in Python. Let's import it with the customary abbreviation.

In [1]:
import numpy as np

A little function will save us some typing in this notebook. In its definition below, notice how we are accessing 3 attributes of its ndarray argument.

In [2]:
def show_ndarray(x):
    print("argument type is %s" % type(x))
    print("number of dimensions: %d" % x.ndim)
    print("shape is %s" % str(x.shape))
    print("dtype is %s" % x.dtype)
    print("content is\n%s" % x)

Although the basic array type in numpy is the ndarray class, in practice ndarrays are almost never created by calling the class directly. Instead, various numpy functions are used:

In [3]:
x = np.arange(5)
show_ndarray(x)
argument type is <class 'numpy.ndarray'>
number of dimensions: 1
shape is (5,)
dtype is int64
content is
[0 1 2 3 4]

The arange function takes arguments start, stop, and step; as with Python's built-in range, the stop value itself is excluded:

In [4]:
x = np.arange(3.2, 5.7, 0.11)
show_ndarray(x)
argument type is <class 'numpy.ndarray'>
number of dimensions: 1
shape is (23,)
dtype is float64
content is
[ 3.2   3.31  3.42  3.53  3.64  3.75  3.86  3.97  4.08  4.19  4.3   4.41
  4.52  4.63  4.74  4.85  4.96  5.07  5.18  5.29  5.4   5.51  5.62]

Sometimes it is more convenient to specify how many numbers you want within a range; for that, use linspace. Check the docstring to find out what additional keyword argument can be supplied.

In [5]:
x = np.linspace(0, np.pi * 2, num=10)
show_ndarray(x)
argument type is <class 'numpy.ndarray'>
number of dimensions: 1
shape is (10,)
dtype is float64
content is
[ 0.          0.6981317   1.3962634   2.0943951   2.7925268   3.4906585
  4.1887902   4.88692191  5.58505361  6.28318531]

In [6]:
x = np.logspace(0, 2, num=10)
show_ndarray(x)
argument type is <class 'numpy.ndarray'>
number of dimensions: 1
shape is (10,)
dtype is float64
content is
[   1.            1.66810054    2.7825594     4.64158883    7.74263683
   12.91549665   21.5443469    35.93813664   59.94842503  100.        ]

Arrays initialized with zeros or ones are available.

In [7]:
x = np.zeros((2, 3))
show_ndarray(x)
argument type is <class 'numpy.ndarray'>
number of dimensions: 2
shape is (2, 3)
dtype is float64
content is
[[ 0.  0.  0.]
 [ 0.  0.  0.]]

In [8]:
x = np.ones((2, 3), dtype=int)
show_ndarray(x)
argument type is <class 'numpy.ndarray'>
number of dimensions: 2
shape is (2, 3)
dtype is int64
content is
[[1 1 1]
 [1 1 1]]

There are other related functions, but we won't go through them all.
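For reference, here is a quick sketch of a few of them (np.full, np.eye, and np.zeros_like; see the numpy documentation for the rest):

```python
import numpy as np

filled = np.full((2, 3), 7.5)        # array filled with an arbitrary constant
ident = np.eye(3)                    # 3x3 identity matrix
zlike = np.zeros_like(np.arange(4))  # zeros matching another array's shape and dtype
print(filled)
print(ident)
print(zlike)
```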

A Python sequence such as a list can be used to create and initialize an ndarray using the array function; a list of lists corresponds to a 2-D array, etc.

In [9]:
x = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)
show_ndarray(x)
argument type is <class 'numpy.ndarray'>
number of dimensions: 2
shape is (2, 3)
dtype is float64
content is
[[ 1.  2.  3.]
 [ 4.  5.  6.]]

In the examples above we have introduced 6 numpy functions that can be used to create an ndarray from scratch; we have shown how the kind of number (the dtype; more about that later in this notebook) stored in the array can be specified if one does not want the default; and we have shown that ndarrays have attributes (attached variables) that provide information about the array.

Mathematical operations

Unlike Matlab, numpy defines array operations as element-wise by default; for example, multiplication in Matlab is matrix multiplication, but in numpy it is the multiplication of corresponding elements, equivalent to ".*" in Matlab.

In [10]:
x = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)
y = np.array([[3, 2, 1], [6, 5, 4]], dtype=float)
print(x + y)
[[  4.   4.   4.]
 [ 10.  10.  10.]]

In [11]:
print(x * y)
[[  3.   4.   3.]
 [ 24.  25.  24.]]

In [12]:
print(x / y)
[[ 0.33333333  1.          3.        ]
 [ 0.66666667  1.          1.5       ]]

In [13]:
print(x ** 2)
[[  1.   4.   9.]
 [ 16.  25.  36.]]

Matrix multiplication is available, of course. Using x and y from the cells above, we need to transpose one or the other in order to be able to do a matrix multiplication, since both are 2x3. We can use the transpose() method of the ndarray, or its .T abbreviation. The simplest numpy function for matrix multiplication is np.dot().

In [14]:
print(np.dot(x, y.T))
[[ 10.  28.]
 [ 28.  73.]]

x and y each have 2 rows and 3 columns, so the matrix product above is a 2x2 array.

In [15]:
print(np.dot(x.transpose(), y))
[[ 27.  22.  17.]
 [ 36.  29.  22.]
 [ 45.  36.  27.]]

Transposing x instead gives a 3x3 array.
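As an aside, if you are on Python 3.5 or later with a correspondingly recent numpy, the @ operator (from PEP 465) performs the same matrix multiplication as np.dot() on 2-D arrays:

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)
y = np.array([[3, 2, 1], [6, 5, 4]], dtype=float)
# The @ operator is matrix multiplication, like np.dot for 2-D arrays:
print(x @ y.T)  # same result as np.dot(x, y.T)
```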

Indexing

Numpy indexing extends normal Python indexing to multiple dimensions, and provides additional modes. Here, as throughout this tutorial, we will illustrate some of the basics without trying to cover all possibilities. For a more thorough description, see http://docs.scipy.org/doc/numpy/user/basics.indexing.html and http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html.

To make a 2-D array for demonstration purposes, we will start with a 1-D array and reshape it to 2-D:

In [16]:
xtest = np.arange(12)
xtest = xtest.reshape((3, 4))
print(xtest)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

Simple indexing is done with combinations of single indices and slices. Notice how indexing with a slice makes no change in the dimensionality, but indexing with a single index removes a dimension.

In [17]:
# Extract a single element from a 2-D array:
print(xtest[1, 2])
6

In [18]:
# Slice 1:2 for the row, scalar index 2 for the
# column, so we are left with a 1-D array:
print(xtest[1:2, 2])
[6]

In [19]:
# Scalar index 1 for the row, slice ::2 taking
# every other column, so again we have a 1-D array:
print(xtest[1, ::2])
[4 6]

In [20]:
# Slice for all rows but the first, and slice
# for the first two columns, so we have a 2-D array:
print(xtest[1:, :2])
[[4 5]
 [8 9]]

As in ordinary Python sequences, a slice object can be used in place of the colon notation. (See http://docs.python.org/2.7/library/functions.html?highlight=slice#slice for a description of the slightly unintuitive slice() function signature.) Slice objects are surprisingly handy, for example, when one needs to slice several arrays in the same way.

In [21]:
rowsl = slice(1, None)  # equiv to 1:
colsl = slice(2)  # equiv to :2
print(xtest[1:, :2])
print("is the same as")
print(xtest[rowsl, colsl])
[[4 5]
 [8 9]]
is the same as
[[4 5]
 [8 9]]

In [22]:
# slice objects can use negative indices:
colsl = slice(-2)  # equiv to :-2
print(xtest[rowsl, colsl])
[[4 5]
 [8 9]]

Rows of a 2-D array can be selected using only the row indices:

In [23]:
print(xtest[1])     # second row
print(xtest[1, :])  # same thing
[4 5 6 7]
[4 5 6 7]

In [24]:
print(xtest[-2:])  # last two rows
[[ 4  5  6  7]
 [ 8  9 10 11]]

Again, notice that any time a single index is used, the dimensionality is reduced by one; but slicing preserves dimensionality. If one wanted a 2-D array containing only a single row of xtest, one could use a slice instead of an index:

In [25]:
print(xtest[1:2])
[[4 5 6 7]]

Selecting columns can be done like this:

In [26]:
print(xtest[:, 1])
[1 5 9]

The result is a 1-D array, because we indexed a 2-D array with a single index. There is no such thing as a 1-D column versus a 1-D row; it's just a 1-D array.

Of course, if we slice the rows and the columns, we get another 2-D array:

In [27]:
print(xtest[:, -2:])
[[ 2  3]
 [ 6  7]
 [10 11]]

In [28]:
print(xtest[..., -2:])
[[ 2  3]
 [ 6  7]
 [10 11]]

The ellipsis in the last example takes the place of any number of dimensions to the left, while each colon expression (or slice object) applies to a single dimension.
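With a 2-D array the ellipsis saves little typing, but with more dimensions it is handy; a small sketch with a 3-D array:

```python
import numpy as np

cube = np.arange(24).reshape((2, 3, 4))
# cube[..., -1] is shorthand for cube[:, :, -1]: the last element
# along the final axis, for every combination of the leading axes.
last = cube[..., -1]
print(last.shape)  # (2, 3)
print(last)
```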

Indexing can be used on the left of an assignment to set selected rows, columns, or values:

In [29]:
xtest1 = xtest.copy()
xtest1[1, 2] = 100
print(xtest1)
[[  0   1   2   3]
 [  4   5 100   7]
 [  8   9  10  11]]

In [30]:
xtest1[-1] = 33  # sets the last row
print(xtest1)
[[  0   1   2   3]
 [  4   5 100   7]
 [ 33  33  33  33]]

In [31]:
xtest1[:, -1] = [99, 88, 77]  # last column
print(xtest1)
[[  0   1   2  99]
 [  4   5 100  88]
 [ 33  33  33  77]]

Broadcasting

Suppose we want to subtract the row mean from each row, or the column mean from each column:

In [32]:
xm = xtest.mean(axis=1)
print("Row means are: ", xm)
print("xtest with row means subtracted out:")
print(xtest - xm[:, np.newaxis])
Row means are:  [ 1.5  5.5  9.5]
xtest with row means subtracted out:
[[-1.5 -0.5  0.5  1.5]
 [-1.5 -0.5  0.5  1.5]
 [-1.5 -0.5  0.5  1.5]]

First, we take the mean of each row by supplying the keyword argument (typically abbreviated as "kwarg") axis=1, so that the mean is taken by summing over the column index, which is 1 because the numbering starts from zero. This gives a 1-D array with a mean for each row.

To subtract this row-mean from each element of its row, we need to turn the 1-D array of means into a column and effectively replicate it across the columns of the original array. That is what the indexing syntax

[:, np.newaxis] 

does above. It is very efficient because it does not actually have to generate this as a new rectangular array--it just causes the calculation to be done as if such an array had been made.

In [33]:
print(xtest - xtest.mean(axis=0))  # Broadcasting:
   # equivalent to xtest - xtest.mean(axis=0)[np.newaxis, :]
[[-4. -4. -4. -4.]
 [ 0.  0.  0.  0.]
 [ 4.  4.  4.  4.]]

We take the mean of each column by summing over the row index, hence the kwarg axis=0.

To subtract the column-mean, we could use the indexing syntax

[np.newaxis, :]

but we don't need to. This is done automatically by numpy. It is called broadcasting. In an arithmetic operation between two ndarrays, if the arrays differ in shape only insofar as one of them is missing one or more dimensions on the left, or has a dimension of length one where the other's is longer, then the missing or length-one dimensions are effectively expanded by replication--again, not by making a new array, but by making the calculation behave as if such an array had been made.
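As a sketch of the rule about dimensions missing on the left: a shape-(4,) array combined with a shape-(3, 4) array is treated as if it were shape (1, 4), and then replicated down the rows:

```python
import numpy as np

a = np.arange(12).reshape((3, 4))   # shape (3, 4)
b = np.array([10, 20, 30, 40])      # shape (4,): treated as (1, 4), then (3, 4)
total = a + b                       # b is added to every row of a
print(total)
```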

In recent versions of numpy, some functions and methods such as mean have a handy kwarg, keepdims, that keeps the reduced dimension as length one, eliminating the need for explicit indexing with np.newaxis to restore it. See if the following works with the version of numpy you are using:

In [34]:
# Remove the row-mean via indexing:
print(xtest - xtest.mean(axis=1)[:, np.newaxis])
print("is the same as")
# Use the new kwarg instead of indexing:
print(xtest - xtest.mean(axis=1, keepdims=True))
print("if your version of numpy is not too old")
[[-1.5 -0.5  0.5  1.5]
 [-1.5 -0.5  0.5  1.5]
 [-1.5 -0.5  0.5  1.5]]
is the same as
[[-1.5 -0.5  0.5  1.5]
 [-1.5 -0.5  0.5  1.5]
 [-1.5 -0.5  0.5  1.5]]
if your version of numpy is not too old

Simple indexing returns a view, not a copy

Numpy is designed to use computer resources efficiently, both for speed and to permit calculations with very large arrays. One of the efficiency strategies is to avoid copying arrays unless it is really necessary. Broadcasting and the use of np.newaxis as discussed above is one example. Another is illustrated next. The basic idea is that some operations with a part of an array, or with a whole array but rearranged in some fashion (e.g., transposed), do not require copying the numbers in the array into a new block of memory; instead, one can simply view them differently, or view a subset of them. This leads to two alternatives: a function or method can copy an ndarray, or it can create a new view of part or all of an array.
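One way to tell the two cases apart is the base attribute of the result: for a view it refers to the array that owns the memory, while for a copy it is None. A quick sketch (using an array that owns its own memory):

```python
import numpy as np

x = np.array([[0, 1, 2], [3, 4, 5]])  # this array owns its memory
v = x[:, :2]           # simple slicing: a view into x
c = x[:, :2].copy()    # an explicit copy
print(v.base is x)     # True: v is a window onto x's data
print(c.base is None)  # True: c has its own data
v[0, 0] = 99
print(x[0, 0])         # 99: writing through the view changed x
```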

For the following examples, we will first make a new test array by copying an earlier one, so that changes to the new array will leave the original unchanged:

In [35]:
xtest2 = xtest.copy()
print(xtest2)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

Next, we will use slicing to select every other element of xtest2 along each axis, starting from index 1 (the odd-numbered rows and columns).

In [36]:
xtest2view_odds = xtest2[1::2, 1::2]
print("xtest2view_odds originally contains:")
print(xtest2view_odds)
xtest2view_odds[:] = 200
print("after assigning 200 to each element of xtest2view_odds,")
print("xtest2 is also changed; it now contains:")
print(xtest2)
xtest2view_odds originally contains:
[[5 7]]
after assigning 200 to each element of xtest2view_odds,
xtest2 is also changed; it now contains:
[[  0   1   2   3]
 [  4 200   6 200]
 [  8   9  10  11]]

xtest2view_odds is a view of xtest2, exposing the elements with odd row and column number. When we set all of the elements of xtest2view_odds to 200, we are setting those selected elements in xtest2 itself. If instead we want a new array that is initialized with the odd rows and columns of xtest2, then we have to explicitly make a copy:

In [37]:
# remake xtest2
xtest2 = xtest.copy()
print("xtest2 is initially:")
print(xtest2, "\n")  # original

xtest2_odds = xtest2[1::2, 1::2].copy()  # here!
print("xtest2_odds is initially:")
print(xtest2_odds)
xtest2_odds[:] = 400
print("after assigning 400 to xtest2_odds it is:")
print(xtest2_odds, "\n")
print("but xtest2 remains unchanged:")
print(xtest2)  # (unchanged)
print("because this time xtest2_odds started as a copy,")
print("not a view into xtest2")
xtest2 is initially:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]] 

xtest2_odds is initially:
[[5 7]]
after assigning 400 to xtest2_odds it is:
[[400 400]] 

but xtest2 remains unchanged:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
because this time xtest2_odds started as a copy,
not a view into xtest2

Advanced ("fancy") indexing

Numpy's fancy indexing is capable of rather complicated operations, but in more typical use it is straightforward. Here we will stick to the simple uses.

The distinction between simple and fancy indexing is based on the contents of the indexing expression.

  • If that expression (actually a tuple inside the square brackets, but the round parentheses are usually omitted) contains only single indices and/or slices, then simple indexing, returning a view into the original array, is in effect.
  • If any element of the indexing tuple is a list or ndarray, however, the indexing is fancy, and a copy, not a view, is returned.

The behavior is easy to understand so long as only one element of the indexing tuple is a list or ndarray. This is the case to which we will restrict ourselves.

First, let's verify that indexing actually is being done with a tuple--we can even build the tuple separately, bind it to a name, and pass it in, like this:

In [38]:
ind_tup = (slice(1, None, 2), slice(1, None, 2))
print("indexing with a pre-built tuple, xtest[ind_tup]:")
print(xtest[ind_tup])
print("is equivalent to using colon notation, xtest[1::2, 1::2]:")
print(xtest[1::2, 1::2])
indexing with a pre-built tuple, xtest[ind_tup]:
[[5 7]]
is equivalent to using colon notation, xtest[1::2, 1::2]:
[[5 7]]

Now let's use fancy indexing with a list of column indices to pull out the first and last column, forming a new array, not a view into xtest2:

In [39]:
xtest2 = xtest.copy()
xfirstlast = xtest2[:, [0, -1]]
print(xfirstlast)
[[ 0  3]
 [ 4  7]
 [ 8 11]]

Above we used

[:, [0, -1]]

to take all of the rows (because of the first colon--a slice) and the first and last column (because our list of column indices is [0, -1]). The inclusion of an index list triggers the fancy indexing behavior, so a copy, not a view, is returned.

Next, we will set all of the elements of this new array, xfirstlast, to 200. Finally, by printing the original array, xtest2, we verify that its contents were not changed when we assigned new values to xfirstlast.

In [40]:
xfirstlast[:] = 200
print("xfirstlast after filling with 200:")
print(xfirstlast, "\n")
print("xtest2 was unchanged, because xfirstlast is not a view:")
print(xtest2)
xfirstlast after filling with 200:
[[200 200]
 [200 200]
 [200 200]] 

xtest2 was unchanged, because xfirstlast is not a view:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

If we want to modify the contents of selected rows, columns, or elements of xtest2, we can do that using fancy indexing on the left side of an assignment. Let's set the first and last column to 500:

In [41]:
xtest2[:, [0, -1]] = 500
print(xtest2)
[[500   1   2 500]
 [500   5   6 500]
 [500   9  10 500]]

Boolean fancy indexing

Fancy indexing can be done with Boolean arrays; again, on the left of an assignment it allows setting selected rows, columns, or elements, and on the right of an assignment it returns a copy, not a view.

First, make a 1-D Boolean array that is True for each column in which the top element (that is, from the first row) equals 500. Call that array jbig. Then use this to select the corresponding columns, and make a new array, xtest2big, consisting of those columns:

In [42]:
jbig = xtest2[0] == 500
xtest2big = xtest2[:, jbig]
print(xtest2big)
[[500 500]
 [500 500]
 [500 500]]

Next, illustrate fancy indexing with a single 1-D Boolean array (jbig) on the left by changing the values in the first and last column of xtest2 to 400:

In [43]:
xtest2[:, jbig] = 400
print(xtest2, "\n")
[[400   1   2 400]
 [400   5   6 400]
 [400   9  10 400]] 


Assigning elements based on a logical condition

A single Boolean array used as an index on the left of an assignment lets us assign to the elements selected by a logical condition, just as one would expect.

For example, let's make a new 2-D array of sequential integers, and then invert the sign of all the even integers:

In [44]:
xtest3 = np.arange(15).reshape((3, 5))
print(xtest3, "\n")

cond = xtest3 % 2 == 0
# Now cond is a Boolean array with the same shape as xtest3,
# with a True value wherever xtest3 is even.
xtest3[cond] = - xtest3[cond]
print(xtest3)
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]] 

[[  0   1  -2   3  -4]
 [  5  -6   7  -8   9]
 [-10  11 -12  13 -14]]

Combining logical operations

Owing to a deep-seated aspect of Python's design--the and and or operators cannot be overloaded to work element-wise--combining numpy logical conditions to make a single Boolean array for indexing can't be done in quite the way one would like; instead, either a verbose form using a function, or a little hack, is needed. First, here is the verbose form, in which one uses np.logical_or() or np.logical_and() to combine Boolean arrays:

In [45]:
xtest4 = np.arange(15).reshape((3, 5))
cond1 = xtest4 > 10
cond2 = xtest4 < 2
cond = np.logical_or(cond1, cond2)
xtest4[cond] = 100
print("We selected all elements less than 2 or greater than 10")
print("and changed them to 100:")
print(xtest4)
We selected all elements less than 2 or greater than 10
and changed them to 100:
[[100 100   2   3   4]
 [  5   6   7   8   9]
 [ 10 100 100 100 100]]

Now the hack, which takes advantage of the fact that Boolean values are actually 1 for True and 0 for False, and uses the bit-wise operators, | and &. The problem with this, other than that it feels like cheating--it's really a trick--is that these bit-wise operators bind more tightly than the comparison operators, unlike proper logical operators, so one must use parentheses to make the operations group correctly.

In [46]:
xtest4 = np.arange(15).reshape((3, 5))
cond = (xtest4 < 2) | (xtest4 > 10)
xtest4[cond] = 100
print(xtest4)
[[100 100   2   3   4]
 [  5   6   7   8   9]
 [ 10 100 100 100 100]]
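A sketch of what goes wrong without the parentheses: | binds more tightly than < and >, so the expression becomes a chained comparison, and Python then needs the truth value of a whole Boolean array--which numpy refuses to provide:

```python
import numpy as np

xtest4 = np.arange(15).reshape((3, 5))
try:
    cond = xtest4 < 2 | xtest4 > 10  # parsed as xtest4 < (2 | xtest4) > 10
except ValueError as err:
    # The chained comparison implies "and" between two Boolean arrays,
    # which requires a single truth value, hence the error.
    print("ValueError:", err)
```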

Notice that in all the examples above, the cond array, used as the sole indexing object for the xtest3 or xtest4 array, has the same shape as the array it is indexing, and has Boolean dtype:

In [47]:
print("characteristics of xtest4:")
show_ndarray(xtest4)
print("\ncharacteristics of cond:")
show_ndarray(cond)
characteristics of xtest4:
argument type is <class 'numpy.ndarray'>
number of dimensions: 2
shape is (3, 5)
dtype is int64
content is
[[100 100   2   3   4]
 [  5   6   7   8   9]
 [ 10 100 100 100 100]]

characteristics of cond:
argument type is <class 'numpy.ndarray'>
number of dimensions: 2
shape is (3, 5)
dtype is bool
content is
[[ True  True False False False]
 [False False False False False]
 [False  True  True  True  True]]

Array sizes cannot be changed

In Matlab, arrays can be expanded simply by assigning to indices that are outside the original range; in numpy, the size of an array cannot be changed after the array is created. As we have seen, one can copy a subset of an array, or use slicing to access a view of a subarray; but instead of dynamically expanding an array, one must create a new one of the desired size and copy in the data from the original, or use concatenation to make a new array from two or more existing arrays with compatible shapes.

In the following, we start with an existing array, y, we create a larger array, ynew, and we fill the first part of ynew with the contents of y:

In [48]:
y = np.arange(5)
ynew = np.zeros((7,), dtype=int)
ynew[:5] = y
print(ynew)
[0 1 2 3 4 0 0]

Next, let's use concatenation to horizontally stack (np.hstack()) two 1-D arrays.

In [49]:
y_extra = np.array([10, 20])
ynew = np.hstack((y, y_extra))
print(ynew)
[ 0  1  2  3  4 10 20]

If we have two or more 1-D arrays of the same length, we can stack them vertically (np.vstack()) into a 2-D array.

In [50]:
yy = y[::-1]
ynew = np.vstack((y, yy))
print(ynew)
[[0 1 2 3 4]
 [4 3 2 1 0]]

Of course, concatenation can be used with multiple arrays of multiple dimensions; for more information and details, look up the documentation for np.hstack, np.vstack, and np.concatenate.
Note that in the examples above, all those parentheses are necessary, because the sequence of arrays to be concatenated must be given as a single argument, usually a tuple.
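Under the hood, np.hstack and np.vstack are conveniences for np.concatenate, which takes an explicit axis kwarg; a quick sketch:

```python
import numpy as np

a = np.zeros((2, 3), dtype=int)
b = np.ones((2, 3), dtype=int)
rows = np.concatenate((a, b), axis=0)  # join along rows: shape (4, 3), like vstack
cols = np.concatenate((a, b), axis=1)  # join along columns: shape (2, 6), like hstack
print(rows.shape, cols.shape)
```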

The dtype

An ndarray contains objects of a given type, recorded by the dtype attribute. The desired dtype can be set when the ndarray is created, and an array can be converted from one dtype to another. This is a fundamental aspect of numpy that provides great power and flexibility. The dtype attribute is actually a fairly complicated object in its own right, with several attributes, and the ability to specify not just a basic numeric type (like unsigned 16-bit integer, or single-precision floating point), but also a record structure with named fields that can have different types. For details, see the numpy reference documentation on dtypes.
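As a small illustration of the dtype object's own attributes (name, kind, and itemsize are a few of them):

```python
import numpy as np

dt = np.dtype(np.float64)
print(dt.name)      # 'float64'
print(dt.kind)      # 'f': the general kind (floating point)
print(dt.itemsize)  # 8: bytes per element
```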

When the array function is used to convert a Python sequence (possibly nested, as in a list of lists) into an ndarray, the dtype can be specified explicitly, or the function can be left to pick a dtype that is suitable for the given input.

In [51]:
x1 = np.array([1, 2, 2, 0])
print(x1.dtype)
int64

In [52]:
x2 = np.array([1, 2, 2.0, 0])
print(x2.dtype)
float64

Notice that in the examples above the dtype determination was based on the type of object being supplied, not the value. A single float was sufficient to cause selection of the float64 dtype.

Often it is better to specify the desired dtype at the outset, e.g.:

In [53]:
x3 = np.array([1, 2, 2, 0], dtype=float)
print(x3.dtype)
float64

A heterogeneous array can be created by specifying the most general dtype, object. The result is similar to a Matlab cell array.

In [54]:
x4 = np.array(["a", (1,2), [3,4]], dtype=object)
print(x4.dtype)
object

When mathematical operations involve arguments with different dtypes, numpy tries to pick an output dtype that is as safe as possible--that is, least likely to lose information. Here is a simple example. Adding int64 to float64 (double precision) yields the latter.

In [55]:
x3 = x1 + x2
print(x3.dtype)
float64

If we have structured information (e.g., conceptually a table of records with named fields), we can use a structured dtype:

In [56]:
info_type = np.dtype([('name', 'S20'), ('height', np.float32), ('weight', np.int16)])
data = [("Henry", 6.1, 185), ("Harriet", 5.7, 132)]
x5 = np.empty((2,), dtype=info_type)
x5[:] = data
print(x5[1])
(b'Harriet', 5.699999809265137, 132)

Above, we printed the second record (index 1 into the array).

Below, we print the 'weight' from each record; since we have two records, we get two weights.

In [57]:
print(x5['weight'])
[185 132]

Sometimes it is more convenient to use attribute access style to specify a field. For this we can view the structured array as a subclass called a recarray:

In [58]:
x5rec = x5.view(np.recarray)
print("x5rec.name is:", x5rec.name)
print("x5['name'] is:", x5['name'])
print("\nx5rec.weight[-1] is:", x5rec.weight[-1])
print("x5['weight'][-1] is:", x5['weight'][-1])
x5rec.name is: [b'Henry' b'Harriet']
x5['name'] is: [b'Henry' b'Harriet']

x5rec.weight[-1] is: 132
x5['weight'][-1] is: 132
