Python for Data Analysis - CH.4 NumPy Basics - Arrays and Vectorized Computation

NumPy is short for Numerical Python. Most computational packages providing scientific functionality use NumPy’s array objects as the [[lingua franca]] for data exchange.

Because NumPy provides an easy-to-use C API, it is straightforward to pass data to external libraries written in a low-level language and also for external libraries to return data to Python as NumPy arrays. This feature has made Python a language of choice for wrapping legacy C/C++/Fortran codebases and giving them a dynamic and easy-to-use interface.

One of the reasons NumPy is so important for numerical computations in Python is because it is designed for efficiency on large arrays of data. There are a number of reasons for this:

• NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy’s library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.

• NumPy operations perform complex computations on entire arrays without the need for Python for loops (the quick timing comparison below illustrates the difference).
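
A rough, illustrative sketch of that comparison (exact timings vary by machine and NumPy version): multiply a million-element sequence by 2 ten times, once with a NumPy array and once with a list comprehension.

import time
import numpy as np
my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))
start = time.perf_counter()
for _ in range(10):
    my_arr2 = my_arr * 2  # vectorized: the loop runs in C inside NumPy
numpy_seconds = time.perf_counter() - start
start = time.perf_counter()
for _ in range(10):
    my_list2 = [x * 2 for x in my_list]  # pure-Python loop over every element
list_seconds = time.perf_counter() - start
print(f"NumPy: {numpy_seconds:.3f}s  list: {list_seconds:.3f}s")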

4.1 The NumPy [[ndarray]]: A Multidimensional Array Object

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.

An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type.

Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array:

import numpy as np
data = np.random.randn(2, 3)
data
data * 10
data + data
data.shape
data.dtype

Creating ndarrays

The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data.

data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1
# Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2
arr2.ndim
arr2.shape
# Unless explicitly specified (more on this later), np.array tries to infer a good data type for the array that it creates.
arr1.dtype
arr2.dtype
# In addition to np.array, there are a number of other functions for creating new arrays.
np.zeros(10)
np.zeros((3, 6))
np.empty((2, 3, 2))
np.arange(15)

It’s not safe to assume that np.empty will return an array of all zeros. In some cases, it may return uninitialized “garbage” values.

Data Types for ndarrays

The data type or dtype is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data:

arr1 = np.array([1, 2, 3], dtype=np.float64)
arr2 = np.array([1, 2, 3], dtype=np.int32)

dtypes are a source of NumPy’s flexibility for interacting with data coming from other systems. In most cases they provide a mapping directly onto an underlying disk or memory representation, which makes it easy to read and write binary streams of data to disk and also to connect to code written in a low-level language like C or Fortran.

The numerical dtypes are named the same way: a type name, like float or int, followed by a number indicating the number of bits per element.
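
As a quick check, a dtype’s itemsize attribute reports the number of bytes per element, so 64 bits is 8 bytes and 32 bits is 4:

np.dtype(np.float64).itemsize  # 8
np.dtype(np.int32).itemsize  # 4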

You can explicitly convert or cast an array from one dtype to another using ndarray’s astype method:

arr = np.array([1, 2, 3, 4, 5])
arr.dtype
float_arr = arr.astype(np.float64)
float_arr.dtype
# If I cast some floating-point numbers to be of integer dtype, the decimal part will be truncated:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr.astype(np.int32)
arr

If you have an array of strings representing numbers, you can use astype to convert them to numeric form:

numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.string_)
numeric_strings.astype(float)

It’s important to be cautious when using the numpy.string_ type, as string data in NumPy is fixed size and may truncate input without warning. pandas has more intuitive out-of-the-box behavior on non-numeric data.
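
As a small illustration of that fixed-size behavior, creating an array with a 3-byte string dtype silently truncates longer input:

np.array(['python'], dtype='S3')  # array([b'pyt'], dtype='|S3')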

If casting were to fail for some reason (like a string that cannot be converted to float64), a ValueError will be raised.
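
For instance (an illustrative sketch), including a string that cannot be parsed as a number causes the cast to fail:

np.array(['1.25', 'banana']).astype(float)  # raises ValueError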

Calling astype always creates a new array (a copy of the data), even if the new dtype is the same as the old dtype.
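
A minimal demonstration of that copying behavior: casting to the array's own dtype still yields an independent array, so modifying the result leaves the original untouched.

int_array = np.arange(10)
same_dtype = int_array.astype(int_array.dtype)
same_dtype[0] = 99  # only the copy changes
int_array[0]  # still 0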

Arithmetic with NumPy Arrays

Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this [[vectorization]]. Any arithmetic operation between equal-size arrays applies the operation element-wise:

propagate - v. to spread or disseminate (ideas, beliefs, etc.); to transmit (motion, light, sound, etc.); (of animals and plants) to reproduce or breed

arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr * arr
arr - arr
# Arithmetic operations with scalars propagate the scalar argument to each element in the array
1 / arr
arr ** 0.5
# Comparisons between arrays of the same size yield boolean arrays
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr2 > arr

Evaluating operations between differently sized arrays is called [[broadcasting]].
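
Broadcasting is covered in depth later; as a minimal taste, reusing the 2 × 3 arr above, a length-3 one-dimensional array is stretched across each row:

arr + np.array([100., 200., 300.])  # adds the 1D array to every row of arr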

Basic Indexing and Slicing

One-dimensional arrays are simple; on the surface they act similarly to Python lists:

arr = np.arange(10)
arr[5]
arr[5:8]
arr[5:8] = 12
arr

As you can see, if you assign a scalar value to a slice, as in arr[5:8] = 12, the value is propagated (or broadcast; see [[broadcasting]]) to the entire selection.

An important first distinction from Python’s built-in lists is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array.

If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array—for example, arr[5:8].copy().
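
A minimal demonstration, continuing with the arr defined above: changing data through a slice changes the source array, while an explicit copy does not.

arr_slice = arr[5:8]
arr_slice[1] = 12345  # the change is visible in arr, because arr_slice is a view
arr
arr_slice_copy = arr[5:8].copy()
arr_slice_copy[:] = 0  # arr is left unchanged
arr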


With higher dimensional arrays, you have many more options. In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d[2]
arr2d[0, 2]

In multidimensional arrays, if you omit later indices, the returned object will be a lower dimensional ndarray consisting of all the data along the higher dimensions:

arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
arr3d
arr3d[0]
# Both scalar values and arrays can be assigned to arr3d[0]:
old_values = arr3d[0].copy()
arr3d[0] = 42
arr3d
arr3d[0] = old_values

Indexing with slices

Like one-dimensional objects such as Python lists, ndarrays can be sliced with the familiar syntax:

arr
arr[1:6]

Consider the two-dimensional array from before, arr2d. Slicing this array is a bit different: the expression arr2d[:2] slices along axis 0, the first axis. A slice, therefore, selects a range of elements along an axis. It can be helpful to read arr2d[:2] as “select the first two rows of arr2d.”

arr2d
arr2d[:2]
# You can pass multiple slices just like you can pass multiple indexes:
arr2d[:2, 1:]

When slicing like this, you always obtain array views of the same number of dimensions. By mixing integer indexes and slices, you get lower dimensional slices.
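
For example, reusing arr2d: an integer index on one axis combined with a slice on the other returns a one-dimensional view, and a lone colon takes the entire axis.

arr2d[1, :2]  # second row, first two columns -> array([4, 5])
arr2d[:2, 2]  # first two rows, third column -> array([3, 6])
arr2d[:, :1]  # entire first axis, first column kept 2D -> shape (3, 1)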

Boolean Indexing

Example: suppose we have an array of names with duplicates and a corresponding array of randomly generated data (shown in the code below).

Like arithmetic operations, comparisons (such as ==) with arrays are also vectorized.

This boolean array can be passed when indexing the array. The boolean array must be of the same length as the array axis it’s indexing.

You can even mix and match boolean arrays with slices or integers (or sequences of integers; more on this later).

In older NumPy versions, boolean selection did not fail if the boolean array was not the correct length (newer versions raise an IndexError), so I recommend care when using this feature.

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7, 4)
names == 'Bob'
data[names == 'Bob']
# In these examples, I select from the rows where names == 'Bob' and index the columns, too:
data[names == 'Bob', 2:]
# To select everything but 'Bob', you can either use != or negate the condition using ~:
names != 'Bob'
data[~(names == 'Bob')] # The ~ operator can be useful when you want to invert a general condition
# To select two of the three names and combine multiple boolean conditions, use boolean arithmetic operators like & (and) and | (or):
mask = (names == 'Bob') | (names == 'Will')
mask

Selecting data from an array by boolean indexing always creates a copy of the data, even if the returned array is unchanged.

The Python keywords and and or do not work with boolean arrays. Use & (and) and | (or) instead.
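
An illustrative sketch of what goes wrong, reusing the names array above: the keyword version raises an error, while the & operator combines the masks element-wise.

(names == 'Bob') and (names == 'Will')  # raises ValueError: truth value of an array is ambiguous
(names == 'Bob') & (names == 'Will')  # element-wise combination works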

Setting values with boolean arrays works in a common-sense way.

data[data < 0] = 0
data[names != 'Joe'] = 7

Fancy Indexing

[[Fancy indexing]] is a term adopted by NumPy to describe indexing using integer arrays.

arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i
# To select out a subset of the rows in a particular order, you can simply pass a list or ndarray of integers specifying the desired order:
arr[[4, 3, 0, 6]]
# Passing multiple index arrays selects a one-dimensional array of elements corresponding to each tuple of indices:
arr = np.arange(32).reshape((8, 4))
arr[[1, 5, 7, 2], [0, 3, 1, 2]]

Regardless of how many dimensions the array has, the result of fancy indexing with one integer array per axis is always one-dimensional.

The behavior of fancy indexing in this case is a bit different from what some users might have expected, which is the rectangular region formed by selecting a subset of the matrix’s rows and columns:

# A way to get that:
arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]

Keep in mind that fancy indexing, unlike slicing, always copies the data into a new array.
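
A brief demonstration with the arange(32) array above: modifying the result of a fancy-indexing expression leaves the original untouched.

fancy = arr[[1, 5, 7, 2]]
fancy[0, 0] = 1000  # changes only the copy
arr[1, 0]  # still 4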

Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything.

Arrays have the transpose method and also the special T attribute:

arr = np.arange(15).reshape((3, 5))
arr.T

When doing matrix computations, you may do this very often—for example, when computing the inner matrix product using np.dot:

arr = np.random.randn(6, 3)
np.dot(arr.T, arr)

For higher dimensional arrays, transpose will accept a tuple of axis numbers to permute the axes (for extra mind bending):

arr = np.arange(16).reshape((2, 2, 4))
arr.transpose((1, 0, 2))
# Here, the axes have been reordered with the second axis first, the first axis second, and the last axis unchanged.

Simple transposing with .T is a special case of swapping axes. ndarray has the method swapaxes, which takes a pair of axis numbers and switches the indicated axes to rearrange the data:

arr.swapaxes(1, 2)

4.2 Universal Functions: Fast Element-Wise Array Functions

[[ufunc]]

A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.

arr = np.arange(10)
np.sqrt(arr)
np.exp(arr)
# These are referred to as `unary` ufuncs.
# Others, such as add or maximum, take two arrays (thus, `binary ufuncs`) and return a single array as the result:
x = np.random.randn(8)
y = np.random.randn(8)
np.maximum(x, y)
# While not common, a ufunc can return multiple arrays. modf is one example, a vectorized version of the built-in Python divmod; it returns the fractional and integral parts of a floating-point array:
arr = np.random.randn(7) * 5
remainder, whole_part = np.modf(arr)

Ufuncs accept an optional out argument that allows them to operate in-place on arrays:

arr
np.sqrt(arr)
np.sqrt(arr, arr)

4.3 [[Array-Oriented Programming]] with Arrays

Using NumPy arrays enables you to express many kinds of data processing tasks as concise array expressions that might otherwise require writing loops.

This practice of replacing explicit loops with array expressions is commonly referred to as [[vectorization]].

points = np.arange(-5, 5, 0.01)
xs, ys = np.meshgrid(points, points)
# Now, evaluating the function is a matter of writing the same expression you would write with two points:
z = np.sqrt(xs ** 2 + ys ** 2)
import matplotlib.pyplot as plt
plt.imshow(z, cmap=plt.cm.gray); plt.colorbar()
plt.title("Image plot of $\sqrt{x^2 + y^2}$ for a grid of values")

Expressing Conditional Logic as Array Operations

The numpy.where function is a vectorized version of the ternary expression x if condition else y. Suppose we had a boolean array and two arrays of values:

[[numpy.where]]

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])
# Suppose we wanted to take a value from xarr whenever the corresponding value in cond is True, and otherwise take the value from yarr.
result = np.where(cond, xarr, yarr)
# The second and third arguments to np.where don’t need to be arrays; one or both of them can be scalars

A typical use of where in data analysis is to produce a new array of values based on another array. Suppose you had a matrix of randomly generated data and you wanted to replace all positive values with 2 and all negative values with –2. This is very easy to do with np.where:

arr = np.random.randn(4, 4)
arr > 0
np.where(arr > 0, 2, -2)

You can combine scalars and arrays when using np.where. For example, you can replace only the positive values in arr with the constant 2 while leaving the other values unchanged.
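
A minimal sketch of that, reusing the 4 × 4 arr from just above:

np.where(arr > 0, 2, arr)  # set only the positive values to 2; keep the rest as-is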

Mathematical and Statistical Methods

A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible as methods of the array class. You can use aggregations (often called reductions) like sum, mean, and std (standard deviation) either by calling the array instance method or using the top-level NumPy function.

arr = np.random.randn(5, 4)
arr.mean()
np.mean(arr)
arr.sum()
# Functions like mean and sum take an optional axis argument that computes the statistic over the given axis, resulting in an array with one fewer dimension:
arr.mean(axis=1)
arr.sum(axis=0)
# Other methods like cumsum and cumprod do not aggregate, instead producing an array of the intermediate results:
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])
arr.cumsum()
# In multidimensional arrays, accumulation functions like cumsum return an array of the same size, but with the partial aggregates computed along the indicated axis according to each lower dimensional slice:
arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
arr.cumsum(axis=0)
arr.cumprod(axis=1)

Methods for Boolean Arrays

Boolean values are coerced to 1 (True) and 0 (False) in the preceding methods. Thus, sum is often used as a means of counting True values in a boolean array.

There are two additional methods, any and all, useful especially for boolean arrays. any tests whether one or more values in an array is True, while all checks if every value is True. These methods also work with non-boolean arrays, where non-zero elements evaluate to True.

arr = np.random.randn(100)
(arr > 0).sum()
bools = np.array([False, False, True, False])
bools.any()
bools.all()

Sorting

Like Python’s built-in list type, NumPy arrays can be sorted in-place with the sort method.

You can sort each one-dimensional section of values in a multidimensional array in-place along an axis by passing the axis number to sort:

arr = np.random.randn(6)
arr.sort()
arr = np.random.randn(5, 3)
arr.sort(1)

The top-level function np.sort returns a sorted copy of an array instead of modifying the array in-place. A quick-and-dirty way to compute the quantiles of an array is to sort it and select the value at a particular rank:

large_arr = np.random.randn(1000)
large_arr.sort()
large_arr[int(0.05 * len(large_arr))]

Unique and Other Set Logic

NumPy has some basic set operations for one-dimensional ndarrays. A commonly used one is np.unique, which returns the sorted unique values in an array:

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.unique(names)
ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
np.unique(ints)
# Contrast np.unique with the pure Python alternative:
sorted(set(names))

Another function, np.in1d, tests membership of the values in one array in another, returning a boolean array:

values = np.array([6, 0, 0, 3, 2, 5, 6])
np.in1d(values, [2, 3, 6])

4.4 File Input and Output with Arrays

NumPy is able to save and load data to and from disk either in text or binary format. In this section I only discuss NumPy’s built-in binary format, since most users will prefer [[pandas]] and other tools for loading text or tabular data.

np.save and np.load are the two workhorse functions for efficiently saving and loading array data on disk. Arrays are saved by default in an uncompressed raw binary format with file extension .npy:

arr = np.arange(10)
np.save('some_array', arr)
np.load('some_array.npy')
np.savez('array_archive.npz', a=arr, b=arr)
# When loading an .npz file, you get back a dict-like object that loads the individual arrays lazily
arch = np.load('array_archive.npz')
arch['b']
# If your data compresses well, you may wish to use numpy.savez_compressed instead:
np.savez_compressed('arrays_compressed.npz', a=arr, b=arr)

4.5 Linear Algebra

Linear algebra, like matrix multiplication, decompositions, determinants, and other square matrix math, is an important part of any array library. Unlike some languages like MATLAB, multiplying two two-dimensional arrays with * is an element-wise product instead of a matrix dot product. Thus, there is a function dot, both an array method and a function in the numpy namespace, for matrix multiplication:

x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])
x.dot(y)
np.dot(x, y)
# The @ symbol (as of Python 3.5) also works as an infix operator that performs matrix multiplication
x @ np.ones(3)

numpy.linalg has a standard set of matrix decompositions and things like inverse and determinant. These are implemented under the hood via the same industry-standard linear algebra libraries used in other languages like MATLAB and R, such as BLAS, LAPACK, or possibly (depending on your NumPy build) the proprietary Intel MKL (Math Kernel Library):

from numpy.linalg import inv, qr
X = np.random.randn(5, 5)
mat = X.T.dot(X)
inv(mat)
mat.dot(inv(mat))
q, r = qr(mat)
r

4.6 Pseudorandom Number Generation

The numpy.random module supplements the built-in Python random with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions.

We say that these are pseudorandom numbers because they are generated by an algorithm with deterministic behavior based on the seed of the random number generator. You can change NumPy’s random number generation seed using np.random.seed.

samples = np.random.normal(size=(4, 4))
np.random.seed(1234)
# The data generation functions in numpy.random use a global random seed. To avoid global state, you can use numpy.random.RandomState to create a random number generator isolated from others:
rng = np.random.RandomState(1234)
rng.randn(10)

4.7 Example: Random Walks

The simulation of random walks provides an illustrative application of array operations. Let’s first consider a simple random walk starting at 0 with steps of 1 and –1 occurring with equal probability.

# A pure Python way:
import random
position = 0
walk = [position]
steps = 1000
for i in range(steps):
    step = 1 if random.randint(0, 1) else -1
    position += step
    walk.append(position)
plt.plot(walk[:100])
# A NumPy way:
nsteps = 1000
draws = np.random.randint(0, 2, size=nsteps)
steps = np.where(draws > 0, 1, -1)
walk = steps.cumsum()
# From this we can begin to extract statistics like the minimum and maximum value along the walk’s trajectory:
walk.min()
walk.max()
# A more complicated statistic is the first crossing time, the step at which the random walk reaches a particular value.
# Here we might want to know how long it took the random walk to get at least 10 steps away from the origin 0 in either direction. np.abs(walk) >= 10 gives us a boolean array indicating where the walk has reached or exceeded 10, but we want the index of the first 10 or –10. Turns out, we can compute this using argmax, which returns the first index of the maximum value in the boolean array (True is the maximum value):
(np.abs(walk) >= 10).argmax()
# Note that using argmax here is not always efficient because it always makes a full scan of the array. In this special case, once a True is observed we know it to be the maximum value.

Simulating Many Random Walks at Once

nwalks = 5000
nsteps = 1000
draws = np.random.randint(0, 2, size=(nwalks, nsteps)) # 0 or 1
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(1)
walks.max()
walks.min()
# Out of these walks, let’s compute the minimum crossing time to 30 or –30. This is slightly tricky because not all 5,000 of them reach 30. We can check this using the any method:
hits30 = (np.abs(walks) >= 30).any(1)
hits30.sum()
crossing_times = (np.abs(walks[hits30]) >= 30).argmax(1)
crossing_times.mean()