NumPy arrays are the starting point for nearly all hard math and science work in Python.
NumPy is the most popular mathematics library for Python. NumPy takes a big step toward making Python as fast as C for serious mathematical computations. There are hundreds of scientific and mathematical libraries in Python that just could not exist without NumPy. We will cover several of these libraries in this course: SciPy, matplotlib, pandas, and netCDF4.
For sure, NumPy is a big math library with more than just np.array. But you have to start somewhere, so this lecture will focus on NumPy arrays.
NumPy is the first third-party library we will use in this class. But it won't be the last. There are a ton of amazing tools written for Python that you as a scientist/engineer/geek/whatever will want to use. But they don't come pre-packaged with Python. You will have to install them separately.
You will want Python v3.3 (or newer) to use NumPy and all of the other libraries that require it.
You can find instructions for installing NumPy here.
Please note: the NumPy group has said they will be dropping support for Python 2.X on Jan 1, 2020. Since this library is the basis of nearly all science and engineering work in Python, it will be very important that you move from Python 2.X to Python 3.X at some point.
Consider installing Anaconda instead. Anaconda is Python packaged with hundreds of tools and libraries that you will want. (This includes NumPy and everything else we will use in this course.)
People use NumPy a lot, and as such, they get tired of doing:
import numpy
numpy.array
numpy.ones
# numpy.whatever
So, the unofficial community standard import is to do:
import numpy as np
And then to do:
np.array
np.ones
# np.whatever
This "renaming the import" is super common, so we will try to use it here.
The list is the standard ordered-sequence data structure in Python. The Python list is an extremely flexible tool. But it turns out that flexibility costs speed. NumPy introduces its own data structure, the array.
One of the first differences you will find is that, unlike lists, all of the items in a NumPy np.array have to be of the same type.
>>> import numpy as np
>>>
>>> lst = [1, 2, 3, 4.5]
>>> lst
[1, 2, 3, 4.5]
>>> a = np.array([1, 2, 3, 4.5])
>>> a
array([ 1., 2., 3., 4.5])
Do you see what happened? NumPy automatically typecast all of the elements in the array to be of the same type. And since you would lose information going from 4.5 to 4, all of the elements in your array had to become decimals.
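As a rough illustration of this upcasting, you can ask an array which common type NumPy picked by looking at its dtype attribute. The variable names here are just for this sketch:

```python
import numpy as np

ints = np.array([1, 2, 3])          # all integers, so NumPy picks an integer type
mixed = np.array([1, 2, 3, 4.5])    # one decimal forces every element to a float

print(ints.dtype.kind)   # 'i' means a signed integer type
print(mixed.dtype)       # float64
print(mixed[0])          # 1.0, the integer 1 was stored as a float
```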
As well as having its own data structure, NumPy goes one step further and has its own types:
>>> type(lst[0])
<class 'int'>
>>> type(a[0])
<class 'numpy.float64'>
The NumPy library tries to default all of your numbers (integers, decimals, etc.) to 64-bit versions. And there are NumPy alternatives to all the normal Python primitive types:
- int --> int64 (though int16 and int32 are available)
- float --> float64 (though float16 and float32 are available)
There are, of course, many other data types in NumPy. For a full list, look here
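To make the smaller types a little more concrete, here is a minimal sketch that stores the same numbers at two sizes. It uses the itemsize attribute (bytes per element) and the astype method, neither of which is covered above:

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.int16)   # 16-bit integers
b = a.astype(np.float32)                  # astype returns a converted copy

print(a.itemsize)   # 2 bytes per element
print(b.itemsize)   # 4 bytes per element
print(b.dtype)      # float32
```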
One difference between lists and NumPy arrays is that arrays don't just have to be one-dimensional:
>>> import numpy as np
>>>
>>> np.array([[1, 2, 3], [7, 8, 9]])
array([[1, 2, 3],
[7, 8, 9]])
>>>
>>> np.array([[1, 2, 3], [7, 8, 9]], dtype=float)
array([[ 1., 2., 3.],
[ 7., 8., 9.]])
And if you start out with a 1D array, you can make a 2D array using reshape:
>>> a = np.array([1, 2, 3, 4.5])
>>> b = a.reshape(2, 2)
>>> b
array([[ 1. , 2. ],
[ 3. , 4.5]])
The .reshape() method is really pretty smart. Whenever it can, it avoids moving any of the data around in memory, which would be quite slow. All it does is change how you access the data. This is an extremely convenient feature that will almost always make your life easier.
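One way to see that the data is shared rather than copied is to write through the reshaped array and look at the original. This is only a sketch; np.shares_memory is a NumPy helper that has not been introduced in this lecture:

```python
import numpy as np

a = np.arange(6)          # array([0, 1, 2, 3, 4, 5])
b = a.reshape(2, 3)       # same data, viewed as 2 rows of 3

print(np.shares_memory(a, b))   # True
b[0, 0] = 99                    # write through the reshaped array...
print(a[0])                     # 99, ...and the original sees the change
```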
What do you think will happen if you run this code?
>>> a = np.array([1, 2, 3, 4.5])
>>> c = a.reshape(3, 3)
You can use numpy.arange to fill a numpy.array much like you used range to fill a Python-standard list:
>>> count = list(range(5))
>>> count
[0, 1, 2, 3, 4]
>>>
>>> c = np.arange(18)
>>> c
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17])
The numpy.arange function can work like it does above, or it can take three parameters: min, max, and step:
>>> np.arange(2, 15, 4)
array([ 2, 6, 10, 14])
What do you think the following code will produce?
>>> np.arange(2, 15)
Here's a quick example using np.arange and reshape together:
>>> d = np.arange(18).reshape(3,6)
>>> d
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17]])
You can also create a multi-dimensional np.array right from the start:
>>> e = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])
>>> e
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
And, unlike the standard Python libraries, NumPy will let you define the type of the array:
>>> f = np.array([[0, 1, 2, 3], [4, 5, 6, 7]], np.float32)
>>> f
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.]])
Frequently, you will want to initialize an np.array with all zero values:
>>> z = np.zeros(5, dtype=np.int64)
>>> z
array([0, 0, 0, 0, 0])
>>>
>>> y = np.zeros((2, 3), dtype=np.float32)
>>> y
array([[ 0., 0., 0.],
[ 0., 0., 0.]])
Similarly, you can use ones to initialize an array to all 1 values:
>>> np.ones(4)
array([ 1., 1., 1., 1.])
>>>
>>> np.ones((2, 5), dtype=np.int64)
array([[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1]])
One of the great things about numpy is that it makes operating on every element of an array super easy. For instance, if you want to add or subtract two arrays:
>>> a = np.array([1, 2, 3])
>>> b = np.array([-1, 2, 3])
>>>
>>> a + b
array([0, 4, 6])
>>>
>>> a - b
array([2, 0, 0])
And you can do math on numpy arrays using just regular numbers:
>>> 2 * a
array([2, 4, 6])
>>>
>>> a + 1
array([2, 3, 4])
And you can combine the two:
>>> 2 * (a + b) - 4
array([-4, 4, 8])
This functionality saves a lot of tedious code writing. And the resulting operations are usually much faster than they would be written using Python lists.
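To show what is being saved, here is the same calculation written twice: once with plain Python lists and an explicit loop, and once with arrays. The names xs, ys, list_result, and array_result are only for this sketch:

```python
import numpy as np

xs = [1, 2, 3]
ys = [-1, 2, 3]

# With plain lists you have to spell out the element-by-element loop.
list_result = [2 * (x + y) - 4 for x, y in zip(xs, ys)]

# With arrays the same expression applies to every element at once.
a = np.array(xs)
b = np.array(ys)
array_result = 2 * (a + b) - 4

print(list_result)    # [-4, 4, 8]
print(array_result)   # [-4  4  8]
```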
What would you expect this to produce?
>>> 2 * (1 - a + b)
Another great feature of numpy arrays is the huge variety of helper attributes and methods.
Use .ndim to determine how many dimensions your multi-dimensional np.array has:
>>> r = np.zeros((3, 2), dtype=np.float64)
>>> r
array([[ 0., 0.],
[ 0., 0.],
[ 0., 0.]])
>>> r.ndim
2
And another example:
>>> cube = np.zeros((2, 2, 2), dtype=np.float64)
>>> cube.ndim
3
Use .shape to get more information about the structure of your np.array:
>>> r.shape
(3, 2)
>>>
>>> cube.shape
(2, 2, 2)
Most frequently, I use .shape to get just one of the dimensions of the np.array:
>>> r.shape[0]
3
Use .dtype to get the type of the elements in an array:
>>> r.dtype
dtype('float64')
Use flatten to convert a multi-dimensional array to a single dimension:
>>> a = np.array([[2,3,4],[7,8,9]])
>>> a
array([[2, 3, 4],
[7, 8, 9]])
>>>
>>> a.flatten()
array([2, 3, 4, 7, 8, 9])
Note that, unlike reshape, flatten always returns a copy of the data, so changing the flattened array will not change the original. If you want to avoid the copy when possible, use ravel instead.
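If you are curious whether a flattened array still shares its data with the original, you can check directly. The sketch below compares flatten with the related ravel method, using np.shares_memory (a NumPy helper not covered in this lecture):

```python
import numpy as np

a = np.array([[2, 3, 4], [7, 8, 9]])

flat = a.flatten()   # always an independent copy
rav = a.ravel()      # a view of the same data when the layout allows it

print(np.shares_memory(a, flat))   # False
print(np.shares_memory(a, rav))    # True

flat[0] = -1         # changing the copy...
print(a[0, 0])       # 2, ...leaves the original alone
```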
Use transpose to swap the rows and columns of your array:
>>> a = np.array([[2, 3, 4], [7, 8, 9]])
>>> a
array([[2, 3, 4],
[7, 8, 9]])
>>>
>>> a.transpose()
array([[2, 7],
[3, 8],
[4, 9]])
Alternatively, you can just use the shorthand .T to do the same thing.
>>> a = np.array([[2, 3, 4], [7, 8, 9]])
>>> a.T
array([[2, 7],
[3, 8],
[4, 9]])
>>>
>>> a
array([[2, 3, 4],
[7, 8, 9]])
Notice that neither of these changes a itself: a keeps its original shape. What you get back is a new array object, though it is a view that shares its data with a rather than a full copy.
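A quick way to check how the transposed array relates to the original is to compare shapes and then write through the transpose. This is a sketch using np.shares_memory, which is not covered above:

```python
import numpy as np

a = np.array([[2, 3, 4], [7, 8, 9]])
t = a.T

print(a.shape)                  # (2, 3), the original keeps its shape
print(t.shape)                  # (3, 2)
print(np.shares_memory(a, t))   # True, the transpose is a view of the same data

t[0, 1] = 70                    # t[0, 1] is the same element as a[1, 0]
print(a[1, 0])                  # 70
```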
NumPy even has mathematical functions designed to act on entire arrays. There are a lot of them, like sqrt:
>>> a = np.array([1, 4, 9, 25, 144, 81])
>>> np.sqrt(a)
array([ 1., 2., 3., 5., 12., 9.])
Use ceil and floor to round NumPy np.float64 values up or down to the nearest integer:
>>> a = np.array([1.001, 2.49, 2.5, 3.5001, 9.9])
>>>
>>> np.ceil(a)
array([ 2., 3., 3., 4., 10.])
>>> np.floor(a)
array([ 1., 2., 2., 3., 9.])
What would you expect this to return?
>>> x = np.array([3.912, 15.8999, 35.98989])
>>> np.floor(np.sqrt(x))
NumPy can also compute the sum and product of all the elements in an array:
>>> a = np.array([1, 2, 3, 4, 5, 6])
>>>
>>> np.sum(a)
21
>>> np.prod(a)
720
And since they are built into NumPy, sum and prod can handle multi-dimensional arrays:
>>> m = np.array([[1, 2, 3], [4, 5, 6]])
>>> m
array([[1, 2, 3],
[4, 5, 6]])
>>> np.sum(m)
21
>>> np.prod(m)
720
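For multi-dimensional arrays you often want a total per row or per column rather than one grand total. Both np.sum and np.prod accept an axis argument for that; the sketch below reuses the same m as above:

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

col_sums = np.sum(m, axis=0)    # collapse the rows: one sum per column
row_sums = np.sum(m, axis=1)    # collapse the columns: one sum per row
row_prods = np.prod(m, axis=1)  # one product per row

print(col_sums)    # [5 7 9]
print(row_sums)    # [ 6 15]
print(row_prods)   # [  6 120]
```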
Use sort to put the elements of a 1D array in order:
>>> a = np.array([1, 2, 3, 4, 5, 4, 3, 2, 1])
>>> np.sort(a)
array([1, 1, 2, 2, 3, 3, 4, 4, 5])
And if you apply sort to a multi-dimensional array, it will sort each row separately (that is, along the last axis):
>>> m = np.array([[9, 4, 2], [1, 0, -3]])
>>> m
array([[ 9, 4, 2],
[ 1, 0, -3]])
>>> np.sort(m)
array([[ 2, 4, 9],
[-3, 0, 1]])
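The direction of the sort can be changed with the axis argument of np.sort. As a small sketch using the same m: axis=0 sorts down each column, and axis=None flattens the array before sorting.

```python
import numpy as np

m = np.array([[9, 4, 2], [1, 0, -3]])

by_row = np.sort(m)                   # default: sort along the last axis (each row)
by_col = np.sort(m, axis=0)           # sort down each column instead
flat_sorted = np.sort(m, axis=None)   # flatten first, then sort everything

print(by_row.tolist())        # [[2, 4, 9], [-3, 0, 1]]
print(by_col.tolist())        # [[1, 0, -3], [9, 4, 2]]
print(flat_sorted.tolist())   # [-3, 0, 1, 2, 4, 9]
```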
A related function is argsort, which instead returns the indices of the sorted elements:
>>> x = np.array([2, 1, 4, 3, 5])
>>> np.argsort(x)
array([1, 0, 3, 2, 4])
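The indices from argsort are mostly useful for reordering things. In this sketch they are used to index the original array, and also a second, made-up labels array so that it follows the same order. Indexing an array with an array of indices is a NumPy feature not covered in this lecture.

```python
import numpy as np

x = np.array([2, 1, 4, 3, 5])
labels = np.array(["b", "a", "d", "c", "e"])   # hypothetical labels, one per value

order = np.argsort(x)     # positions that would put x in sorted order

print(order.tolist())           # [1, 0, 3, 2, 4]
print(x[order].tolist())        # [1, 2, 3, 4, 5], same as np.sort(x)
print(labels[order].tolist())   # ['a', 'b', 'c', 'd', 'e']
```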
Use clip if you want to set the max and min value allowed in your array:
>>> a = np.array([1, 2, 3, 0, -32, 99, 999])
>>>
>>> np.clip(a, 0, 10000)
array([ 1, 2, 3, 0, 0, 99, 999])
>>> np.clip(a, -999, 1)
array([ 1, 1, 1, 0, -32, 1, 1])
This simply goes through your array and converts any values below your MIN to MIN, and any values above your MAX to MAX.
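Put another way, clipping gives the same result as taking an element-wise maximum against MIN and then an element-wise minimum against MAX. The sketch below checks that with np.maximum and np.minimum, which are not otherwise covered here; the bounds 0 and 100 are arbitrary.

```python
import numpy as np

a = np.array([1, 2, 3, 0, -32, 99, 999])

clipped = np.clip(a, 0, 100)
manual = np.minimum(np.maximum(a, 0), 100)   # raise to MIN, then cap at MAX

print(clipped.tolist())                  # [1, 2, 3, 0, 0, 99, 100]
print(np.array_equal(clipped, manual))   # True
```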
You can convert a numpy.array to a standard Python list using list():
>>> a = np.array([1, 4, 1, 5, 9])
>>> a
array([1, 4, 1, 5, 9])
>>> list(a)
[1, 4, 1, 5, 9]
But this might not behave like you expect for a multidimensional array. It just returns a list of arrays:
>>> m = np.array([[1, 2, 3], [7, 8, 9]])
>>> m
array([[1, 2, 3],
[7, 8, 9]])
>>> list(m)
[array([1, 2, 3]), array([7, 8, 9])]
So, numpy provides the tolist() method, which will convert deep into the array structure:
>>> m.tolist()
[[1, 2, 3], [7, 8, 9]]
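There is one more difference that is easy to miss even for 1D arrays: list() keeps the NumPy number types for each element, while tolist() converts them to plain Python numbers. A small sketch:

```python
import numpy as np

a = np.array([1, 4, 1, 5, 9])

from_list = list(a)        # a list of NumPy integers
from_tolist = a.tolist()   # a list of plain Python ints

print(isinstance(from_list[0], np.integer))   # True
print(type(from_tolist[0]) is int)            # True
```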
There are two convenient functions for combining arrays in numpy, concatenate and vstack:
>>> import numpy as np
>>>
>>> a = np.array([1,2,3,4,5])
>>> b = np.array([9,8,7,6,5])
>>>
>>> np.concatenate((a, b))
array([1, 2, 3, 4, 5, 9, 8, 7, 6, 5])
>>>
>>> np.vstack((a, b))
array([[1, 2, 3, 4, 5],
[9, 8, 7, 6, 5]])
Both of these functions work on multi-dimensional arrays as well, and higher-dimensional math is always more fun:
>>> c = np.array([[1,2,3], [4,5,6]])
>>> d = np.array([[5,6,7], [8,9,0]])
>>>
>>> np.concatenate((c, d))
array([[1, 2, 3],
[4, 5, 6],
[5, 6, 7],
[8, 9, 0]])
>>>
>>> np.vstack((c, d))
array([[1, 2, 3],
[4, 5, 6],
[5, 6, 7],
[8, 9, 0]])
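In the 2D example both functions stack the arrays on top of each other. To join them side by side instead, one option is the axis argument of concatenate, or the companion function np.hstack (not covered above). A minimal sketch with the same c and d:

```python
import numpy as np

c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[5, 6, 7], [8, 9, 0]])

side_by_side = np.concatenate((c, d), axis=1)   # join along the columns
also_side = np.hstack((c, d))                   # same result for 2D arrays

print(side_by_side.tolist())   # [[1, 2, 3, 5, 6, 7], [4, 5, 6, 8, 9, 0]]
print(side_by_side.shape)      # (2, 6)
```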
Well, now that we've seen the basics of NumPy arrays let's try using them for something.
NumPy also has a lot of tools built in to help you generate random numbers. We will not cover the topic of random number generation in detail, as it is a whole field unto itself. If this topic interests you, start your research here. There are many different distributions of random numbers, and though we will only cover two, there are many more supported by NumPy that you can read about in the documentation.
When we say a distribution of random numbers is flat, we mean that the numbers generated are evenly distributed between the minimum and maximum. In NumPy, when generating random decimals, the default minimum is 0.0 (inclusive) and the default maximum is 1.0 (exclusive).
Use random.rand to fill a NumPy array with random float64 values between 0.0 and 1.0:
>>> np.random.rand(1)
array([ 0.05895439])
>>> np.random.rand(3)
array([ 0.3581962, 0.5377904, 0.0094921])
>>> np.random.rand(2, 3)
array([[ 0.35675058, 0.51579755, 0.03851769],
[ 0.74684991, 0.55219055, 0.37000399]])
Use random.randint to fill a NumPy array with random int64 values, where you can set the min and max integers, as well as the array dimensions.
You can just provide a maximum integer (min defaults to zero, max is exclusive):
>>> np.random.randint(5)
0
>>> np.random.randint(5)
4
>>> np.random.randint(5)
3
>>> np.random.randint(5)
2
Or you can provide a min and a max (min inclusive, max exclusive):
>>> np.random.randint(5, 10)
9
>>> np.random.randint(5, 10)
5
>>> np.random.randint(5, 10)
5
>>> np.random.randint(5, 10)
7
Or you can create an entire array of random integers by providing the dimensions of the array as a third parameter:
>>> np.random.randint(1, 10, 3)
array([5, 2, 9])
>>>
>>> np.random.randint(5, 10, (2, 3))
array([[5, 6, 9],
[8, 9, 6]])
>>>
>>> np.random.randint(1, 10, (3, 5))
array([[5, 4, 7, 1, 4],
[6, 5, 5, 5, 4],
[6, 9, 8, 7, 1]])
You can use random.choice to select an element from a 1D array (multidimensional arrays won't work):
>>> a = np.array([1, 2, 3, 4, 5, 6, 7])
>>> np.random.choice(a)
5
>>> np.random.choice(a)
7
>>> np.random.choice(a)
1
The choice function draws from a flat distribution, because each element in the array is equally likely to be selected.
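Because every run produces different numbers, results like the ones above are hard to reproduce. If you need repeatable output (for a test, say), one common approach is to set a seed with np.random.seed, which is not covered above. The seed value 42 and the dice example are arbitrary choices for this sketch:

```python
import numpy as np

np.random.seed(42)           # any fixed integer works; 42 is arbitrary
first = np.random.rand(3)

np.random.seed(42)           # reset to the same starting point
second = np.random.rand(3)

print(np.array_equal(first, second))   # True, same seed gives the same numbers

# 1000 simulated dice rolls: min is inclusive, max is exclusive
rolls = np.random.randint(1, 7, 1000)
print(rolls.min() >= 1, rolls.max() <= 6)   # True True
```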
When random numbers are generated with a Normal Distribution, the average value is zero but the numbers can be decimals anywhere from infinity to negative infinity. In NumPy, the standard deviation of the normal distribution of random numbers is 1:
Use np.random.randn to produce an np.array of np.float64 values, with a Normal Distribution (centered around zero, with a standard deviation of 1):
>>> np.random.randn(1)
array([ 0.82712644])
>>> np.random.randn(4)
array([-0.0518932 , 1.02017916, -0.50273024, 0.63187314])
And, again, we can create higher-dimensional arrays:
>>> np.random.randn(2, 4)
array([[-0.1366172 , -0.41921541, 1.98640058, -0.75165991],
[ 1.69984245, 0.65345415, -1.90558238, -0.41176329]])
>>>
>>> np.random.randn(2, 2, 2)
array([[[ 0.16383478, -0.03612812],
[ 0.03078127, 0.54628765]],
[[ 0.23479626, 1.0837927 ],
[-0.50655975, -0.6393057 ]]])
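With only a handful of values it is hard to see the "centered around zero, standard deviation of 1" behavior, so here is a sketch that draws a large sample and measures it with the array methods .mean() and .std() (not covered above). It also shows one common way to get a different center and spread: multiply and add. The seed, the sample size, and the target values 10 and 2 are arbitrary.

```python
import numpy as np

np.random.seed(0)                    # fixed seed so the sketch is repeatable
samples = np.random.randn(100000)    # a large sample from the standard normal

print(round(samples.mean(), 1))      # close to 0
print(round(samples.std(), 1))       # close to 1

# Shift and scale: center of 10, standard deviation of 2
shifted = 10 + 2 * samples
print(round(shifted.mean(), 1))      # close to 10
print(round(shifted.std(), 1))       # close to 2
```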
A common desire is to randomly order an existing sequence of values. NumPy provides two basic ways to do that.
Use random.shuffle if you want to randomly switch all the elements in a NumPy array in place:
>>> a = np.array([1, 2, 3, 4, 5])
>>> np.random.shuffle(a)
>>> a
array([4, 1, 5, 3, 2])
>>> np.random.shuffle(a)
>>> a
array([1, 3, 5, 2, 4])
The caveat here is that this shuffling is not deep. For a multi-dimensional array, it will only shuffle the outermost arrays:
>>> m = np.array([[1, 2, 3], [4, 5, 6]])
>>> m
array([[1, 2, 3],
[4, 5, 6]])
>>> np.random.shuffle(m)
>>> m
array([[4, 5, 6],
[1, 2, 3]])
>>> np.random.shuffle(m)
>>> m
array([[1, 2, 3],
[4, 5, 6]])
Use permutation if you don't want to alter the original array, but just create a randomized version of it:
>>> a = np.array([1, 2, 3, 4, 5])
>>> m = np.array([[1, 2, 3], [4, 5, 6]])
>>>
>>> np.random.permutation(a)
array([3, 4, 1, 2, 5])
>>> np.random.permutation(a)
array([2, 4, 1, 3, 5])
>>>
>>> np.random.permutation(m)
array([[1, 2, 3],
[4, 5, 6]])
>>> np.random.permutation(m)
array([[4, 5, 6],
[1, 2, 3]])
>>>
>>> a
array([1, 2, 3, 4, 5])
>>> m
array([[1, 2, 3],
[4, 5, 6]])
The difference between random.shuffle and random.permutation is very similar to the difference we saw between .sort() and sorted() for lists. The first one alters the sequence "in place", and the second one doesn't alter the sequence, but creates an altered version of it.
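The in-place versus new-array difference can be checked directly. In this sketch, note that shuffle returns None (just like list.sort()), so assigning its result to a variable is a common mistake:

```python
import numpy as np

a = np.arange(5)
result = np.random.shuffle(a)    # reorders a itself
print(result)                    # None, there is no new array to return

b = np.arange(5)
p = np.random.permutation(b)     # returns a reordered copy
print(b.tolist())                # [0, 1, 2, 3, 4], the original is untouched
print(sorted(p.tolist()))        # [0, 1, 2, 3, 4], same values in some new order
```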
Oh no.
This class is meant to give an introduction and foundation to NumPy, not cover all the deep corners of the library. NumPy has a lot more tools that you might find useful: treating 2D arrays as matrices, Fourier transforms, polynomials, linear algebra, and statistics. But as long as you take the time to understand the numpy array and the numpy data types, the rest of the library should be approachable.
We will cover NumPy statistics in the SciPy class. For a full reference on what is available in NumPy, look in the official documentation.