Skip to article frontmatterSkip to article content

Chapter 4.3 - numpy

Thus far, we have explored the “built-in” functionality of Python.

However, there is a large ecosystem of packages out there that do almost any task you can think of.

numpy (pronounced num-pie) is one of those extremely useful packages.

Why numpy?

You can think of numpy as a “supercharged” way of managing list data types.

Under the hood, the method calls you will be using are powered by very fast programming languages (C/C++/Cython). What numpy does is basically pass your information to those tools and then waits for the result. Once completed, the result is made available in regular Python format.

Due to this difference between lists and numpy data types, calculations on numerical data, particularly large amounts of numerical data, can be completed in a fraction of the time.

We can use the built-in %timeit functionality in Jupyter to demonstrate how lists and numpy compare when calculating the sum of a large amount of numbers.

First, build the list of integers and an numpy.ndarray of integers:

python_list = list(range(10000))

print(python_list[:10]) # first ten ints
print(python_list[-10:]) # last ten ints
print(len(python_list)) # length of python list
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[9990, 9991, 9992, 9993, 9994, 9995, 9996, 9997, 9998, 9999]
10000

Chapter 4.3.1 - Importing numpy

To use numpy, you first need to import the package (like we did with classes).

When you use the as keyword, you are creating an alias. In almost all examples you will see online, the alias for numpy is np.

The combination looks like this:

import numpy as np

Which is equivalent to

import numpy

But instead of typing out numpy every time to use with dot notation, you just type np

For example, if you wanted to import numpy to use its arange method, you can run the following code with the np alias:

import numpy as np

numbers = np.arange(100)

Similarly, you can just import numpy without the alias:

import numpy

numbers = numpy.arange(100)

Both are correct, and do the same thing, but the first example requires less typing, is more convenient, and matches up with most online examples and codebase formatting choices.

Once imported, you can use dot notation to access numpy methods.

One such method named arange is very similar to Python’s range() method.

In other words,

np.arange(10000)

is the same as

list(range(10000))

except the first case will give you an numpy.ndarray datatype, instead of a list data type. This is important, because numpy.ndarray is much more flexible than list when it comes to complex calculations. There are certainly cases where you will still use list in your career, but often it is just a preprocessing (i.e., preparing and cleaning data) step before converting the list to a numpy.ndarray (i.e., casting!). This is very easy to do and I will show examples later in the chapter.

Chapter 4.3.2 - The speed of numpy methods vs. built-in functions

The following example creates a numpy.ndarray of int that range from 0 to 9999. numpy is also not inclusive on the last number for arange. So I pass it 10000 to be equivalent with range that does not include the last number (i.e., 10000). The same slicing and indexing rules we learned with list generally apply (i.e., the last number in the slice in both list and numpy.ndarray is not inclusive), with some exceptions / nuances we will discuss and examine later.

import numpy as np

numpy_array = np.arange(10000)

print(numpy_array[:10]) # first ten ints
print(numpy_array[-10:]) # last ten ints
print(len(numpy_array)) # length of numpy list
[0 1 2 3 4 5 6 7 8 9]
[9990 9991 9992 9993 9994 9995 9996 9997 9998 9999]
10000

Next, use %timeit to calculate the sum of a Python list:

%timeit sum(python_list)
47.8 μs ± 136 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Similarly, use %timeit to calculate the sum of an numpy.ndarray.

In this case, we import numpy with the alias of np and use its sum method. This method works just like the sum method we used before--it adds up all of the values in a list and returns that value to you.

import numpy as np

%timeit np.sum(numpy_array)
1.44 μs ± 14.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

For this simple example, the numpy version over 5 times faster, on average, compared to list.

A difference of ~100 microseconds (0.0001 seconds) might not mean much to you, but imagine doing this same calculation 1 billion times.

This might seem like an impossible situation, but consider this. A ~4-km WRF (Weather Research and Forecasting Model) simulation containing the CONUS can have over 1 million grids, each with a float value.

A climate change dataset called “WRF BCC” used by researchers in our department has simulation grid with ~1 million grid points. We have output every 15 minutes, and our study periods span 75 years.

  • 1x10641 x 10^{6} * 4 = 4 million floats per hour

  • 4x106244 x 10^{6} * 24 = 96 million floats per day

  • 9.6x1073659.6 x 10^{7} * 365 = 35 billion floats per year

  • 3.5x1010753.5 x 10^{10} * 75 = 2.6 trillion floats in the dataset

In other words, if you would like to run a simple calculation like a temperature conversion on the entire dataset, you would need to run that calculation 2.6 trillion times. This can really add up the microseconds quickly! And trust me when I say, datasets are not getting any smaller.

In other words, if a Python solution for task on “WRF BCC” took 16 microseconds for numpy.ndarray, it would take hundreds of days to complete. If it took 113 microseconds for list, it would take thousands of days. Luckily there are even better solutions than numpy that I will talk about in my course EAE 483 / 583 - Data Science for the Geosciences

mean comparison

Other cases can produce an even larger speed up. We can try a mean calculation, which could give us close to a 10x speed up!

import numpy as np

# create python list of 10,000 numbers
python_list = list(range(10000))

# create numpy array of 10,000 numbers
numpy_array = np.arange(10000)

# Python built-in mean:
%timeit sum(python_list) / len(python_list)

# numpy mean:
%timeit np.mean(numpy_array)
47.3 μs ± 653 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
5.33 μs ± 62.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Chapter 4.3.3 - Basic numpy usage

You only need to import once in your Notebook:

import numpy as np

Finding the mean of a list of numbers

numpy can interface with Python lists in many cases.

For example, you can pass a list as an argument to the numpy.mean() method:

a = [1, 2, 3, 4, 5]

print("mean of", a, "=", np.mean(a))
mean of [1, 2, 3, 4, 5] = 3.0

We can examine the mean method by using the help() function. This usually gives examples on how to use the method at the end of the “doc string”:

help(np.mean)
Help on _ArrayFunctionDispatcher in module numpy:

mean(
    a,
    axis=None,
    dtype=None,
    out=None,
    keepdims=<no value>,
    *,
    where=<no value>
)
    Compute the arithmetic mean along the specified axis.

    Returns the average of the array elements.  The average is taken over
    the flattened array by default, otherwise over the specified axis.
    `float64` intermediate and return values are used for integer inputs.

    Parameters
    ----------
    a : array_like
        Array containing numbers whose mean is desired. If `a` is not an
        array, a conversion is attempted.
    axis : None or int or tuple of ints, optional
        Axis or axes along which the means are computed. The default is to
        compute the mean of the flattened array.

        If this is a tuple of ints, a mean is performed over multiple axes,
        instead of a single axis or all the axes as before.
    dtype : data-type, optional
        Type to use in computing the mean.  For integer inputs, the default
        is `float64`; for floating point inputs, it is the same as the
        input dtype.
    out : ndarray, optional
        Alternate output array in which to place the result.  The default
        is ``None``; if provided, it must have the same shape as the
        expected output, but the type will be cast if necessary.
        See :ref:`ufuncs-output-type` for more details.
        See :ref:`ufuncs-output-type` for more details.

    keepdims : bool, optional
        If this is set to True, the axes which are reduced are left
        in the result as dimensions with size one. With this option,
        the result will broadcast correctly against the input array.

        If the default value is passed, then `keepdims` will not be
        passed through to the `mean` method of sub-classes of
        `ndarray`, however any non-default value will be.  If the
        sub-class' method does not implement `keepdims` any
        exceptions will be raised.

    where : array_like of bool, optional
        Elements to include in the mean. See `~numpy.ufunc.reduce` for details.

        .. versionadded:: 1.20.0

    Returns
    -------
    m : ndarray, see dtype parameter above
        If `out=None`, returns a new array containing the mean values,
        otherwise a reference to the output array is returned.

    See Also
    --------
    average : Weighted average
    std, var, nanmean, nanstd, nanvar

    Notes
    -----
    The arithmetic mean is the sum of the elements along the axis divided
    by the number of elements.

    Note that for floating-point input, the mean is computed using the
    same precision the input has.  Depending on the input data, this can
    cause the results to be inaccurate, especially for `float32` (see
    example below).  Specifying a higher-precision accumulator using the
    `dtype` keyword can alleviate this issue.

    By default, `float16` results are computed using `float32` intermediates
    for extra precision.

    Examples
    --------
    >>> import numpy as np
    >>> a = np.array([[1, 2], [3, 4]])
    >>> np.mean(a)
    2.5
    >>> np.mean(a, axis=0)
    array([2., 3.])
    >>> np.mean(a, axis=1)
    array([1.5, 3.5])

    In single precision, `mean` can be inaccurate:

    >>> a = np.zeros((2, 512*512), dtype=np.float32)
    >>> a[0, :] = 1.0
    >>> a[1, :] = 0.1
    >>> np.mean(a)
    np.float32(0.54999924)

    Computing the mean in float64 is more accurate:

    >>> np.mean(a, dtype=np.float64)
    0.55000000074505806 # may vary

    Computing the mean in timedelta64 is available:

    >>> b = np.array([1, 3], dtype="timedelta64[D]")
    >>> np.mean(b)
    np.timedelta64(2,'D')

    Specifying a where argument:

    >>> a = np.array([[5, 9, 13], [14, 10, 12], [11, 15, 19]])
    >>> np.mean(a)
    12.0
    >>> np.mean(a, where=[[True], [False], [False]])
    9.0

sorting a list: np.sort(...)

numpy can seamlessly work with Python lists. In this example, you can see that numpy will sort a list without needing to type cast beforehand.

a = [3, 2, 1, 5, 4]

print("sorted", a, "=", np.sort(a))
sorted [3, 2, 1, 5, 4] = [1 2 3 4 5]

finding the maximum value in a list: np.max(...)

a = [3, 2, 1, 5, 4]

print(np.max(a))
5

finding the minimum value in a list: np.min(...)

a = [3, 2, 1, 5, 4]

print(np.min(a))
1

Other methods you can use for simple calculations on a list:

  1. np.sum(a)

    finds the sum of all numbers in a

  2. np.median(a)

    finds the median of all numbers in a

  3. np.unique(a)

    finds all unique values in a (similar to set)

  4. np.std(a)

    finds the standard deviation of numbers in a

Chapter 4.3.4 - Easy math operations with np.array

Numpy arrays allow fast, elementwise mathematical operations.

Here is how you would multiply every number in a Python list by 10:

old = [1, 2, 3, 4, 5]

print("before multiplication", old)

for i in range(len(old)):
    
    old[i] = old[i] * 10

print("After math", old)
before multiplication [1, 2, 3, 4, 5]
After math [10, 20, 30, 40, 50]

You would have to visit every index, multiply itself by 10, and then set the result to the index position.

Compare this to numpy, where the only extra step is converting a python list to a numpy.ndarray

np.array is a method that has one argument. The argument should be a composite data type (like a list)the function ‘returns’ a numpy array representation of that list.

In this case, the value returned by np.array is set equal to the variable a.

a = np.array([1, 2, 3, 4, 5]) 

print("Before math", a)

a = a * 10 # multiplies every value in 'a' by 10, automatically!

print("After math", a)
Before math [1 2 3 4 5]
After math [10 20 30 40 50]

Although they give the same result, numpy is much easier to work with because it is one line with no loops!

Chapter 4.3.5 - numpy indexing

Python list and numpy arrays have the same indexing approach:

  1. Indexing: Get the 3rd value in a ....

Python listNumpy array
my_list[2]my_array[2]
  1. Indexing: Get the 2nd through 3rd values in a...

Python listNumpy array
my_list[1:3]my_array[1:3]
  1. Indexing: Skip every other value in a...

Python listNumpy array
my_list[::2]my_array[::2]
  1. Indexing: Reverse the order of a...

Python listNumpy array
my_list[::-1]my_array[::-1]

Chapter 4.3.6 - N-dimensional data

If you have more than 1 dimension, numpy can handle visiting every index location automatically.

To work with a 2D python list, you would have to use nested for loops:

a = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]

print("Before multiply", a)

for i in range(2):
    for j in range(5):
        a[i][j] = a[i][j] * 10

print("After multiply", a)
Before multiply [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
After multiply [[10, 20, 30, 40, 50], [10, 20, 30, 40, 50]]

This is much easier to deal with using numpy:

import numpy as np

a = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
a_n = np.array(a)

print("Before multiply", a_n)

a_n = a_n * 10

print("After multiply", a_n)
Before multiply [[1 2 3 4 5]
 [1 2 3 4 5]]
After multiply [[10 20 30 40 50]
 [10 20 30 40 50]]