
In my previous post, I gave an introduction to Python Libraries for Data Engineering and Data Science. In this post, we will have a first look at NumPy, one of the most important libraries to work with in Python.

NumPy is one of the most fundamental libraries for working with data. Other libraries such as Pandas build on it, so it is worth understanding NumPy first. The library focuses on easy transformations of vectors, matrices and arrays, and it provides a lot of functionality for that. But let’s get our hands dirty with the library and have a look at it!

Before you get started, please make sure to have the Sandbox set up and ready.

Getting started with NumPy

First of all, we need to import the library. This works with the following import statement in Python:

import numpy as np

This gives us access to the NumPy library under the short name “np”. Let us first create a 2-dimensional array with 3 rows of 5 values each. In NumPy, this works with the “arange” function. We provide “15” as the number of items and then let it re-shape to 3×5:

vals = np.arange(15).reshape(3,5)

This should now give us an output array with 2 dimensions: 3 rows, each containing 5 values. The values range from 0 to 14:

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
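To double-check what “reshape” produced, it can help to inspect the array's attributes. A small sketch (the attribute checks are my addition, not part of the walkthrough above):

```python
import numpy as np

# 15 consecutive integers (0..14), arranged as 3 rows of 5 columns
vals = np.arange(15).reshape(3, 5)

print(vals.ndim)   # number of dimensions (axes)
print(vals.shape)  # (rows, columns)
print(vals.size)   # total number of elements
```

Note that “reshape” only works if the total number of elements fits the new shape, which is why 15 items can become 3×5.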

NumPy also contains a lot of constants and functions. To use PI, you simply import “pi” from numpy:

from numpy import pi

We can now use PI for further work and calculations in Python.
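As a quick illustration of using the constant, the sketch below computes the circumference and area of a few circles. The radii and variable names are made up for this example:

```python
import numpy as np
from numpy import pi

# hypothetical radii, chosen only for illustration
radii = np.array([1.0, 2.0, 3.0])

circumference = 2 * pi * radii   # element-wise: 2 * pi * r
area = pi * radii**2             # element-wise: pi * r^2

print(circumference)
print(area)
```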

Simple Calculations with NumPy

Let’s create a new array with 5 values:

vl = np.arange(5)

An easy calculation is raising values to a power. This works with the “**” operator, which is applied to every element of the array:

nv = vl**2

Now, this should give us the following output:

array([ 0,  1,  4,  9, 16])

The same works for the power of 3, if we want to raise everything in the array to the third power:

nn = vl**3

And the output should be similar:

array([ 0,  1,  8, 27, 64])
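The “**” operator works element by element, and the same holds for the other arithmetic operators. A short sketch, assuming the same five-value array as above (“np.power” is NumPy's function form of “**”):

```python
import numpy as np

vl = np.arange(5)           # array([0, 1, 2, 3, 4])

squares = vl**2             # same as np.power(vl, 2)
cubes = np.power(vl, 3)     # function form of vl**3
combined = vl * 2 + 1       # other operators are element-wise too

print(squares)    # [ 0  1  4  9 16]
print(cubes)      # [ 0  1  8 27 64]
print(combined)   # [1 3 5 7 9]
```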

Working with Random Numbers in NumPy

NumPy contains the function “random” in its “random” module to create random numbers. This function takes the shape of the array to fill with numbers. We use a 3×3 array:

nr = np.random.random((3,3))
nr *= 100

Please note that random returns numbers between 0 (inclusive) and 1 (exclusive), so in order to create higher numbers we need to “stretch” them. We thus multiply by 100. The output should be something like this:

array([[90.30147522,  6.88948191,  6.41853222],
       [82.76187536, 73.37687372,  9.48770728],
       [59.02523947, 84.56571797,  5.05225463]])

Your numbers will be different, since we are working with random numbers here. We can do the same with a 3-dimensional array:

n3d = np.random.random((3,3,3))
n3d *= 100

Here too, your numbers will be different, but the overall “structure” should look like the following:

array([[[89.02863455, 83.83509441, 93.94264059],
        [55.79196044, 79.32574406, 33.06871588],
        [26.11848117, 64.05158411, 94.80789032]],

       [[19.19231999, 63.52128357,  8.10253043],
        [21.35001753, 25.11397256, 74.92458022],
        [35.62544853, 98.17595966, 23.10038137]],

       [[81.56526913,  9.99720992, 79.52580966],
        [38.69294158, 25.9849473 , 85.97255179],
        [38.42338734, 67.53616027, 98.64039687]]])
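Because the values are random, every run prints different numbers. If you need reproducible output, for example in a test, one option is to seed the generator first. A sketch, assuming NumPy's newer Generator interface (“np.random.default_rng”), which the walkthrough above does not use; the seed value 42 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, only for reproducibility

n3d = rng.random((3, 3, 3))       # floats in the half-open interval [0, 1)
n3d *= 100                        # stretch to [0, 100)

print(n3d.shape)
print(n3d.min() >= 0, n3d.max() < 100)
```

Running this twice with the same seed yields the same array, while dropping the seed brings back fresh random values on each run.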

Other means to work with Numbers in Python

NumPy provides several other options to work with data. There are several aggregation functions available that we can use. Let’s now look for the maximum value in the previously created array:
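A minimal sketch of this step, assuming the array's “max” method (“np.max(n3d)” is equivalent). The array is recreated here so the snippet runs on its own:

```python
import numpy as np

# recreate a random 3x3x3 array scaled to [0, 100)
n3d = np.random.random((3, 3, 3)) * 100

largest = n3d.max()   # maximum over all 27 values, returned as a single number
print(largest)
```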


In my example this would return 98.6. You would get a different number, since the values are random. It is also possible to return the maximum along a specific axis of an array. To do so, we add the keyword “axis” to the “max” function:
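A sketch of the axis variant, using the values printed earlier in this post so the result can be compared directly. Assuming the call is “max” with axis=2 (the last axis), each row of three values collapses to its largest entry:

```python
import numpy as np

# the 3x3x3 example values printed earlier in this post
n3d = np.array([
    [[89.02863455, 83.83509441, 93.94264059],
     [55.79196044, 79.32574406, 33.06871588],
     [26.11848117, 64.05158411, 94.80789032]],
    [[19.19231999, 63.52128357, 8.10253043],
     [21.35001753, 25.11397256, 74.92458022],
     [35.62544853, 98.17595966, 23.10038137]],
    [[81.56526913, 9.99720992, 79.52580966],
     [38.69294158, 25.9849473, 85.97255179],
     [38.42338734, 67.53616027, 98.64039687]],
])

row_max = n3d.max(axis=2)   # maximum along the last axis; result has shape (3, 3)
print(row_max)
```

Passing axis=0 or axis=1 works the same way and also returns a 3×3 array, but the maximum is then taken across the blocks or across the rows of each block.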


This returns the maximum along the chosen axis instead of a single number. In my example, taking the maximum along the last axis (axis=2), the results look like this:

array([[93.94264059, 79.32574406, 94.80789032],
       [63.52128357, 74.92458022, 98.17595966],
       [81.56526913, 85.97255179, 98.64039687]])

Another option is to compute the sum. We can do this over the entire array, or along one axis by providing the axis keyword:
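A minimal sketch of both variants, assuming the array's “sum” method; a small integer array is used here (my choice) so the results are easy to verify by hand:

```python
import numpy as np

vals = np.arange(15).reshape(3, 5)   # 3 rows, 5 columns, values 0..14

total = vals.sum()             # sum of all elements
col_sums = vals.sum(axis=0)    # collapse the rows: one sum per column
row_sums = vals.sum(axis=1)    # collapse the columns: one sum per row

print(total)      # 105
print(col_sums)   # [15 18 21 24 27]
print(row_sums)   # [10 35 60]
```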


In the next sample, we tidy up the data. This can be done by rounding the numbers to 2 decimal places:
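A minimal sketch of rounding, assuming the array's “round” method (“np.round(nr, 2)” is equivalent); the three sample values are taken from the first row of the 3×3 output earlier in this post:

```python
import numpy as np

nr = np.array([90.30147522, 6.88948191, 6.41853222])

rounded = nr.round(2)   # returns a new array; nr itself is unchanged
print(rounded)
```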


Iterating arrays in Python

Often, it is necessary to iterate over items. In NumPy, this can be achieved by using the built-in iterator, which we get from the function “nditer”. This function takes the array to iterate over, and we can then use it in a for loop:

for val in np.nditer(n3d):
    print(val)

The above sample iterates over all values in the array and prints each one. If we want to modify the items within the array, we need to set the parameter “op_flags” to “readwrite”. This enables us to modify the array while iterating over it. In the next sample, we round the values to whole numbers, iterate over each item and replace it with its remainder modulo 3:

n3d = n3d.round(0)

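# "it" instead of "iter" avoids shadowing Python's built-in iter()
with np.nditer(n3d, op_flags=['readwrite']) as it:
    for x in it:
        x[...] = x % 3  # write the remainder back into the array in place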
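For an operation as simple as a remainder, the explicit loop is not strictly needed, because NumPy's “%” operator already works element by element. The sketch below (my addition) runs both approaches on the same data and checks that they agree:

```python
import numpy as np

n3d = (np.random.random((3, 3, 3)) * 100).round(0)

# vectorized: remainder of every element in one expression
vectorized = n3d % 3

# iterator-based: modify a copy in place, element by element
looped = n3d.copy()
with np.nditer(looped, op_flags=['readwrite']) as it:
    for x in it:
        x[...] = x % 3

print(np.array_equal(vectorized, looped))   # True
```

The iterator remains useful when the per-element logic is harder to express as a single array expression.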

These are the basics of NumPy. In our next tutorial, we will have a look at Pandas: a very powerful dataframe library.

If you liked this post, you might also like the tutorial about Python itself, which gives you a good insight into the Python language for Spark. If you want to know more about Python, you should consider visiting the official page.