Apache Spark Tutorial: Data Transformations on RDDs: map and intersect keywords

In our last tutorial section, we looked at further data transformations with Spark. In this tutorial, we will continue with data transformations on Spark RDDs before we move on to Actions. Today, we will focus on the map and intersect keywords and apply them to Spark RDDs.

Intersect

An intersection between two datasets returns only the values that – well – intersect. This means a value is returned only if it appears in both datasets. Each matching element is returned exactly once, even if it occurs multiple times across the two datasets. Intersection is used like this:

ds_one = sc.parallelize([("Mark", 1984), ("Lisa", 1985)])
ds_two = sc.parallelize([("Mark", 1984), ("Anastasia", 2017)])
sorted(ds_one.intersection(ds_two).collect())

The output should look like this now:

[('Mark', 1984)]

Please note that (“Mark”, 1984) is only returned by the intersection because the entire tuple matches. In our case, this means both the name and the year.
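If you only want to match on part of the tuple – say, the name – one option is to map both RDDs down to that field before intersecting. A minimal sketch (the choice of field is just for illustration):

ds_one.map(lambda x: x[0]).intersection(ds_two.map(lambda x: x[0])).collect()

This would return just ['Mark'].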

Map

Map returns a new RDD that is created by applying a function to each element of the original RDD. This is very useful if you want to inspect or modify data, and you can also derive new fields this way. The following example uses a slightly more complex transformation, where we create our own function to calculate the age and guess the gender of each person:

import datetime
def doTheTrans(values):
    # age: current year minus the birth year stored at index 1 of the tuple
    age = datetime.datetime.now().year - values[1]
    # very rough heuristic: names ending in "a" are treated as female
    gender = ""
    if values[0].endswith("a"):
        gender = "f"
    else:
        gender = "m"
    return [values[0], age, gender]
ds_one = sc.parallelize([("Mark", 1984), ("Lisa", 1985)])
sorted(ds_one.map(doTheTrans).collect())

What happens in the code?

First, we import datetime. We need this to get the current year (we could also hardcode “2019”, but then I would have to re-write the entire tutorial in January!). Next, we create our function, which we call “doTheTrans”. It receives the tuples from the map function (which we will use later). Note that tuples are immutable, so we need to create new variables and return a new list.

First, we calculate the age. This is easily done with “datetime.datetime.now().year - values[1]”: the person's birth year is stored at the second position (index 1) of the tuple. Then, we check whether the name ends with “a” and classify the person as female or male. Please note that this heuristic isn't reliable in general; we only use it for our sample.

Once we are done with this, we can call the “map” function with the function we just created. The map function then applies our new function to each item in the RDD.

Now, the output of this should look like the following:

[['Lisa', 34, 'f'], ['Mark', 35, 'm']]

Data Transformations on Spark with a Lambda Function

It is also possible to write the function inline as a lambda if your transformation isn't that complex. Suppose we only want to return the age of a person; here, the code would look like the following:

sorted(ds_one.map(lambda x: (x[0], datetime.datetime.now().year - x[1])).collect())

And the output should be this:

[('Lisa', 34), ('Mark', 35)]

Now, after three intense tutorials on data transformations on Spark RDDs, we will focus on Actions in Spark in the next tutorial.

If you enjoyed this tutorial, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page is also a great place to continue your learning journey.

Python for Spark Tutorial – Python Async and await functionality

In the last tutorials, we had a look at methods, classes and decorators. Now, let's have a brief look at asynchronous operations in Python. Most of the time, this is abstracted away for us by Spark, but it is nevertheless useful to have a basic understanding of it. In this tutorial, we will look at Python's async and await functionality.

Python Async and await functionality

Basically, you define a method to be asynchronous by simply adding the “async” keyword ahead of the method definition. It is written like this:

async def FUNCTION_NAME():

FUNCTION-BLOCK

Another keyword in this context is “await”. Basically, every function that does something asynchronous is awaitable. When you add “await”, nothing else happens until the asynchronous function has finished. This means that you might lose the benefit of asynchronous execution, but you get simpler control flow, e.g. when working with web data. In the following code, we create an async function that sleeps for a random number of seconds (between 1 and 10). We call the function twice with the “await” operator.

import asyncio
import random
async def func():
    tim = random.randint(1, 10)
    await asyncio.sleep(tim)  # non-blocking sleep
    print(f"Function finished after {tim} seconds")
# top-level await works in a Jupyter/IPython notebook;
# in a plain script you would wrap the calls with asyncio.run()
await func()
await func()

In the output, you can see that the first call finished before the second one was even started. Everything was executed sequentially, not concurrently.

Function finished after 9 seconds
Function finished after 9 seconds

Python can also run such functions concurrently. This is done via tasks. We use the method “create_task” from the asyncio library to schedule a function for concurrent execution. To see how this works, we invoke the function several times and add a print statement at the end of the code.

Parallel execution in Python async

# create_task schedules the coroutine and returns immediately
for _ in range(12):
    asyncio.create_task(func())
print("doing something else ...")

This now looks very different from the previous sample. The print statement is the first to show up, and all tasks finish after 10 seconds at most. This is because (a) “create_task” only schedules the coroutines and returns immediately, so the print statement executes right away, and (b) all tasks then run concurrently, and the maximum sleep interval is 10 seconds.

doing something else ...
Function finished after 1 seconds
Function finished after 1 seconds
Function finished after 3 seconds
Function finished after 4 seconds
Function finished after 5 seconds
Function finished after 7 seconds
Function finished after 7 seconds
Function finished after 7 seconds
Function finished after 8 seconds
Function finished after 10 seconds
Function finished after 10 seconds
Function finished after 10 seconds
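If you want to schedule several coroutines and explicitly wait until all of them have finished, asyncio also provides “gather”. A minimal sketch (again relying on the notebook's top-level await):

# run three invocations concurrently and wait for all of them
await asyncio.gather(func(), func(), func())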

However, there are also some issues with async operations. You can never say how long a task will take to execute. It could finish fast, or it could take forever due to a weak network connection or an overloaded server. Therefore, you might want to specify a timeout: the maximum time an operation should be waited for. In Python, this is done via the “wait_for” method. It takes the function to execute and the timeout in seconds. If the call runs into a timeout, a “TimeoutError” is raised, which allows us to surround it with a try-block.

Dealing with TimeoutError in Python

try:
    await asyncio.wait_for(func(), timeout=3.0)
except asyncio.TimeoutError:
    print("Timeout occured")

Since the random sleep is between 1 and 10 seconds and the timeout is 3 seconds, our function will run into a timeout in most cases. The output should then be this:

Timeout occurred

Each task that is executed can also be controlled. Whenever you call the “create_task” function, it returns a Task object. A task can either be done, be cancelled or contain an error. In the next sample, we create a new task and wait for its completion. We then check whether the task was done or cancelled. You could also check for an error and retrieve the error message from it.

Create_Task in Python

task = asyncio.create_task(func())
print("running task")
await task
if task.done():
    print("Task was done")
elif task.cancelled():
    print("Task was cancelled")

In our case, no error should have occurred and thus the output should be the following:

running task
Function finished after 8 seconds
Task was done
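As mentioned above, a task can also contain an error. A hedged sketch of how you could retrieve it – assuming the awaited coroutine may raise:

task = asyncio.create_task(func())
try:
    await task
except Exception:
    pass  # the exception stays stored on the task object
if task.exception() is not None:
    print(task.exception())  # retrieve the stored error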

Now we know how to work with async operations in Python. In our next tutorial, we will have a deeper look into how to work with Strings.

If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.

Data DevOps: How to do BizDevOps for Analytics and Data Science

Agility is almost everywhere, and it is also making its way into other hyped domains – such as Data Science. One thing I like in this respect is the combination with DevOps, as this eases the process and creates end-to-end responsibility. However, I strongly believe that it doesn't make much sense to exclude the business. In the case of Analytics, I would argue that it is BizDevOps.

There is a huge demand for Data DevOps nowadays. Data Science needs a lot of business integration and works across different domains and functions. I have outlined several times, in different posts here, that Data Science isn't a job that is done by Data Scientists alone. It is more of a team effort and thus needs different people. The concept of BizDevOps makes this easy to explain; let's have a look at the following picture, and afterwards I will outline the interdependencies in it.

The process for Data Science: BizDevOps is the answer

BizDevOps for Data Science

Basically, there must be exactly one person who takes end-to-end responsibility – ranging from business alignment to translation into an algorithm and, finally, to making it productive and operating it. This is the typical workflow for BizDevOps. The person taking this end-to-end responsibility is typically a project or program manager working in the data domain. The three steps are outlined in the figure above; let's now have a look at each of them.

Data DevOps: Biz

The program manager for Data (you could also call this person the “Analytics Translator”) works closely with the business – be it marketing, fraud, risk, the shop floor, … – to capture their business requirements and needs. This person also has a great understanding of what is feasible with the internal data, in order to be capable of “translating a business problem into an algorithm”. At this stage, it is mainly about the use case and not so much about tools and technologies; those come in the next step. Until here, Data Scientists aren't necessarily involved yet.

Data DevOps: Dev

In this phase, it is all about implementing the algorithm and working with the data. The program manager mentioned above has already aligned with the business and written a detailed description. Data Scientists and Data Engineers are now brought in. Data Engineers start to prepare and fetch the data, and they work with the Data Scientists on finding and retrieving the answer to the business question. There are several iterations and feedback loops back to the business as more and more answers arrive. Still, this process should only take a few weeks – ideally 3 to 6. Once the results are satisfying, it moves on to the next phase: bringing it into operation.

Data DevOps: Ops

This phase is about operating the algorithms that were developed. The data engineer is in charge of integrating them into the live systems. Typically, the business unit wants to see the result as a (continuously) calculated KPI or as some other action that creates impact. Continuous improvement of the models also happens here, since the business might come up with new ideas. In this phase, the data scientist isn't involved anymore; it is the data engineer or a dedicated DevOps engineer working alongside the program manager.

Eventually, once the project is done (I dislike “done” because in my opinion a project is never done), this entire process moves into a CI process.

This post is part of the “Big Data for Business” tutorial. Our focus was on Data DevOps and BizDevOps. In this tutorial, I explain various aspects of handling data right within a company. I also recommend reading about the concept of DevOps.

Python for Spark Tutorial – Python decorator – Part 2

Decorators are powerful things in most programming languages. They help us make code more readable and add functionality to a method or class. Decorators are added above the method or class declaration in order to create some behaviour. Basically, we differentiate between two kinds of decorators: method decorators and class decorators. In this tutorial, we will look at classes and how a Python decorator works on them.

The Python decorator for a class

Class decorators are used to add behaviour to a class. Normally, you would use this when you want to add behaviour that lies outside of the class's inheritance structure – e.g. something that is too abstract to put into the inheritance hierarchy itself.

The definition of that is very similar to the method decorators:

@DECORATORNAME
class CLASSNAME():
CLASS-BLOCK

The decorator definition is similar to the last tutorial's sample, except that the wrapped object is a class. We create a function that takes a class, and inside it we define the new method that we want to “append” to the class. We call this method “fly”; it simply prints “Now flying …” to the console. To add the function to the class, we call Python's “setattr” function and then return the modified class. (Since the decorator itself takes no parameters, we don't need an extra wrapper level here.)

def altitude(cls):
    def fly(self):
        print("Now flying ... ")
    setattr(cls, "fly", fly)  # attach the new method to the class
    return cls                # return the modified class itself

How to use the Python decorator

Now, our decorator is ready to be used. First, we need a class. We re-use the vehicle sample, but simplify it a bit. We create a class “Vehicle” that has a function “accelerate”, and two subclasses “Car” and “Plane” that both inherit from “Vehicle”. The only difference is that we add our decorator to the class “Plane”: we want to give the Plane the ability to fly.

class Vehicle:
    speed = 0
    def accelerate(self, speed):
        self.speed = speed
class Car(Vehicle):
    pass
@altitude
class Plane(Vehicle):
    pass

Now, we want to test our output:

c = Car()
p = Plane()
c.accelerate(100)
print(c.speed)
p.fly()

Output:

100
Now flying ... 

Basically, there are a lot of scenarios in which you would use class decorators. For instance, you can add functionality to classes that contain data in order to render them as a more readable table or the like.
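To illustrate that idea, here is a hedged sketch of such a decorator – the name “printable” and the Person class are made up for this example – that adds a readable string representation to a class holding data:

def printable(cls):
    def __str__(self):
        # render all instance attributes as "key=value" pairs
        return ", ".join(f"{k}={v}" for k, v in vars(self).items())
    setattr(cls, "__str__", __str__)
    return cls

@printable
class Person:
    def __init__(self, name, year):
        self.name = name
        self.year = year

print(Person("Lisa", 1985))

Output:

name=Lisa, year=1985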

In our next tutorial, we will look at the await-operator.

If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.

Apache Spark Tutorial: Data Transformations on RDDs – Part 2

In our last tutorial section, we looked at filtering, joining and sorting data. Today, we will look at more Spark data transformations on RDDs. Our focus topics for today are distinct, groupBy and union in Spark. First, we will use the “distinct” transformation.

Distinct

Distinct returns each item exactly once. For instance, if the same value appears several times in a dataset, we can reduce it to a single occurrence. A sample would be the following array:

[1, 2, 3, 4, 1]

However, we only want to return each number exactly once. This is done via the distinct keyword. The following example illustrates this:

ds_distinct = sc.parallelize([1, 2, 3, 4, 1]).distinct().collect()
ds_distinct
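The result should contain each number exactly once. Note that the ordering of a distinct result isn't guaranteed, so you might see something like:

[1, 2, 3, 4]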

GroupBy

A very important task when working with data is grouping. In Spark, we have the GroupBy transformation for this; in our case, it is “groupByKey”. This groups the dataset by key; how the grouped values are materialised is then specified by calling the “mapValues” function. With this function, you define how you want to deal with the grouped values. Some options are built in, such as “len” for the number of occurrences or “list” for the actual values. The following sample illustrates this with the dataset introduced in the previous tutorial:

ds_set = sc.parallelize([("Mark", 1984), ("Lisa", 1985), ("Mark", 2015)])
ds_grp = ds_set.groupByKey().mapValues(list).collect()
ds_grp

If you want to have a count instead, simply use “len” for it:

ds_set.groupByKey().mapValues(len).collect()

The outputs should look like this (the ordering of the results may vary) – first the grouped lists, then the counts:

[('Mark', [1984, 2015]), ('Lisa', [1985])]

[('Mark', 2), ('Lisa', 1)]
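As an aside, if all you need is the count per key, the “countByKey” action is a lighter alternative, since it doesn't materialise the grouped value lists first. In PySpark it returns a dictionary rather than an RDD:

ds_set.countByKey()  # -> defaultdict(<class 'int'>, {'Mark': 2, 'Lisa': 1})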

Union

A union joins two datasets together into one. In contrast to the “join” transformation that we looked at in our last tutorial, it doesn't take any keys and simply appends the datasets. The syntax for it is straightforward: it is written “dsone.union(dstwo)”. Let's have a look at it:

ds_one = sc.parallelize([("Mark", 1984), ("Lisa", 1985)])
ds_two = sc.parallelize([("Luke", 2015), ("Anastasia", 2017)])
sorted(ds_one.union(ds_two).collect())

Now, the output of this should look like the following:

[('Anastasia', 2017), ('Lisa', 1985), ('Luke', 2015), ('Mark', 1984)]
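Note that an RDD union keeps duplicates: if the same tuple appears in both datasets, it will show up twice in the result. If you want set semantics, you can chain “distinct()” after the union, e.g. sorted(ds_one.union(ds_two).distinct().collect()).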

Today, we learned a lot about Spark data transformations. In our next tutorial – part 3 – we will have a look at even more of them.

If you enjoyed this tutorial, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page is also a great place to continue your learning journey.

Python for Spark Tutorial – Python decorator for methods

Decorators are powerful things in most programming languages. They help us make code more readable and add functionality to a method or class. Decorators are added above the method or class declaration in order to create some behaviour. Basically, we differentiate between two kinds of decorators: method decorators and class decorators. In this tutorial, we will have a look at Python decorators for methods.

Python decorators for Methods

Method decorators are used to perform some kind of behaviour around a method. For instance, you could add a stopwatch to check performance, configure logging, or run some checks on the method itself. All of that is done by “wrapping” the method in a decorator method: the “decorated” method is executed inside the decorator method. This, for instance, would allow us to surround a method with a try-except block and thus route all exceptions that occur in a method to a global error handling tool.
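To make that last idea concrete, here is a hedged sketch of such a try-except decorator – the name “logged” and the use of the logging module are just for illustration:

import logging
def logged(func):
    def inner(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # log the full traceback, then re-raise for the caller
            logging.exception("Error in %s", func.__name__)
            raise
    return inner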

The definition of that is very easy:

@DECORATORNAME
def METHODNAME():
METHOD-BLOCK

Basically, the only thing you need is the “@” and the decorator name. There are several decorators available out of the box, but now we will create our own. We start with a performance counter, whose goal is to measure how long a method takes to execute. We create this decorator from scratch.

As stated, the decorator takes the function and executes it inside the decorator function. We start by defining our performance counter as a function that takes one argument – the function to wrap. Within this function, we add another function (yes, we can do this in Python – creating inline functions!) – typically called either “wrapper” or “inner”; I call it “inner”. The inner function should be able to pass on arguments; a function call can have 0 to n arguments. To handle this, we declare “*args” and “**kwargs”. Both mean that a variable number of arguments is accepted; the only difference is that kwargs are named arguments (e.g. person="Pete").

The inner function of a Python decorator

In this inner function, we create the start variable holding the time at which the performance counting starts. After that, we call the wrapped function (whatever function we decorate), passing on all the *args and **kwargs. Then we measure the time again and do the math. Simple, isn't it? However, we haven't decorated anything yet. We do that now by creating a function that sleeps and prints text afterwards. The code for this is shown below.

import time
def perfcounter(func):
    def inner(*args, **kwargs):
        start = time.perf_counter()
        func(*args, **kwargs) # this is the invocation of the wrapped function!
        print(time.perf_counter() - start)
    return inner
@perfcounter
def printText(text):
    time.sleep(0.3)
    print(text)
printText("Hello Decorator")

Output:

Hello Decorator
0.3019062000021222

As you can see, we are now capable of adding this perfcounter decorator to any function we like. Normally, it makes sense to add it to functions that take rather long – e.g. Spark jobs or web requests. In the next sample, I create a type-checker decorator. This type checker should validate that all parameters passed to a function are of a specific type: e.g. we want to ensure that all parameters passed to a multiplication function are of type integer, and all parameters passed to a string function are of type string.

Basically, you could also do this check inline, but it is much easier to write the function once and simply apply it to other functions as a decorator. It also greatly decreases the number of code lines and thus increases the readability of your code. The decorator usage should look like the following:

@typechecker(int)

For integer values and

@typechecker(str)

for string values.

The only difference now is that the decorator itself takes parameters as well, so we need to wrap the function into another function – compared to the previous sample, another level is added.

What steps are necessary?

  1. Create the method to get the parameter: def typechecker(type)
  2. Code the outer function that takes the function and holds the inner function
  3. Create the function block that holds the inner function and a type checker:
    1. We add a function called “isInt(arg)” that uses “isinstance” to check whether the argument passed is of the expected type – e.g. int or str. If it isn't, we raise an error
    2. We add the inner function with args and kwargs. In this function, we iterate over all args and kwargs passed and check it against the above function (isInt). If all checks succeed, we invoke the wrapped function.
Sounds a bit complex? Don't worry, it isn't that complex at all. Let's have a look at the code:

def typechecker(type):
    def check(func):
        def isInt(arg):
            if not isinstance(arg, type):
                raise TypeError("Only full numbers permitted. Please check")
        def inner(*args, **kwargs):
            for arg in args:
                isInt(arg)
            for kwarg in kwargs.values(): # check the argument values, not the key names
                isInt(kwarg)
            return func(*args, **kwargs)
        return inner
    return check

Now that we are done with the decorator itself, let's decorate some functions. We create two: the first multiplies all values passed to it (the number of values can vary), and the second concatenates all strings passed to it. We decorate the two functions with the typechecker decorator defined above.

@typechecker(int)
def mulall(*args):
    res = 1  # start with the multiplicative identity so an argument of 0 is handled correctly
    for arg in args:
        res *= arg
    return res
@typechecker(str)
def concat(*args):
    res = ""
    for arg in args:
        res += arg
    return res

The benefit of the Python decorator on methods

I guess you can now see the benefit of decorators: we can influence the behaviour of a function and create re-usable code snippets. Now, let's call the functions to see whether our decorator works as expected. Note: the third invocation should produce an error 🙂

print(mulall(1,2,3))
print(concat("a", "b", "c"))
print(mulall(1,2,"a"))

Output:

6
abc

… and the error message:

TypeErrorTraceback (most recent call last)
<ipython-input-6-cd2213a0d884> in <module>
     35 print(mulall(1,2,3))
     36 print(concat("a", "b", "c"))
---> 37 print(mulall(1,2,"a"))
<ipython-input-6-cd2213a0d884> in inner(*args, **kwargs)
      7         def inner(*args, **kwargs):
      8             for arg in args:
----> 9                 isInt(arg)
     10
     11             for kwarg in kwargs:
<ipython-input-6-cd2213a0d884> in isInt(arg)
      3         def isInt(arg):
      4             if not isinstance(arg, type):
----> 5                 raise TypeError("Only full numbers permitted. Please check")
      6
      7         def inner(*args, **kwargs):
TypeError: Only full numbers permitted. Please check

I hope you like Python decorators. In my opinion, they are very helpful and provide great value. In the next tutorial, I will show how class decorators work.

If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.


The 3 degrees of Data Access for data scientists and users

To get the most out of your data strategy in an enterprise, it is necessary to cluster the different user types that arise. All of them are users of data, but with different needs and demands, and they come with different levels of expertise. Basically, I see three different user types for data access within a company.

Three degrees of Data Access

Basically, the user types differ in how they use data and in how many of them there are. Let's start with the lower part of the pyramid – the business users.

Business Users

The first layer are the business users. These are basically people who need data for their daily decisions but are mainly consumers of the data. They look at different reports to make decisions on their business topics. They could sit in Marketing, Sales or Technology – depending on the company itself. These users would typically use pre-defined reports, but in the long run would rather go for customized reports; self-service BI is a great fit here. These users are experienced in interpreting data for their business goals and in asking questions of their data – reviewing the performance of a campaign, weekly or monthly sales reports, and so on. They create a huge load on the underlying systems without understanding the implementation and complexity underneath – and they don't have to. From time to time, they start digging deeper into their data and thus become power users – our next level.

Power Users

Power users often emerge from business users. A power user is typically a person who is close to the business and understands its needs and processes. However, they also have a good technical understanding (or gained it in the process of becoming a power user). They have some level of SQL know-how or know the basics of other scripting tools. They often work with the business users (sometimes in the same department) on solving business questions, and they work closely with data engineers on accessing and integrating new data sources. They also use self-service analytics tools to do a basic level of data science. However, they aren't data scientists, even though they might move in this direction if they invest significant time. This brings us to the next level – the data scientists.

Data access for Data Scientists

This is the top level of our pyramid. People working as data scientists aren't the majority – there are far more business users and power users. However, they work on more challenging topics than the previous two and work closely with power users and business users. They might still be in the same department, but not necessarily. They use advanced tools such as R and Python, fine-tune the models the power users built with self-service analytics tools, and translate the business questions raised by the business users into algorithms.

Often, these three groups develop in different directions; however, it is necessary that all of them work together – as a team – to make data projects a success. When granting data access, it is also necessary to incorporate role-based access controls.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.

Python for Spark Tutorial – Python exception handling – how to deal with errors

Now that we have substantial knowledge of Python, let's look at one very important aspect of software: error handling. Like most other languages, Python provides error handling via exceptions. Let's have a look at how Python exception handling works.

Python exception handling

Basically, every piece of error handling starts with a try-statement. This marks a block where an error might occur. If an error occurs in the try-block, an exception is raised in Python, and the interpreter looks for a surrounding exception handler. It is best to handle exceptions as specifically as possible, since this prevents errors later in the program. Python has a huge list of pre-defined exceptions, so it is easy to handle them without the need for custom exceptions. A try-block may also have a finally-block. This is useful if you have worked with files or opened connections: in the finally-block, you can close them. Note that the finally-block is executed every time, regardless of whether an error occurred. The syntax for exceptions is this:

try:
TRY-BLOCK
except ERRORNAME:
ERROR-BLOCK
finally:
FINAL-BLOCK

If you look for exceptions in the Python documentation, you have to look them up with the “Error” suffix – Python doesn't call them “Exceptions”, even though the base class is called Exception. In the following sample, we will create a division-by-zero error. We define a method “divide” that takes two parameters, surround the division with a try-statement and check for the ZeroDivisionError. Note that the finally-block is executed on every call to the method.

def divide(val1, val2):
    try:
        return val1 / val2
    except ZeroDivisionError:
        print("Division by zero - return 0 instead")
        return 0
    finally:
        print("Cleanup everything")
res = divide(2, 3)
res2 = divide(2, 0)
print("The result for res is: " + str(res) + " and for res2 it is: " + str(res2))

Output:

Cleanup everything
Division by zero - return 0 instead
Cleanup everything
The result for res is: 0.6666666666666666 and for res2 it is: 0
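As mentioned above, it pays off to handle exceptions as specifically as possible. A minimal sketch of what that looks like with several except-blocks (the values here are made up):

try:
    value = int("abc")  # raises ValueError
except ValueError:
    print("Not a number")  # the specific handler wins
except Exception:
    print("Some other error")  # fallback for everything else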

So, this was easy, wasn’t it? Now, let’s have a look at how to raise your own exceptions.

Raise exception in Python

Basically, an exception can be “thrown” with the “raise” statement. Python is much more modest about it: C-like languages throw exceptions at you, whereas Python just kindly raises one ;). The statement to raise an exception is written like this:

raise ERRORNAME

In our next sample, we want to raise an error for a car that drives too fast. Therefore, we first need to create our own exception. All exceptions inherit from “Exception”, so we create a class that inherits from it. We call the error “TooFastError”. We add no further functionality and just write “pass”, which tells Python that the class body is intentionally empty. We then define a function “accelerate” that takes exactly one parameter – speed. If speed is higher than 100, we raise our TooFastError. Let's try it:

class TooFastError(Exception):
    pass
def accelerate(speed):
    if speed > 100:
        raise TooFastError("You can't drive at " + str(speed) + " the overall speed limit is 100!")
    else:
        print("Ok, let's go!")
accelerate(20)
accelerate(110)

Output:

Ok, let's go!
TooFastErrorTraceback (most recent call last)
<ipython-input-8-9be2666bf1f4> in <module>
      9
     10 accelerate(20)
---> 11 accelerate(110)
<ipython-input-8-9be2666bf1f4> in accelerate(speed)
      4 def accelerate(speed):
      5     if speed > 100:
----> 6         raise TooFastError("You can't drive at " + str(speed) + " the overall speed limit is 100!")
      7     else:
      8         print("Ok, let's go!")
TooFastError: You can't drive at 110 the overall speed limit is 100!

Isn't it beautiful to raise your own exceptions :)? In our next tutorial, we will have a look at decorators in Python and at the dataclass.

If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.

Python for Spark Tutorial – Working with Class and Inheritance in Python

In the last tutorial, we learned about methods in Python. To further increase the re-usability of your code, you can also use a class in Python. With classes, you can encapsulate your methods for better re-usability.

The Class in Python

With a class, we can group several methods together. Imagine you write an app for selling different kinds of vehicles – a car, a bike or a motorcycle. Several properties of your vehicles are always the same – e.g. all of them have wheels. With a class, you can capture this common behaviour and create re-usability across all of them. A class in Python is very similar to classes in C-like languages, with several differences; the most striking one is that there are no public/protected/private modifiers – neither for the class itself nor for its methods and variables. A class is defined with the “class” keyword. The syntax looks like this:

class CLASSNAME:

SOME_VARIABLES

def __init__(self):
INIT-BLOCK

def METHODS:
METHODS-BLOCK

As you can see, we have several possibilities with classes. First of all, defining a class is easy. You would then specify some variables within the class. The constructor – familiar from C-like languages – is written as “def __init__(self):”. You can specify further parameters in the brackets; the one mandatory parameter is “self”. It refers to the instance itself and makes all objects in the class accessible. You will also need it in all further method definitions.

Let’s go back to the previous sample with the vehicles. In order to achieve that, we create a class “Vehicle” and add some variables and methods to it. The Class should look like the following. Create a new Notebook and name this notebook “vehicle”.

class Vehicle:
    vname = "Not Defined"
    curspeed = 0
    def name(self):
        print(self.vname)
    def accelerate(self, speed):
        self.curspeed += speed

Now comes the tricky part: since we are working in a notebook application and not in a full IDE, we first need to download the file as “Python” and then re-upload it. If you just renamed the notebook, you would end up with a JSON document that can't be imported. We re-upload the file and name it “vehicle.py”, making sure it is in the same folder as your other notebook.
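As an aside, if you are working in Jupyter/IPython, the “%%writefile” cell magic can save you the download-and-re-upload round trip: it writes the contents of a cell directly to a file. A sketch with the class from above:

%%writefile vehicle.py
class Vehicle:
    vname = "Not Defined"
    curspeed = 0
    def name(self):
        print(self.vname)
    def accelerate(self, speed):
        self.curspeed += speed

Either way, we then switch back to the original notebook and use the just-created vehicle: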

from vehicle import Vehicle
v = Vehicle()
v.name()

Output:

Not Defined

Note that the output is as expected: we explicitly named the vehicle “Not Defined”, so this isn't an error. I guess you can imagine what I want to do with the vehicle – use inheritance in the next sample 🙂

Inheritance in Python

Often, it is necessary to inherit from a class. This is helpful when a subclass is a specialisation of another class and thus shares some common functions or behaviour; it lets you extend behaviour without re-writing code. To enable inheritance, we add brackets with the name of the parent class to the new class definition. It is also important to import the parent class. Everything else stays the same.

In the following sample, we create a new class and name it “car”. Basically, we extend the behaviour of our vehicle and add a “start” method to it. This will print a statement that the car has started with a specific number of HP. Create a new file in Jupyter and name it “car”:

from vehicle import Vehicle
class Car(Vehicle):
    def __init__(self, brand, hp):
        self.hp = hp
        self.vname = brand  # assign to self - a bare "vname = brand" would only create a local variable
    def start(self):
        print("Engine of " + self.vname + " started @ " + str(self.hp))

Do the same steps as in the “vehicle” file and go back to your initial Notebook. We instantiate a new Car and give it a name and some HP:

from car import Car
c = Car("Mercedes", 150)
print(c.hp)
c.name()

Output:

150
Mercedes

The attribute “hp” belongs to the “Car” class, whereas the method “name” comes from the “Vehicle” class. Let's now use the “start” function:

c.start()

Output:

Engine of Mercedes started @ 150

Here, too, we use a method of our “Car” class in Python.
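Since Car inherits from Vehicle, every Car instance is also a Vehicle – you can verify this with the built-in “isinstance” function:

print(isinstance(c, Car))      # True
print(isinstance(c, Vehicle))  # True - Car inherits from Vehicle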

Since the code has now become more complex, it is necessary to think about how to deal with errors – in our next tutorial, we will look at error handling.

If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.