In the last couple of posts, we’ve learned about various aspects of Machine Learning. Now, we will focus on another aspect of Machine Learning: Deep Learning. After introducing the key concepts of Deep Learning in the previous post, we will have a look at two concepts: the Convolutional Neural Network (CNN) and the Feedforward Neural Network.

The Feedforward Neural Network

Feedforward neural networks are the most general-purpose type of neural network. A feedforward network consists of an input layer, which is the entry point, several hidden layers and an output layer. Each layer is connected to the previous layer. The connections are one-way only, so the nodes can’t form a cycle. Information in a feedforward network moves in one direction only – from the input layer, through the hidden layers, to the output layer. It is the simplest version of a Neural Network. The image below illustrates the Feedforward Neural Network.

Feedforward Neural Network
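
To make the one-way flow more concrete, here is a minimal sketch of a forward pass in plain Python. The layer sizes, the weights and the sigmoid activation are assumptions chosen for illustration only; a real network would learn its weights during training.

```python
import math

def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum of all inputs plus a bias per neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def sigmoid(values):
    """Squash every value into the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def forward(inputs, layers):
    """Pass the inputs through every layer in order; nothing flows backwards."""
    activation = inputs
    for weights, biases in layers:
        activation = sigmoid(dense(activation, weights, biases))
    return activation

# Hypothetical weights: 2 inputs -> 3 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]

output = forward([1.0, 2.0], layers)
print(output)
```

Each layer only reads the result of the layer before it, which is exactly why no cycle can form.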

Convolutional Neural Networks (CNN)

The Convolutional Neural Network is very effective in image recognition and similar tasks. For that reason, it is also well suited for video processing. The difference to the feedforward neural network is that the neurons of a CNN are arranged in three dimensions: width, height and depth. In addition, not all neurons in one layer are connected to all neurons in the next layer. A Convolutional Neural Network has three types of layers that set it apart from a feedforward neural network:

Convolution Layer

Convolution puts the input image through several convolutional filters. Each filter activates certain features, such as edges, colors or objects. The activations of each filter form a feature map. The deeper the network goes, the more sophisticated those filters become. The convolutional layer automatically learns which features are most important to extract for a specific task.
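
As a rough illustration of how a single filter produces a feature map, the following sketch slides a small filter over a tiny grayscale image. The image and the hand-written vertical-edge filter are made up for this example; in a real CNN the filter values are learned.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products.

    Like most deep learning libraries, this does not flip the kernel.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        feature_map.append(row)
    return feature_map

# Hypothetical 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-written filter that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map is only non-zero in the column where the image changes from dark to bright, which is the feature this filter activates on.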

Rectified linear units (ReLU)

The goal of this layer is to make training faster and more effective. It does so by setting all negative values to zero and passing positive values through unchanged.
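
The ReLU operation itself is very small. A minimal sketch, applied to a made-up feature map:

```python
def relu(feature_map):
    """Replace every negative value with zero and keep the rest as it is."""
    return [[max(0, value) for value in row] for row in feature_map]

activated = relu([[-3, 2], [0, -1]])
print(activated)  # [[0, 2], [0, 0]]
```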

Pooling/Subsampling

Pooling simplifies the output by performing nonlinear downsampling. This reduces the number of parameters that the network needs to learn. In convolutional neural networks, the operation is useful since neighboring values in a feature map usually carry similar information.
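
One common form of pooling is max pooling, which keeps only the largest value of each window. The sketch below assumes non-overlapping 2x2 windows and a made-up feature map; other window sizes and pooling functions exist as well.

```python
def max_pool(feature_map, size=2):
    """Keep the largest value of each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled

# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

pooled = max_pool(feature_map)
print(pooled)  # [[4, 2], [2, 8]]
```

The 4x4 input shrinks to 2x2, so the following layers have to deal with only a quarter of the values.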

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials. If you are looking for great datasets to play with, I would recommend Kaggle.

AI and Ethics is a complex and often discussed topic at conferences, user groups and forums. It even got picked up by the European Commission. I would argue that it should go one step further: it should be part of every corporate responsibility strategy – just like social and environmental elements.

AI Ethics: what is it about?

Since I am heading the Data Strategy at a large enterprise, I am not only confronted with technical and use-case challenges, but also with legal and compliance topics around data. This might sound challenging and “boring”, but it is neither of them. Technical challenges are often more complex than the legal aspects of data. Many companies state that legal is blocking their data initiatives, but often they simply didn’t include legal and privacy in their strategy. So what should you consider when talking about AI Ethics? Basically, it consists of three building blocks.

Robust

The first building block of ethics is the robustness of data. This is mainly a technical challenge, but it needs to be done right in all senses. It requires platforms that are resilient against errors and vulnerabilities. It is all about access control, access logging and prevention. Data systems should track who accessed data and prevent unauthorized access. They should also implement the “need to know” principle: within a large enterprise, people should only access data that is relevant to their job. After a project is finished, access should be revoked.
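
A minimal sketch of these ideas (explicit grants, logging of every access attempt and revoking access afterwards) could look as follows. The class, user and dataset names are hypothetical and not taken from any real platform; a production system would rely on the access control and audit features of its data platform instead.

```python
from datetime import datetime, timezone

class DataAccessRegistry:
    """Hypothetical registry that grants, logs and revokes access to datasets."""

    def __init__(self):
        self._grants = set()   # (user, dataset) pairs that are currently allowed
        self.audit_log = []    # one entry per access attempt, allowed or not

    def grant(self, user, dataset):
        self._grants.add((user, dataset))

    def revoke(self, user, dataset):
        self._grants.discard((user, dataset))

    def read(self, user, dataset):
        allowed = (user, dataset) in self._grants
        # Log every attempt, including the ones that get rejected.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "dataset": dataset,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{user} has no access to {dataset}")
        return f"contents of {dataset}"

registry = DataAccessRegistry()
registry.grant("analyst_a", "sales_2023")     # need to know: project starts
registry.read("analyst_a", "sales_2023")      # allowed and logged
registry.revoke("analyst_a", "sales_2023")    # project finished

try:
    registry.read("analyst_a", "sales_2023")  # rejected, but still logged
except PermissionError as error:
    print(error)
```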

Ethical

Ethics in AI is an important topic, and bias happens often. There are numerous examples of algorithms that exhibit bias. We are humans and are influenced by bias. Bias comes from how we grew up, the experiences we have had in life and, to a large extent, our environment. Bias is bad though, as it limits our thinking. In psychology, there is a term for how to overcome this: fast and slow thinking. Imagine you are the interviewer in a job interview. A candidate walks in and, because of some trait, she immediately reminds you of a person you met years ago and had difficulties with. During the job interview, you might not like her, even though she is a different person. Your brain went into fast thinking – input-output. This is built into our brains to protect us from danger, but it often drives bias. It helps us when driving a car, doing sports and the like. If you see an obstacle in your way while driving a car, you need to react fast. There is no time to think it over. However, when making decisions, you need to remove bias and think slow.

Slow thinking is challenging, and it requires you to fully overcome bias. If you let bias dominate you, you won’t be capable of making good decisions. Coming back to the interview example, you might reject the candidate because of your bias. A few months later, this person has found a job at your competitor and is building more advanced models than your company. You lost a great candidate because of your bias. This isn’t good, right?

There are other aspects to ethics, and I could probably write an entire series about them. You also need to consider other topics, such as harassment in algorithms. If your algorithms don’t take ethics into consideration, it isn’t just about acting wrong. You will also lose credibility with your customers and thus start to see a financial impact as well!

Legal

Last but not least, your data strategy should reflect all building blocks of the legal frameworks that apply to you. One example is the right to be forgotten, which needs to be implemented in your systems. In enterprise environments, this isn’t easy at all. There is a lot of legacy, and many different systems consume data. To tackle this from a technical perspective, it is necessary to harmonize your data models. Depending on your company ownership and structure, you need to implement GDPR and/or SOX. Some industries, such as the finance industry, come with even more regulations, giving you more challenges around data. It is very important to talk to your legal department and make them your friends at an early stage in order to succeed!

So what is next for AI Ethics?

I will stick with the statement I have made several times: work closely with Legal and Privacy in order to achieve a responsible strategy towards data and AI. A lot of people I know claim that AI Ethics blocks their data strategy, but I argue it is the other way around: just because you can do something with data doesn’t justify doing everything you potentially could. At the end of the day, you have customers who should be able to trust you. Don’t misuse this trust; build an ethical strategy on it. Work with the people who know it best – Privacy, Security and Legal. Then – and only then – you will succeed.

I also recommend reading my post about data access.

Credits: the three pillars weren’t invented by me, so I want to credit the people who gave me the ideas around them: our corporate lawyer Daniel, our Privacy Officer Paul and our Legal Counsel Doris.

During the past tutorials, we have acquired a lot of knowledge about Spark. Now, we have reached the last tutorial on Spark, where we will have a look at Cube and Rollup. Both are useful for aggregating multi-dimensional data for further processing.

Data for Spark Rollup and Cube functions

First, let’s create a dataset that we later want to work with. Our dataset is the monthly salary of people working in Finance or Sales:

# `spark` is the active SparkSession (predefined in notebooks and the pyspark shell)
employees = spark.createDataFrame(
    [
        ("Mario", 4400, "Sales"),
        ("Max", 3420, "Finance"),
        ("Sue", 5500, "Sales"),
        ("Tom", 6700, "Finance"),
    ],
    ("name", "salary", "department"),
)

We then use the first function – rollup. We want the rollup to be on the department and the name of the person.

(
    employees.rollup(employees.department, employees.name)
    .sum("salary")
    .withColumnRenamed("sum(salary)", "salary")
    .orderBy("department", "salary")
    .show()
)

Here you can see the output (I will discuss it after you have reviewed it):

+----------+-----+------+
|department| name|salary|
+----------+-----+------+
|      null| null| 20020|
|   Finance|  Max|  3420|
|   Finance|  Tom|  6700|
|   Finance| null| 10120|
|     Sales|Mario|  4400|
|     Sales|  Sue|  5500|
|     Sales| null|  9900|
+----------+-----+------+

The output contains several lines. Let’s look at it line by line:

  • The first line consists of two null values and the sum of all salaries. So, this represents the entire company. Basically, it fills department and name with null, since it is neither a department nor a specific person – it is all departments and all persons in them.
  • The second and third lines are Max and Tom, who work in the finance department.
  • The fourth line is the sum of the finance department; here you see “null” in the name, since it isn’t a name, but the entire department.
  • The same story continues in the following lines for the sales department.

So, basically, we get different things: (A) the sum of all salaries, (B) the individual values and (C) the salaries per department. Now, let’s build the cube:
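
To see where the subtotal rows come from, here is a small sketch in plain Python that mimics what the rollup computes for this dataset. It is not how Spark implements rollup internally; it only reproduces the three aggregation levels, using None where Spark shows null.

```python
from collections import defaultdict

employees = [
    ("Mario", 4400, "Sales"),
    ("Max", 3420, "Finance"),
    ("Sue", 5500, "Sales"),
    ("Tom", 6700, "Finance"),
]

def rollup_sum(rows):
    """Sum salaries per (department, name), per department and for all rows."""
    totals = defaultdict(int)
    for name, salary, department in rows:
        totals[(department, name)] += salary   # (B) individual values
        totals[(department, None)] += salary   # (C) subtotal per department
        totals[(None, None)] += salary         # (A) grand total
    return dict(totals)

result = rollup_sum(employees)
print(result[(None, None)])       # 20020
print(result[("Finance", None)])  # 10120
print(result[("Sales", None)])    # 9900
```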

(
    employees.cube(employees.department, employees.name)
    .sum("salary")
    .withColumnRenamed("sum(salary)", "salary")
    .orderBy("department", "salary")
    .show()
)

Here, the result has even more rows. First, we get the values for each person independent of the department (the department is null). Then we get the grand total, followed by the departments and the individuals in them, as before. For this calculation, the cube doesn’t add much value for us. The background is that a cube creates all possible combinations of the grouping columns, whereas the rollup only creates hierarchies. The cube therefore also aggregates by name alone, with the department set to null, which is why the individuals appear twice. Here is the output:

+----------+-----+------+
|department| name|salary|
+----------+-----+------+
|      null|  Max|  3420|
|      null|Mario|  4400|
|      null|  Sue|  5500|
|      null|  Tom|  6700|
|      null| null| 20020|
|   Finance|  Max|  3420|
|   Finance|  Tom|  6700|
|   Finance| null| 10120|
|     Sales|Mario|  4400|
|     Sales|  Sue|  5500|
|     Sales| null|  9900|
+----------+-----+------+
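
The difference between the two functions comes down to which combinations of the grouping columns get aggregated. The following sketch lists these combinations in plain Python; the helper names rollup_sets and cube_sets are made up for this illustration and are not part of the Spark API.

```python
from itertools import combinations

def rollup_sets(columns):
    """Hierarchy only: all columns, then drop columns one by one from the right."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]

def cube_sets(columns):
    """Every possible combination of the columns, including the empty one."""
    return [subset
            for n in range(len(columns), -1, -1)
            for subset in combinations(columns, n)]

columns = ["department", "name"]

print(rollup_sets(columns))
# [('department', 'name'), ('department',), ()]

print(cube_sets(columns))
# [('department', 'name'), ('department',), ('name',), ()]
```

The extra ('name',) combination is what produces the four rows with a null department in the cube output; the empty combination is the grand total in both cases.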

I hope you liked the tutorials on Spark. There is much more to learn – e.g. about machine learning or different libraries for that. Make sure to check out the tutorial section in order to figure that out.

If you enjoyed this tutorial on Spark Rollup and Cube, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. I have also created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial, so your learning journey can continue. For full details about Apache Spark, make sure to visit the official page.