Large enterprises have a lot of legacy systems in their footprint. This created a lot of challenges (but also opportunities!) for system integrators. Now, since companies strive to become data driven, it becomes an even bigger challenge. But luckily there is a new thing out there that can help: data abstraction.

Data Abstraction: why do you need it?

If a company wants to become data driven, it is necessary to unlock all the data that is available within a company. However, this is easier said than done. Most companies have a heterogenous IT landscape and thus struggle to integrate essential data sources into their analytical systems. In the past, there have been several approaches to it. Data was loaded (with some delay) to an analytical data warehouse. This data warehouse didn’t power any operational systems, so it was decoupled.

However, several things proved to be wrong with Data warehouses handling analytical workload: (a) data warehouses tend to be super-expensive in both cost and operations. It makes sense for KPIs and highly structured data, but not for other datasets. And (b) due to the high cost, data warehouses were loaded with some hours to even days of delay. In a real-time world, this isn’t good at all.

But – didn’t the datalake solve this already?

Some years ago, Data lakes surfaced. They were more efficient in terms of speed and cost than traditional data warehouses. However, data warehouses kept the master data, which data lakes often need. So a connection between the two needed to be established. In early days, data was simply replicated in order to do so. Next to datalakes, many other systems (NoSQL mainly) surfaced. Business units aquired different other systems, that made more integration efforts necessary. So, there was no end to data siles at all – it even got worse (and will continue to do so)

So, why not give in to the pressure of heterogenous systems and data stores and try to solve it differently? This is where data abstraction comes into play …

What is it about?

As already introduced, Data Abstraction should reduce your sleepless nights when it comes to accessing and unlocking your data assets. It is like a virtual layer that you add in between your data storages and your data consumers to enable one common access. The following illustration shows this:

Data Abstraction shows how to abstract you data
Data Abstraction

Basically, you build a layer on top of your data sources. Of course, it doesn’t solve the challenges around data integration, but it ensures that consumers can expect to have one common layer that they can plug into. Also, it enables you to exchange the technical layer of a data source without consumers taking note of it. You might consider to re-develop a data source from the ground up, in order to make it more performant. Both the old and the new stack will conform to the data abstraction and thus consumers won’t realize that there are significant changes under the hood.

This sounds really nice. So what’s the (technical) solution to it?

Basically, I don’t recommend any technology at this stage. There are several technologies that enable Data Abstraction. They can be clustered into 3 different areas:

  1. Lightweight SQL Engines: There are several products and tools (both Open Source and non-Open Source) available, which enable SQL access to different data sources. They not only plug into relational databases, but also into non-relational databases. Most tools provide easy integration and abstraction.
  2. API Integration: It is possible to integrate your data sources via an API layer that eventually abstracts the below data sources. The pain of integration is higher than with SQL Engines, but it gives you more flexbility on top and a higher degree of abstraction. In contrast to SQL engines, your consumers won’t plug too deep into database specifics. If you want to go for a really advanced tech stack, I recommend you reading about Graphs.
  3. Full-blown solution: There are several proprietary tools available, that provide numerous connectors to data sources. What is really great about these solutions is that they also include chaching mechanisms for frequent data access. You get much higher performance with limited implementation cost. However, you will lock into a specific soltuion.

Which solution you eventually consider to go for, is fully up to you. It depends on the company and its know-how and characterists. In most cases, it is also a combination of different solutions.

So what is next?

There are many tools and services out there which enable data abstraction. Data Abstraction is more of a concept than a concrete technology – not even an architectural pattern. In some cases, you might acquire a technology. Or you would abstract your data via an API or Graph. There are many technologies, tools and services out there to solve your issues.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. If you look for open data, I would recommend you browsing some open data catalogs like the open data catalog from the U.S. government.

In the last couple of posts, we’ve learned about various aspects of Machine Learning. Now, we will focus on other aspects of Machine Learning: Deep Learning. After introducing the key concepts of Deep Learning in the previous post, we will have a look at two concepts: the Convolutional Neural Network (CNN) and the Feedforward Neural Network

The Feedforward Neural Network

Feedforward neural networks are the most general-purpose neural network. The entry point is the input layer and it consists of several hidden layers and an output layer. Each layer has a connection to the previous layer. This is one-way only, so that nodes can’t for a cycle. The information in a feedforward network only moves into one direction – from the input layer, through the hidden layers to the output layer. It is the easiest version of a Neural Network. The below image illustrates the Feedforward Neural Network.

Feedforward Neural Network

Convolutional Neural Networks (CNN)

The Convolutional Neural Network is very effective in Image recognition and similar tasks. For that reason it is also good for Video processing. The difference to the Feedforward neural network is that the CNN contains 3 dimensions: width, height and depth. Not all neurons in one layer are fully connected to neurons in the next layer. There are three different type of layers in a Convolutional Neural Network, which are also different to feedforward neural networks:

Convolution Layer

Convolution puts the input image through several convolutional filters. Each filter activates certain features, such as: edges, colors or objects. Next, the feature map is created out of them. The deeper the network goes the more sophisticated those filters become. The convolutional layer automatically learns which features are most important to extract for a specific task.

Rectified linear units (ReLU)

The goal of this layer is to improve the training speed and impact. Negative values in the layers are removed.

Pooling/Subsampling

Pooling simplifies the output by performing nonlinear downsampling. The number of parameters that the network needs to learn about gets reduced. In convolutional neural networks, the operation is useful since the outgoing connections usually receive similar information.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

AI and Ethics is a complex and ofthen discussed topic at different conferences, usergroups and forums. It even got picked up by the European commission. I would argue that it should actually go one step further: it should be part of every corporate responsibility strategy – just like social and environmental elements.

AI Ethics: what is it about?

Since I am heading the Data Strategy at a large enterprise, I am not only confronted with technical and use-case challenges, but also with legal and compliance topics around data. This might sound challenging and “boring”, but it isn’t neither one of them. Technical challenges are often more complex than the legal aspects of data. Many companies state that legal is blocking their data inititives, but often they simply didn’t include legal and privacy on their strategy. So what should you consider when talking about AI Ethics? Basically, it consists out of three building blocks.

Robust

The first building block of ethics is the robustness of data. This is mainly a technical challenge, but it needs to be done right in all senses. It consists of platforms that are prone to errors and vulnerabilities. It is all about access control, access logging and prevention. Data systems should track who accessed data and prevent unrightfull access. Also, it should implement the “need to know” principle: within a large enterprise, one should only access data that is relevant to his/her job purpose. After finishing the project, access should be revoked.

Ethical

Ethics in AI is an important topic, and bias happens often. There are numerous samples out when algorithms use bias. We are humans and are influenced by bias. Bias comes from how we grew up, what experiences we made in life and a lot of our environment. Bias is bad though, as it limits our thinking. In psychology, there is a term for how to overcome this: fast and slow thinking. Imagine you have a job interview (you are the interviewer). A candidate walks in and she immediately reminds you because of some aspects about a person you met years ago and had difficulties with. During the job interview, you might not like her, even though she would be different. Your brain went into fast thinking – input-output. This is built in our brains to prevent us from danger, but often drives bias. It helps us driving a car, doing sports and alike. If you see an obstacle in your way driving a car, you need to react fast. There is no time to think over it again. However, when making decisions, you need to remove bias and think slow.

Slow thinking is challenging and you fully need to overcome bias. If you let bias dominate you, you won’t be capable of doing good decisions. Coming back to the interview example, you might reject the candidate because of your bias. After some month, this person found a job at your competitor and is building more advanced models than your company. You lost a great candidate because of your bias. This isn’t good, right?

There are other aspects to ethicas and I could probably write about this an entire series. But you also need to consider other topics, such as harrasement in algorithms. If your algorithms don’t take ethics into consideration, it isn’t just about acting wrong. You will also loose the credibility with your customers and thus start to see financial impact as well!

Legal

Last but not least, your data strategy should reflect all building blocks of legal frameworks. With the right to forget, this needs to be implemented in your systems. In enterprise environments, it isn’t easy at all. There is a lot of legacy and different systems consuming data. To tackle this from a technical perspective, it is necessary to harmonize your data models. Depending on your company ownership and structure, you need to implement GDPR and/or SOX. Different industries even come with more regulations, such as the finance industry, giving you more challenges around data. It is very important to talk to your legal department and make them your friends at an early stage in order to succeed!

So what is next for AI Ethics?

I keep it with the previous statement mentioned several times: work closely with Legal and Privacy in order to achieve a responsible strategy towards data and AI. A lot of people I know claim that AI Ethics rather blocks their strategy on data, but I argue it is the other way around: just because you can do stuff with data, it doesn’t justify doing all of what you potentially could do. By the end of the day, you have customers that should trust you. Don’t miss-use this trust and build an ethical strategy on it. Work with those people that know it best – Privacy, Security and Legal. Then – and only then – you will succeed.

I also recommend you reading my post about data access.

Credits: the three pillar points weren’t invented by myself, so I want to credit those people that gave me the ideas around it: our corporate lawyer Daniel, our Privacy Officer Paul and our Legal Counsel Doris.

During the past tutorials, we have aquired a lot of knowledge about Spark. Now, we are with the last tutorial on Spark, where we will have a look at Cube and Rollup. Basically both are useful for multi-dimensional data for further processing.

Data for Spark Rollup and Cube functions

First, let’s create a dataset that we later want to work with. Our dataset is the monthly salary of people working in Finance or Sales:

employees = spark.createDataFrame([("Mario", 4400, "Sales")\
                                  , ("Max", 3420, "Finance")\
                                  , ("Sue", 5500, "Sales")\
                                  , ("Tom", 6700, "Finance")]\
                                 , ("name", "salary", "department"))

We then use the first function – rollup. We want to have the rollup to be on the department and the name of the person.

employees.rollup(employees.department, employees.name)\
            .sum()\
            .withColumnRenamed("sum(salary)", "salary")\
            .orderBy("department", "salary")\
            .show()

Here you can see the output (I will discuss it after you reviewed it):

+----------+-----+------+
|department| name|salary|
+----------+-----+------+
|      null| null| 20020|
|   Finance|  Max|  3420|
|   Finance|  Tom|  6700|
|   Finance| null| 10120|
|     Sales|Mario|  4400|
|     Sales|  Sue|  5500|
|     Sales| null|  9900|
+----------+-----+------+

We have several lines in this now. Let’s look at it line-by-line:

  • The first line is consisting of two null values and the sum of all salaries. So, this would represent the entire company. Basically, it fills department and name with null, since it is neither a department nor a specific person – it is all departments and all persons in it.
  • The second and third line are Max and Tom, who work in the finance department
  • The fourth line is the sum of the finance department; here you see “null” in the name, since it isn’t a name, but the entire department
  • The same story continues for the following lines with the sales department

So, basically, we get different things: (A) the sum of all revenues, (B) the individual values and (C) the revenues per department. Now, let’s build the cube:

employees.cube(employees.department, employees.name)\
            .sum()\
            .withColumnRenamed("sum(salary)", "salary")\
            .orderBy("department", "salary")\
            .show()

Here, the results are in even more dimensions. First, we have the values of each person, but not from the department. Then, we have all results and then again the departments and individuals in it. The cube isn’t relevant for us for this calculation much. The background is that a cube creates all possible combinations, whereas the rollup only creates hierarchies. The cube also treats null’s as a possible combination, that’s why we have the individuals here several times. Here is the output:

+----------+-----+------+
|department| name|salary|
+----------+-----+------+
|      null|  Max|  3420|
|      null|Mario|  4400|
|      null|  Sue|  5500|
|      null|  Tom|  6700|
|      null| null| 20020|
|   Finance|  Max|  3420|
|   Finance|  Tom|  6700|
|   Finance| null| 10120|
|     Sales|Mario|  4400|
|     Sales|  Sue|  5500|
|     Sales| null|  9900|
+----------+-----+------+

I hope you liked the tutorials on Spark. There is much more to learn – e.g. about machine learning or different libraries for that. Make sure to check out the tutorial section in order to figure that out.

If you enjoyed this tutorial on spark rollup and cube, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. Your learning journey can still continue. For full details about Apache Spark, make sure to visit the official page.

For Data itself, there are a lot of different sources that are needed. Based on the company and industry, they differ a lot. However, to create a complex view on your company, it isn’t necessary only to have your own data. There are several other data sources you should consider.

The three data sources

The three data sources

Data you already have

The first data source – data you have – seems to be the easiest. However, it isn’t as easy as you might believe. Bringing your data in order is actually a very difficult task and can’t be achieved that easy. I’ve written several blog posts here about the challenges around data and you can review them. Basically, all of them focus on your internal data sources. I won’t re-state them in detail here, but it is mainly about data governance and access.

Data that you can acquire

The second data source – data you can acquire – is another important aspect. By acquire I basically mean everything that you don’t have to pay to an external party as data provider. You might use surveys (and pay for it as well) or acquire the data from open data platforms. Also, you might collect data from social media or with other kind of crawlers. This data source is very important for you, as you can get great overview and insights into your specific questions.

In the past, I’ve seen a lot of companies utilising the second one and we did a lot on that aspect. For this kind of data, you don’t necessarily have to pay for it – some data sources are free. And if you pay for something, you don’t pay for the data itself but rather for the (semi)-manual way of collecting it. Also here, it differs heavily from industry to industry and what the company is all about. I’ve seen companies collecting data from news sites to get insights into their competition and mentions or simply by scanning social media. A lot is possible with this aspect of data source.

Data you can buy

The last one – data you can buy – is easy to get but very expensive in cash-out terms. There are a lot of data providers selling different kind of data. Often, it is demographic data or data about customers. Different platforms collect data from a large number of online sites and thus track individuals over different sites and their behavior. Such platforms then sell this kind of data to marketing departments with more insights. Also here, you can buy this kind of data from that platforms and thus enrich your own first-party and second-party data. Imagine, you are operating a retail business selling all kind of furniture.

You would probably not know much about your web shop visitors, since they are anonymous until they buy something. With data bought from such kind of data providers, it would now be possible for you to figure out if an anonymous visitor is an outdoor enthusiast. You might adjust your offers to match his or her interest best. Or, you might learn that the person visiting your shop recently bought a countryside house with a garden. You might now adjust your offers to present garden furniture or Barbecue accessories. With this kind of third party data, you can achieve a lot and better understand your customers and your company.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. If you look for open data, I would recommend you browsing some open data catalogs like the open data catalog from the U.S. government.

In the previous tutorial, we learned about data cleaning in Spark. Today, we will look at different options to work with columns and rows in Spark. First, we will start with renaming columns. We did this already several times so far, and it is a frequent task in data engineering. In the following sample, we will rename a column:

thirties = clean.select(clean.name, clean.age.between(30, 39)).withColumnRenamed("((age >= 30) AND (age <= 39))", "goodage")
thirties.show()

As you could see, we took the old name – which was very complicated – and renamed it to “goodage”. The output should be the following:

+-----+-------+
| name|goodage|
+-----+-------+
|  Max|  false|
|  Tom|   true|
|  Sue|  false|
|Mario|   true|
+-----+-------+

In the next sample, we want to filter columns on a string-expression. This can be done with the “endswith” method being applied to the column name that should be filtered. In the following sample, we want to filter all contacts that are from Austria:

austrian = clean.filter(clean.lang.endswith("at"))
austrian.show()

As you can see, only one result is returned (as expected):

+---+-----+---+-----+
|nid| name|age| lang|
+---+-----+---+-----+
|  1|Mario| 35|DE-at|
+---+-----+---+-----+

Removing Null-Values in Spark

In our next sample, we want to filter all rows that contain null values in a specific column. This is useful to get a glimpse of null values in datasets. This can easily be done by applying the “isNull” function on a column:

nullvalues = dirtyset.filter(dirtyset.age.isNull())
nullvalues.show()

Here, we get the two results containing these null values:

+---+----+----+-----+
|nid|name| age| lang|
+---+----+----+-----+
|  4| Tom|null|AT-ch|
|  5| Tom|null|AT-ch|
+---+----+----+-----+

Another useful function in Spark is the “Like” function. If you are familiar with SQL, it should be easy to apply this. If not – basically, it scans text in a column, which contains one or more specific literals. You can use different expressions to filter for patterns. The following one filters all people that have “DE” in it, independent of what follows afterwards (“%”):

langde = clean.filter(clean.lang.like("DE%"))
langde.show()

Here, we get all items:

+---+-----+---+-----+
|nid| name|age| lang|
+---+-----+---+-----+
|  2|  Max| 46|DE-de|
|  4|  Tom| 34|DE-ch|
|  1|Mario| 35|DE-at|
+---+-----+---+-----+

Shorten Strings in a Column in Spark

Several times, we want to shorten string values. The following sample takes the first 2 letters with the “substr” function on the column. We afterwards apply the “alias” function, which renames the function (similar to the “withColumnRenamed” function above).

shortnames = clean.select(clean.name.substr(0,2).alias("sn")).collect()
shortnames

Also here, we get the expected output; please note that it isn’t unique anymore (names!):

[Row(sn='Ma'), Row(sn='To'), Row(sn='Su'), Row(sn='Ma')]

Spark offers much more functionality to manipulate Columns, so just play with the API :). In the next tutorial, we will have a look at how to build Cubes and Rollups in Spark

If you enjoyed this tutorial, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. Your learning journey can still continue. For full details about Apache Spark, make sure to visit the official page.


In the last couple of posts, we’ve learned about various aspects of Machine Learning. Now, we will focus on other aspects of Machine Learning: Deep Learning. In this post, I will give an introduction to deep learning. Over the last couple of years, this was the hype around AI. But what is so exciting about Deep Learning? First, let’s have a look at the concepts of Deep Learning.

A brief introduction to Deep Learning

Basically, Deep Learning should function similar to the human brain. Everything is built around Neurons, which work in networks (neural networks). The smallest element in a neural network is the neuron, which takes an input parameter and creates an output parameter, based on the bias and weight it has. The following image shows the Neuron in Deep Learning:

The Neuron in a Neuronal Network in Deep Learning
The Neuron in a Neuronal Network in Deep Learning

Next, there are Layers in the Network, which consists of several Neurons. Each Layer has some transformations, that will eventually lead to an end result. Each Layer will get much closer to the target result. If your Deep Learning model built to recognise hand writing, the first layer would probably recognise gray-scales, the second layer a connection between different pixels, the third layer would recognise simple figures and the fourth layer would recognise the letter. The following image shows a typical neural net:

A neural net for Deep Learning
A neural net for Deep Learning

A typical workflow in a neural net calculation for image recognition could look like this:

  • All images are split into batches
  • Each batch is sent to the GPU for calculation
  • The model starts the analysis with random weights
  • A cost function gets specified, that compares the results with the truth
  • Back propagation of the result happens
  • Once a model calculation is finished, the result is merged and returned

How is it different to Machine Learning?

Although Deep Learning is often considered to be a “subset” of Machine Learning, it is quite different. For different aspects, Deep Learning often achieves better results than “traditional” machine learning models. The following table should provide an overview of these differences:

Machine Leaning Deep Learning
Feature extraction happens manuallyFeature extraction is done automatically
Features are used to create a model that categorises elementsPerforms “end-to-end learning” 
Shallow learning  Deep learning algorithms scale with data

This is only the basic overview of Deep Learning. Deep Learning knows several different methods. In the next tutorial, we will have a look at different interpretations of Deep Learning.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

Over the past tutorials, we have acquired quite some knowledge about Python and Spark. We know most of the relevant statements in Apache Spark and are now ready for some data cleaning, which is one of the key tasks of a data engineer. Before we get started, we need to have something that a data engineer (unfortunately) often works with: dirty data. We will focus on how to deal with missing, corrupt and wrong data in Spark

Dealing with wrong data in Spark

In the following sample, we create some data. Note, that there are some errors in it:

  1. The first field (id: 1) is having the wrong language – there is no “AT-at”, but it needs to be “DE-at”
  2. The fourth field (id: 4) is having a null-value “None” and also the wrong language – AT-ch, whereas it should be “DE-ch”
  3. The fifth field (id: 5) is a duplicate of the previous one
dirtyset = spark.createDataFrame([(1, "Mario", 35, "AT-at")\
                                  , (2, "Max", 46, "DE-de")\
                                  , (3, "Sue", 22, "EN-uk")\
                                  , (4, "Tom", None, "AT-ch")\
                                  , (5, "Tom", None, "AT-ch")]\
                                 , ("nid", "name", "age", "lang"))
dirtyset.show()

The dataset should look like this:

+---+-----+----+-----+
|nid| name| age| lang|
+---+-----+----+-----+
|  1|Mario|  35|AT-at|
|  2|  Max|  46|DE-de|
|  3|  Sue|  22|EN-uk|
|  4|  Tom|null|AT-ch|
|  5|  Tom|null|AT-ch|
+---+-----+----+-----+

Deleting duplicates in Spark

The first thing we want to do is removing duplicates. There is an easy function for that in Apache Spark – called “dropDuplicates”. When you look at the previous dataset, it might be very easy to figure out that the last one is a duplicate – but wait! For Apache Spark, it isn’t that easy, because the id is different – it is 4 vs 5. Spark doesn’t figure out which columns are relevant to take duplicates from. If we would apply the “dropDuplicates” to the dataframe, it wouldn’t remove anything. So, we need to apply the columns it should take into account when removing duplicates. We tell spark that we want to work with “name” and “age” for this purpose and pass a list of these to the function:

nodub = dirtyset.dropDuplicates(["name", "age"])
nodub.show()

When you now execute the code, it should result in the cleaned dataset:

+---+-----+----+-----+
|nid| name| age| lang|
+---+-----+----+-----+
|  2|  Max|  46|DE-de|
|  4|  Tom|null|AT-ch|
|  3|  Sue|  22|EN-uk|
|  1|Mario|  35|AT-at|
+---+-----+----+-----+

This was very easy so far. Now, let’s take care of the wrong language values.

Replacing wrong values in columns in Apache Spark

The Spark “na” functions provide this for us. There is a function called “na.replace” that takes the old value and the new value to replace on. Since we have two different values, we need to call the function twice:

reallangs = nodub.na.replace("AT-at", "DE-at").na.replace("AT-ch", "DE-ch")
reallangs.show()

As you can see, the values are replaced accordingly and now also this should work:

+---+-----+----+-----+
|nid| name| age| lang|
+---+-----+----+-----+
|  2|  Max|  46|DE-de|
|  4|  Tom|null|DE-ch|
|  3|  Sue|  22|EN-uk|
|  1|Mario|  35|DE-at|
+---+-----+----+-----+

Only one last thing is now necessary: replacing null values in the dataset.

Replacing null values in Apache Spark

Dealing and working with null-values is always a complicated thing to achieve. Null-values basically means that we don’t know something and that it might have a negative impact on our future predictions and analysis. However, there are some solutions:

  • Ignoring null-values by removing the rows containing them
  • adding a standard-value (e.g. in our case either 0 or a very high value)
  • Using statistics to calculate the most appropriate value

The first one would take away a lot of data. The more columns you have, the higher the possibility is to have null-values! This would reduce the relevant samples and thus the accuracy of predictions. The second one would add some other challenges in our case: it would either increase the average to a very high number or to a very low number. So, only the last opportunity stays for us: calculate a value for the dataset. If you have several features, such as name, it is somewhat easier to do so. In our sample, we use a very easy method: just calculating the average from the correct values. The following sample does exactly that, I will explain each step after the sample:

from pyspark.sql.functions import *
avage = reallangs.groupby().agg(avg(reallangs.age)).collect()
rage = int(avage[0][0])
clean = reallangs.na.fill(rage)
clean.show()

So, what has happend here?

  • We start by calculating the average from all ages we have. This is done with the agg() function
  • Next, we convert the average (which is a float) to an int. Since it is of type row, we have to use two indices to get to our value
  • We then call the “na.fill” with the previously calculated average

The output of that should look like the following:

+---+-----+---+-----+
|nid| name|age| lang|
+---+-----+---+-----+
|  2|  Max| 46|DE-de|
|  4|  Tom| 34|DE-ch|
|  3|  Sue| 22|EN-uk|
|  1|Mario| 35|DE-at|
+---+-----+---+-----+

That’s all! Easy, isn’t it? Of course, we could add much more intelligence to it, but let’s keep it as is right now. More will be added in the tutorials on Data Science, which are about to come soon :). Dealing with wrong data in Spark is one of the key things you will do before going for Data Science.

If you enjoyed this tutorial, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. Your learning journey can still continue. For full details about Apache Spark, make sure to visit the official page.

In the first posts, I introduced different type of Machine Learning concepts. On of them is classification. Basically, classification is about identifying to which set of categories a certain observation belongs. Classifications are normally of supervised learning techniques. A typical classification is Spam detection in e-mails – the two possible classifications in this case are either “spam” or “no spam”. The two most common classification algorithms are the naive bayes classification and the random forest classification.

What classification algorithms are there?

Basically, there are a lot of classification algorithms available and when working in the field of Machine Learning, you will discover a large number of algorithms every time. In this tutorial, we will only focus on the two most important ones (Random Forest, Naive Bayes) and the basic one (Decision Tree)

The Decision Tree classifier

The basic classifier is the Decision tree classifier. It basically builds classification models in the form of a tree structure. The dataset is broken down into smaller subsets and gets detailed by each leave. It could be compared to a survey, where each question has an effect on the next question. Let’s assume the following case: Tom was captured by the police and is a suspect in robing a bank. The questions could represent the following tree structure:

Basic sample of a Decision Tree
Basic sample of a Decision Tree

Basically, by going from one leave to another, you get closer to the result of either “guilty” or “not guilty”. Also, each leaf has a weight.

The Random Forest classification

Random forest is a really great classifier, often used and also often very efficient. It is an ensemble classifier made using many decision tree models. There are ensemble models that combine the different results. The random forest model can both run regression and classification models.

Basically, it divides the data set into subsets and then runs on the data. Random forest models run efficient on large datasets, since all compute can be split and thus it is easier to run the model in parallel. It can handle thousands of input variables without variable deletion. It computes proximities between pairs of cases that can be used in clustering, locating outliers or (by scaling) give interesting views of the data.

There are also some disadvantages with the random forest classifier: the main problem is its complexity. Working with random forest is more challenging than classic decision trees and thus needs skilled people. Also, the complexity creates large demands for compute power.

Random Forest is often used by financial institutions. A typical use-case is credit risk prediction. If you have ever applied for a credit, you might know the questions being asked by banks. They are often fed into random forest models.

The Naive Bayes classifier

The Naive Bayes classifier is based on prior knowledge of conditions that might relate to an event. It is based on the Bayes Theorem. There is a strong independence between features assumed. It uses categorial data to calculate ratios between events.

The benefit of Naive Bayes are different. It can easily and fast predict classes of data sets. Also, it can predict multiple classes. Naive Bayes performs better compared to models such as logistic regression and there is a lot less training data needed.

A key challenge is that if a categorical variable has a category which was not checked in the training data set, then model will assign a 0 (zero) probability, which makes it unable for prediction. Also, it is known to be a rather bad estimator. Also, it is rather complex to use.

As stated, there are many more algorithms available. In the next tutorial, we will have a look at Deep Learning.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

In our previous tutorial, we looked at how to join data in Apache Spark. Another frequently used thing when working with data is to reduce the number of results by limit data in spark to a specific number. This is done with the limit statement.

Limit Data in Spark with the limit() method

Basically, the limit statement is very easy. It is easy to use since it only takes the number of results to return as a parameter. The limit statement is usually applied with an order-statement. In the following sample, we use the limit statement on the df_ordered dataset which we introduced in the tutorial on filtering and ordering data in Spark. After the sample, I will explain what the steps are.

sumed = df_ordered.groupby(df_ordered.personid) \
                  .agg(sum(df_ordered.price)) \
                  .toDF("pid", "ordervalue")
newPers = df_ordered.join(sumed, sumed.pid == df_ordered.personid, "inner") \
                    .drop("productname", "price", "pid").distinct() \
                    .orderBy("ordervalue", ascending=False) \
                    .limit(10)
newPers.show()

Basically, the above sample shows the top 10 customers from our dataset. The following steps are applied:

  1. Grouping the dataset by the person id
  2. Creating the sum of products bought by the customer
  3. And creating a new dataframe from it

We then join the dataset of ordered values back into the person data. Spark doesn’t allow appending this data and keeping all the original values (like personname, age, …) in it. In the next statement, we do the following:

  1. We join the newly created dataset into the original dataset
  2. Remove the unnecessary items such as productname, price and pid
  3. Order everything by ordervalue descending
  4. and limit the results to only have the top 10 customers.

Now, the result should look like the following:

+--------+----------+---+-----+------------------+
|personid|personname|age|state|        ordervalue|
+--------+----------+---+-----+------------------+
|     162|     Heidi| 37|   GA|24269.340000000226|
|      38|     Daisy| 45|   CA|23799.450000000204|
|     140|     Elsie| 64|   FL|  23759.5400000002|
|      18|      Ruby| 47|   GA|23414.710000000185|
|     180|   Caitlin| 65|   NY| 23124.71000000019|
|     159|    Taylor| 41|   NY|23054.670000000162|
|     131|     Aaron| 67|   TX| 23049.63000000016|
|      49|     Dylan| 47|   TX| 23029.68000000018|
|     136|    Isabel| 52|   CA| 22839.85000000014|
|      43|     Mason| 30|   CA|22834.710000000185|
+--------+----------+---+-----+------------------+

The limit statement itself is very easy, however, it is a bit more complex on how to get towards using the statement ;). In the next tutorial, we will look at how to deal with corrupt data – get ready for some data cleaning!

If you enjoyed this tutorial, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. Your learning journey can still continue. For full details about Apache Spark, make sure to visit the official page.