In one of my last posts, I wrote about the fact that the Cloud is already more PaaS/FaaS than IaaS. In fact, IaaS doesn’t bring much value over traditional architectures. There are still some advantages, but they remain limited. If you want a future-proof architecture, analytics needs to be serverless analytics. In this article, I will explain why.

What is serverless analytics?

Serverless analytics follows the same concept as other serverless technologies. The basic idea is to significantly reduce the work spent on infrastructure and servers. Modern environments allow us to “only” bring the code, and the cloud provider takes care of everything else. This is basically the dream of every developer. Do you know the statement “it works on my machine”? With serverless, this gets much easier. You only need to focus on the app itself, without any requirements on the operating system and stack. Also, execution is task- or consumption-based, which means you eventually only pay for what is used. If your service isn’t utilised, you don’t pay for it. You can also achieve this with IaaS, but with serverless it is part of the concept and not something you need to enable explicitly.

With analytics, we are now also marching towards the serverless approach. But why only now? Serverless has been around for quite some time. Well, if we look at the data analytics community, it has always been a bit slower than the overall industry. When most tech stacks had already migrated to the Cloud, analytics projects were still carried out with large Hadoop installations in the local data center. Even back then, the Cloud was already superior, yet a lot of people still insisted on on-premise setups. Now, data analytics workloads are moving more and more into the Cloud.

What are the components of Serverless Analytics?

  • Data Integration Tools: Most cloud providers offer easy-to-use tools to integrate data from different sources, usually with a GUI that makes them easier to work with.
  • Data Governance: Data catalogs and quality management tools are also often part of these solutions. This enables far better integration.
  • Different Storage options: For serverless analytics, storage must always be decoupled from the analytics layer. Normally, different databases are available, but most of the data is stored on object stores. Real-time data is consumed via a real-time engine (see the sketch after this list).
  • Data Science Labs: Data scientists need to experiment with data. Major cloud providers have data science labs available, which enable this sort of work.
  • API for integration: With the use of APIs, it is possible to bring the results back into production or decision-making systems.
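To make the pay-per-query idea more concrete, here is a minimal sketch of querying data that sits on an object store with AWS Athena via boto3. The database name, table and S3 output bucket are placeholders I made up for illustration; the point is that there is no server to manage and you are billed per query.

import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Submit a SQL query against data on S3 – no cluster to provision or patch.
# "salesdb", "orders" and the output bucket are hypothetical placeholders.
response = athena.start_query_execution(
    QueryString="SELECT department, SUM(amount) AS total FROM orders GROUP BY department",
    QueryExecutionContext={"Database": "salesdb"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then fetch the result set.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])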

How is it different to Kubernetes or Docker?

At the moment, there is also a big discussion about whether Kubernetes or Docker will do this job for analytics. However, containers again require servers underneath and thus increase the maintenance effort at some point. All cloud providers have different Kubernetes and Docker solutions available, which allows an easy migration later on. However, I would suggest going for serverless solutions immediately and avoiding containers where possible.

What are the financial benefits?

It is challenging to measure the benefits. If price is the only comparison, it is probably not the best way to do so. Serverless analytics greatly reduces the cost of maintaining your stack – it gets close to zero. The only thing you need to focus on from now on is your application(s) – and they should eventually produce value. It also becomes easier to measure IT by its business impact: you get a bill for the applications, not for maintaining a stack. If you run an analysis, you get a quote for it, and the business impact may or may not justify the investment.

If you want to learn more about Serverless Analytics, I can recommend this tutorial. (Disclaimer: I am not affiliated with Udemy!)

In the last two posts we introduced the core concepts of Deep Learning, the Feedforward Neural Network and the Convolutional Neural Network. In this post, we will have a look at two other popular deep learning techniques: the Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM).

Recurrent Neural Network

The main difference to the previously introduced networks is that the Recurrent Neural Network provides a feedback loop to the previous neuron. This architecture makes it possible to remember important information about the input the network has received and to take what was learned into account when processing the next input. RNNs work very well with sequential data such as sound, time series (sensor) data or written natural language.

The advantage of an RNN over a feedforward network is that the RNN can remember its output and use it to predict the next element in a series, while a feedforward network is not able to feed the output back into the network. Real-time gesture tracking in videos is another important use case for RNNs.
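To illustrate the feedback loop, here is a minimal numpy sketch of a single recurrent step, assuming made-up weight matrices W, U and bias b: the new hidden state depends on the current input and on the previous hidden state, which is what lets the network carry information forward through a sequence.

import numpy as np

def rnn_step(x, h_prev, W, U, b):
    # The previous hidden state h_prev is fed back in – this is the recurrence.
    return np.tanh(x @ W + h_prev @ U + b)

# Toy dimensions: 3 input features, 4 hidden units (arbitrary example values).
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(3, 4)), rng.normal(size=(4, 4)), np.zeros(4)

h = np.zeros(4)                       # initial hidden state
for x in rng.normal(size=(5, 3)):     # a sequence of 5 time steps
    h = rnn_step(x, h, W, U, b)       # each step sees the previous state
print(h)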

A Recurrent Neural Network

Long Short-Term Memory

A usual RNN has a short-term memory, which is already helpful in some respects. However, there are requirements for more advanced memory functionality. Long Short-Term Memory solves this problem. The researchers Sepp Hochreiter and Jürgen Schmidhuber introduced the LSTM. LSTMs enable RNNs to remember inputs over a long period of time. Therefore, LSTMs are used in combination with RNNs for sequential data with long time lags in between.

An LSTM learns over time which information is relevant and which isn’t. This is done by assigning weights to information, which is then routed through three different gates within the LSTM: the input gate, the output gate and the “forget” gate.
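To make the three gates tangible, here is a minimal numpy sketch of one LSTM cell step under the standard formulation; the weight matrices W, U and biases b are placeholders that would normally be learned during training.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # Unpack per-gate parameters (input, forget, output, candidate cell).
    (W_i, W_f, W_o, W_c), (U_i, U_f, U_o, U_c), (b_i, b_f, b_o, b_c) = W, U, b

    i = sigmoid(x @ W_i + h_prev @ U_i + b_i)        # input gate: what new info to store
    f = sigmoid(x @ W_f + h_prev @ U_f + b_f)        # forget gate: what old info to drop
    o = sigmoid(x @ W_o + h_prev @ U_o + b_o)        # output gate: what to expose
    c_cand = np.tanh(x @ W_c + h_prev @ U_c + b_c)   # candidate cell state

    c = f * c_prev + i * c_cand   # updated long-term memory
    h = o * np.tanh(c)            # updated short-term output
    return h, c

# Tiny usage example with random (untrained) placeholder weights: 3 inputs, 2 hidden units.
rng = np.random.default_rng(1)
W = [rng.normal(size=(3, 2)) for _ in range(4)]
U = [rng.normal(size=(2, 2)) for _ in range(4)]
b = [np.zeros(2) for _ in range(4)]

h, c = np.zeros(2), np.zeros(2)
for x in rng.normal(size=(5, 3)):     # a sequence of 5 time steps
    h, c = lstm_step(x, h, c, W, U, b)
print(h, c)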

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend Kaggle.

Working with data is a complex endeavour and not done in a few days. It is rather a matter of several sequential steps that lead to a final output. In this post, I present the data science process for project execution in data science.

What is the Data Science Process?

Data science often consists mainly of data wrangling and feature engineering before one can get to the exciting stuff. Since data science is often very exploratory, few processes have evolved around it (yet). In the data science process, I group the work into three main steps, each with several sub-steps. Let’s first start with the three main steps:

  • Data Acquisition
  • Feature Engineering and Selection
  • Model Training and Extraction

Each of the main process steps contains several sub-steps. I will now describe them in a bit more detail.

Step 1: Data Acquisition

Data engineering is the main ingredient in this step. After a business question has been formulated, it is necessary to look for the data. In an ideal setup, you would already have a data catalog in your enterprise. If not, you might need to ask several people until you have found the right place to dig deeper.

First of all, you need to acquire the data. These might be internal sources, but you might also combine them with external sources. In this context, you might want to read about the different data sources you need. Once you have had a first look at the data, it is necessary to integrate it.

Data integration is often perceived as a challenging task. You need to set up a new environment to store the data, or you need to extend an existing schema. A common practice is to build a data science lab. A data science lab should be an easy-to-use platform for data engineers and data scientists to work with data. A best practice is to use a prepared environment in the cloud for it.

After integrating the data comes the heavy part: cleaning the data. In most cases, data is very messy and thus needs a lot of cleaning. This is also mainly carried out by data engineers, alongside data analysts, in a company. Once you are done with the data acquisition part, you can move on to the feature engineering and selection step. A minimal cleaning sketch follows below.
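As a small illustration of typical cleaning work, here is a hedged pandas sketch on a made-up customer table; the column names and rules are purely illustrative.

import pandas as pd

# Hypothetical raw extract with the usual problems: duplicates, missing values, wrong types.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": ["34", "46", "46", None, "29"],
    "country": ["AT", "DE", "DE", "at", "CH"],
})

clean = (
    raw.drop_duplicates(subset="customer_id")        # remove duplicate records
       .assign(age=lambda d: pd.to_numeric(d["age"], errors="coerce"),  # fix types
               country=lambda d: d["country"].str.upper())              # harmonise codes
       .dropna(subset=["age"])                        # drop rows we cannot repair
)
print(clean)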

Typically, this first process step can be very painful and long-lasting. How long it takes depends on different factors within an enterprise, such as the data quality itself, the availability of a data catalog and corresponding metadata descriptions. If your maturity in all these areas is very high, it can take a few days to a week, but on average it is rather 2 to 4 weeks of work.

Step 2: Feature Engineering and Selection

The next step is a very important one in the data science process: feature engineering. Features are very important for machine learning and have a huge impact on the quality of the predictions. For feature engineering, you have to understand the domain you are in and what to use from it. One needs to understand what data to use and for what reason.

After the feature engineering itself, it is necessary to select the relevant features via feature selection. A common mistake is overfitting the model, often driven by “feature explosion”: too many features are created and the predictions aren’t accurate anymore. Therefore, it is very important to select only those features that are relevant to the use case and thus carry some significance.

Another important step is the development of the cross-validation setup. This is necessary to check how the model will perform in practice. Cross-validation measures the performance of your model and gives you insights into how well it generalises. After that comes hyperparameter tuning, where hyperparameters are fine-tuned to improve the predictions of your model. A small sketch of both follows below.
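To make cross-validation and hyperparameter tuning concrete, here is a minimal scikit-learn sketch on a toy dataset; the model choice and parameter grid are illustrative assumptions, not a recommendation.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validation: estimate how the model performs on unseen data.
model = RandomForestClassifier(random_state=42)
scores = cross_val_score(model, X, y, cv=5)
print("CV accuracy:", scores.mean())

# Hyperparameter tuning: search a small (illustrative) grid with cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 200], "max_depth": [5, None]},
    cv=5,
)
grid.fit(X, y)
print("Best parameters:", grid.best_params_)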

This process step is carried out mainly by data scientists, but still supported by data engineers. The next and final step in the data science process is Model Training and Extraction.

Step 3: Model Training and Extraction

The last step in the process is model training and extraction. In this step, the algorithm(s) for the prediction are selected and compared with each other. In order to ease the work here, it is useful to put your whole workflow into a pipeline. (Note: I will explain the concept of the pipeline in a later post.) After the training is done, you can move on to the predictions themselves and bring the model into production.
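Here is a hedged sketch of what such a pipeline and model comparison could look like with scikit-learn; the two candidate algorithms and the joblib export are assumptions for illustration.

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Compare two candidate algorithms inside the same preprocessing pipeline.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "svm": SVC(),
}
for name, estimator in candidates.items():
    pipeline = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    print(name, cross_val_score(pipeline, X, y, cv=5).mean())

# "Extraction": persist the chosen pipeline so it can be deployed to production.
best = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression(max_iter=5000))]).fit(X, y)
joblib.dump(best, "model.joblib")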

The following illustration outlines the now presented process:

This image describes the Data Science process in its three steps: Data Acquisition, Feature Engineering and Selection, Model Training and Extraction
The Data Science Process

The data science process itself can easily be carried out in a Scrum or Kanban approach, depending on your preferred management style. For instance, you could treat each of the three process steps as a sprint. The first sprint, “Data Acquisition”, might last longer than the other sprints, or you could even break it into several sprints. For Agile Data Science, I can recommend reading this post.

About 1.5 years ago I wrote that the Cloud is not the future. Instead, I claimed that it is the present. In fact, most companies are already embracing the Cloud. Today, I want to revisit this statement and take it to the next level: Cloud IaaS is not the future.

What is wrong about Cloud IaaS?

Cloud IaaS was great in the early days of the Cloud. It gave us the freedom to move our workloads to the Cloud in a lift-and-shift scenario. It also greatly improved how we can handle workloads in a more dynamic way. Adding servers and shutting them down on demand was really easy, whereas in an on-premise scenario this was far from easy. All big cloud providers today offer a comprehensive toolset and third-party applications for IaaS solutions. But why is it not as great as it used to be?

Honestly, I was never a big fan of IaaS in the Cloud. To put it bluntly, it didn’t improve much (other than scale and flexibility) over the on-premise world. With Cloud IaaS, we still have to maintain all our servers like in the old on-premise days. Security patches, updates, fixes and the like stay with those who build the services. Since the early days, I have been a big fan of Cloud PaaS (Platform as a Service).

What is the status of Cloud PaaS?

Over the last 1.5 years, a lot of mature Cloud PaaS services have emerged. Cloud PaaS has been around for almost 10 years, but the current level of maturity is impressive. Some two years ago, these were mainly general-purpose services, but now they have moved into very specific domains. There are now a lot of services available for things such as IoT or data analytics.

The current trend in Cloud PaaS is definitely the move towards “Serverless Analytics”. Analytics has always been a slow mover when it came to the Cloud. Other functional areas already had cloud-native implementations while analytical workloads were still developed for the on-premise world. Hadoop was one of these projects, but other projects took over and Hadoop is in decline. Now, more analytical applications will be developed with a PaaS stack.

What should you do now?

Cloud PaaS isn’t a revolution or anything spectacularly new. If you have no experience with Cloud PaaS, I would urge you to look at these platforms asap. They will become essential for your business and provide a lot of benefits. Again – it isn’t the future, it is the present!

If you want to learn more about Serverless Analytics, I can recommend this tutorial. (Disclaimer: I am not affiliated with Udemy!)

Large enterprises have a lot of legacy systems in their footprint. This has created a lot of challenges (but also opportunities!) for system integrators. Now that companies strive to become data-driven, it becomes an even bigger challenge. But luckily there is a new concept out there that can help: data abstraction.

Data Abstraction: why do you need it?

If a company wants to become data-driven, it is necessary to unlock all the data that is available within the company. However, this is easier said than done. Most companies have a heterogeneous IT landscape and thus struggle to integrate essential data sources into their analytical systems. In the past, there have been several approaches to this. Data was loaded (with some delay) into an analytical data warehouse. This data warehouse didn’t power any operational systems, so it was decoupled.

However, several things proved to be problematic with data warehouses handling analytical workloads: (a) data warehouses tend to be super-expensive in both cost and operations. They make sense for KPIs and highly structured data, but not for other datasets. And (b) due to the high cost, data warehouses were loaded with hours or even days of delay. In a real-time world, this isn’t good at all.

But – didn’t the data lake solve this already?

Some years ago, data lakes surfaced. They were more efficient in terms of speed and cost than traditional data warehouses. However, the data warehouses kept the master data, which data lakes often need, so a connection between the two had to be established. In the early days, data was simply replicated to do so. Next to data lakes, many other (mainly NoSQL) systems surfaced. Business units acquired various other systems, which made more integration efforts necessary. So there was no end to data silos at all – it even got worse (and will continue to do so).

So, why not give in to the pressure of heterogeneous systems and data stores and try to solve it differently? This is where data abstraction comes into play …

What is it about?

As already introduced, data abstraction should reduce your sleepless nights when it comes to accessing and unlocking your data assets. It is a virtual layer that you add between your data stores and your data consumers to enable one common access path. The following illustration shows this:

Data Abstraction shows how to abstract your data
Data Abstraction

Basically, you build a layer on top of your data sources. Of course, it doesn’t solve the challenges around data integration, but it ensures that consumers can expect one common layer to plug into. It also enables you to exchange the technical layer of a data source without consumers noticing. You might consider re-developing a data source from the ground up in order to make it more performant. Both the old and the new stack will conform to the data abstraction, and thus consumers won’t realise that there are significant changes under the hood.

This sounds really nice. So what’s the (technical) solution to it?

Basically, I don’t recommend any specific technology at this stage. There are several technologies that enable data abstraction. They can be clustered into 3 different areas:

  1. Lightweight SQL Engines: There are several products and tools (both open source and proprietary) available which enable SQL access to different data sources. They not only plug into relational databases, but also into non-relational ones. Most tools provide easy integration and abstraction (see the sketch after this list).
  2. API Integration: It is possible to integrate your data sources via an API layer that eventually abstracts the underlying data sources. The pain of integration is higher than with SQL engines, but it gives you more flexibility on top and a higher degree of abstraction. In contrast to SQL engines, your consumers won’t plug too deeply into database specifics. If you want to go for a really advanced tech stack, I recommend reading about Graphs.
  3. Full-blown solution: There are several proprietary tools available that provide numerous connectors to data sources. What is really great about these solutions is that they also include caching mechanisms for frequent data access. You get much higher performance with limited implementation cost. However, you will lock yourself into a specific solution.
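As a small illustration of the first option, here is a hedged PySpark sketch that exposes two very different sources (a relational database via JDBC and Parquet files on an object store) behind one common SQL layer; the connection string, paths and table names are made-up placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-abstraction-sketch").getOrCreate()

# Source 1: a relational database, reached via JDBC (placeholder connection details).
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://crm-host:5432/crm")
             .option("dbtable", "public.customers")
             .option("user", "reader").option("password", "secret")
             .load())

# Source 2: raw order events stored as Parquet on an object store (placeholder path).
orders = spark.read.parquet("s3a://datalake/orders/")

# The abstraction: consumers only see SQL views, not the underlying systems.
customers.createOrReplaceTempView("customers")
orders.createOrReplaceTempView("orders")

spark.sql("""
    SELECT c.country, SUM(o.amount) AS revenue
    FROM customers c JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.country
""").show()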

Which solution you eventually go for is fully up to you. It depends on the company, its know-how and its characteristics. In most cases, it will also be a combination of different solutions.

So what is next?

There are many tools and services out there that enable data abstraction. Data abstraction is more of a concept than a concrete technology – not even an architectural pattern. In some cases, you might acquire a technology. Or you might abstract your data via an API or a Graph. There are many technologies, tools and services out there to solve your issues.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. If you look for open data, I would recommend browsing some open data catalogs like the open data catalog from the U.S. government.

In the last couple of posts, we’ve learned about various aspects of Machine Learning. Now, we will focus on another aspect of Machine Learning: Deep Learning. After introducing the key concepts of Deep Learning in the previous post, we will have a look at two architectures: the Convolutional Neural Network (CNN) and the Feedforward Neural Network.

The Feedforward Neural Network

Feedforward neural networks are the most general-purpose neural networks. The entry point is the input layer, followed by several hidden layers and an output layer. Each layer has a connection to the previous layer. These connections are one-way only, so nodes can’t form a cycle. The information in a feedforward network only moves in one direction – from the input layer, through the hidden layers, to the output layer. It is the simplest version of a neural network. The below image illustrates the Feedforward Neural Network.

Feedforward Neural Network
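A minimal sketch of such a feedforward network in Keras, assuming a toy classification task with 20 input features and 3 output classes (all layer sizes are arbitrary example values):

import tensorflow as tf

# Information flows strictly forward: input -> hidden layers -> output.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                       # input layer: 20 features
    tf.keras.layers.Dense(64, activation="relu"),      # hidden layer 1
    tf.keras.layers.Dense(32, activation="relu"),      # hidden layer 2
    tf.keras.layers.Dense(3, activation="softmax"),    # output layer: 3 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()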

Convolutional Neural Networks (CNN)

The Convolutional Neural Network is very effective in image recognition and similar tasks. For that reason it is also well suited for video processing. The difference to the feedforward neural network is that the CNN works with 3 dimensions: width, height and depth. Not all neurons in one layer are fully connected to the neurons in the next layer. There are three different types of layers in a Convolutional Neural Network which also set it apart from feedforward neural networks:

Convolution Layer

The convolution layer puts the input image through several convolutional filters. Each filter activates certain features, such as edges, colours or objects, and a feature map is created out of them. The deeper the network goes, the more sophisticated those filters become. The convolutional layers automatically learn which features are most important to extract for a specific task.

Rectified linear units (ReLU)

The goal of this layer is to improve training speed and effectiveness: negative values in the feature maps are set to zero.

Pooling/Subsampling

Pooling simplifies the output by performing nonlinear downsampling. The number of parameters that the network needs to learn is reduced. In convolutional neural networks, this operation is useful since the outgoing connections usually receive similar information.
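Putting the three layer types together, here is a minimal Keras sketch of a small CNN for 28x28 grayscale images with 10 classes; the filter counts and sizes are arbitrary example choices.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),                        # width x height x depth
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),    # convolution + ReLU
    tf.keras.layers.MaxPooling2D((2, 2)),                     # pooling / subsampling
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),    # deeper, more complex filters
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),          # classification output
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()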

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend Kaggle.

AI and ethics is a complex and often discussed topic at different conferences, user groups and forums. It has even been picked up by the European Commission. I would argue that it should actually go one step further: it should be part of every corporate responsibility strategy – just like social and environmental elements.

AI Ethics: what is it about?

Since I am heading the data strategy at a large enterprise, I am not only confronted with technical and use-case challenges, but also with legal and compliance topics around data. This might sound challenging and “boring”, but it is neither of the two. Technical challenges are often more complex than the legal aspects of data. Many companies state that legal is blocking their data initiatives, but often they simply didn’t include legal and privacy in their strategy. So what should you consider when talking about AI ethics? Basically, it consists of three building blocks.

Robust

The first building block of ethics is the robustness of data. This is mainly a technical challenge, but it needs to be done right in all senses. It requires platforms that are hardened against errors and vulnerabilities. It is all about access control, access logging and prevention. Data systems should track who accessed data and prevent unauthorised access. They should also implement the “need to know” principle: within a large enterprise, one should only access data that is relevant to his or her job purpose. After finishing a project, access should be revoked.

Ethical

Ethics in AI is an important topic, and bias happens often. There are numerous examples of algorithms exhibiting bias. We are humans and are influenced by bias. Bias comes from how we grew up, the experiences we made in life and, to a large part, our environment. Bias is bad though, as it limits our thinking. In psychology, there is a concept that helps explain how to overcome this: fast and slow thinking. Imagine you are conducting a job interview (you are the interviewer). A candidate walks in and she immediately reminds you of a person you met years ago and had difficulties with. During the job interview, you might not like her, even though she is a different person. Your brain went into fast thinking – input, output. This mode is built into our brains to protect us from danger, but it often drives bias. It helps us drive a car, do sports and the like: if you see an obstacle in your way while driving, you need to react fast, and there is no time to think it over. However, when making decisions, you need to remove bias and think slowly.

Slow thinking is challenging, and you need to fully overcome bias. If you let bias dominate you, you won’t be capable of making good decisions. Coming back to the interview example, you might reject the candidate because of your bias. A few months later, this person finds a job at your competitor and builds more advanced models than your company does. You lost a great candidate because of your bias. This isn’t good, right?

There are other aspects to ethics, and I could probably write an entire series about this. But you also need to consider topics such as harassment in algorithms. If your algorithms don’t take ethics into consideration, it isn’t just about acting wrongly. You will also lose credibility with your customers and thus start to see financial impact as well!

Legal

Last but not least, your data strategy should reflect all relevant legal frameworks. The right to be forgotten, for example, needs to be implemented in your systems, and in enterprise environments this isn’t easy at all: there is a lot of legacy, and many different systems consume the data. To tackle this from a technical perspective, it is necessary to harmonise your data models. Depending on your company’s ownership and structure, you need to implement GDPR and/or SOX. Some industries, such as finance, come with even more regulations, giving you additional challenges around data. It is very important to talk to your legal department and make them your friends at an early stage in order to succeed!

So what is next for AI Ethics?

I stick with the statement mentioned several times before: work closely with Legal and Privacy in order to achieve a responsible strategy towards data and AI. A lot of people I know claim that AI ethics blocks their data strategy, but I argue it is the other way around: just because you can do things with data, it doesn’t mean you are justified in doing all of what you potentially could. At the end of the day, you have customers who should trust you. Don’t misuse this trust, and build an ethical strategy on it. Work with the people who know it best – Privacy, Security and Legal. Then – and only then – will you succeed.

I also recommend you reading my post about data access.

Credits: the three pillars weren’t invented by me, so I want to credit the people who gave me the ideas around them: our corporate lawyer Daniel, our Privacy Officer Paul and our Legal Counsel Doris.

During the past tutorials, we have acquired a lot of knowledge about Spark. Now we arrive at the last tutorial on Spark, where we will have a look at cube and rollup. Both are useful for preparing multi-dimensional data for further processing.

Data for Spark Rollup and Cube functions

First, let’s create a dataset that we later want to work with. Our dataset is the monthly salary of people working in Finance or Sales:

employees = spark.createDataFrame([("Mario", 4400, "Sales"),
                                   ("Max", 3420, "Finance"),
                                   ("Sue", 5500, "Sales"),
                                   ("Tom", 6700, "Finance")],
                                  ("name", "salary", "department"))

We then use the first function – rollup. We want the rollup to be on the department and the name of the person.

employees.rollup(employees.department, employees.name)\
            .sum()\
            .withColumnRenamed("sum(salary)", "salary")\
            .orderBy("department", "salary")\
            .show()

Here you can see the output (I will discuss it after you reviewed it):

+----------+-----+------+
|department| name|salary|
+----------+-----+------+
|      null| null| 20020|
|   Finance|  Max|  3420|
|   Finance|  Tom|  6700|
|   Finance| null| 10120|
|     Sales|Mario|  4400|
|     Sales|  Sue|  5500|
|     Sales| null|  9900|
+----------+-----+------+

We now have several lines in the output. Let’s look at it line by line:

  • The first line consists of two null values and the sum of all salaries. This represents the entire company. Basically, it fills department and name with null, since it is neither one department nor a specific person – it is all departments and all persons in them.
  • The second and third lines are Max and Tom, who work in the finance department.
  • The fourth line is the sum of the finance department; here you see “null” in the name column, since it isn’t one name but the entire department.
  • The same pattern continues for the following lines with the sales department.

So, basically, we get three things: (A) the sum of all salaries, (B) the individual values and (C) the salaries per department. Now, let’s build the cube:

employees.cube(employees.department, employees.name)\
            .sum()\
            .withColumnRenamed("sum(salary)", "salary")\
            .orderBy("department", "salary")\
            .show()

Here, the results cover even more combinations. First, we have the values of each person without the department, then the overall total, and then again the departments and the individuals in them. The cube isn’t particularly relevant for this calculation. The background is that a cube creates all possible combinations, whereas the rollup only creates the hierarchy. The cube also treats null as a possible value in each combination; that’s why the individuals appear here several times. Here is the output:

+----------+-----+------+
|department| name|salary|
+----------+-----+------+
|      null|  Max|  3420|
|      null|Mario|  4400|
|      null|  Sue|  5500|
|      null|  Tom|  6700|
|      null| null| 20020|
|   Finance|  Max|  3420|
|   Finance|  Tom|  6700|
|   Finance| null| 10120|
|     Sales|Mario|  4400|
|     Sales|  Sue|  5500|
|     Sales| null|  9900|
+----------+-----+------+
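A small follow-up sketch: if you need to tell apart the subtotal rows that cube and rollup generate from genuine null values in your data, Spark’s grouping_id function marks which columns were aggregated away. This is an addition to the tutorial, using the same employees DataFrame as above:

from pyspark.sql import functions as F

employees.cube(employees.department, employees.name)\
         .agg(F.grouping_id().alias("gid"), F.sum("salary").alias("salary"))\
         .orderBy("gid", "department", "salary")\
         .show()
# gid = 0: regular department/name rows; gid > 0: subtotal rows created by the cube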

I hope you liked the tutorials on Spark. There is much more to learn – e.g. about machine learning or different libraries for that. Make sure to check out the tutorial section in order to figure that out.

If you enjoyed this tutorial on spark rollup and cube, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. Your learning journey can still continue. For full details about Apache Spark, make sure to visit the official page.

For data itself, there are a lot of different sources that are needed. Depending on the company and industry, they differ a lot. However, to create a comprehensive view of your company, it isn’t enough to only have your own data. There are several other data sources you should consider.

The three data sources

Data you already have

The first data source – data you already have – seems to be the easiest. However, it isn’t as easy as you might believe. Bringing your data in order is actually a very difficult task and can’t be achieved that easily. I’ve written several blog posts here about the challenges around data, and you can review them. Basically, all of them focus on your internal data sources. I won’t restate them in detail here, but it is mainly about data governance and access.

Data that you can acquire

The second data source – data you can acquire – is another important aspect. By acquire I basically mean everything that you don’t buy from an external data provider. You might use surveys (and pay for running them) or acquire the data from open data platforms. You might also collect data from social media or with other kinds of crawlers. This data source is very important for you, as it can give you a great overview and insights into your specific questions.

In the past, I’ve seen a lot of companies utilising this second source, and we did a lot in that area. For this kind of data, you don’t necessarily have to pay – some data sources are free. And if you pay for something, you don’t pay for the data itself but rather for the (semi-)manual way of collecting it. Here, too, it differs heavily from industry to industry and from what the company is all about. I’ve seen companies collecting data from news sites to get insights into their competition and mentions, or simply scanning social media. A lot is possible with this kind of data source, as the small sketch below shows.
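As a small illustration of acquiring open data programmatically, here is a hedged Python sketch that searches a CKAN-based open data catalog (the U.S. government catalog linked at the end of this post exposes such an API); the search term and result fields are illustrative assumptions.

import requests

# CKAN catalogs expose a standard search endpoint; catalog.data.gov is one example.
url = "https://catalog.data.gov/api/3/action/package_search"
response = requests.get(url, params={"q": "household income", "rows": 5}, timeout=30)
response.raise_for_status()

for dataset in response.json()["result"]["results"]:
    print(dataset["title"])
    for resource in dataset.get("resources", []):
        print("  ", resource.get("format"), resource.get("url"))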

Data you can buy

The last one – data you can buy – is easy to get but very expensive in cash-out terms. There are a lot of data providers selling different kinds of data, often demographic data or data about customers. Some platforms collect data from a large number of online sites and thus track individuals and their behaviour across different sites. Such platforms then sell this kind of data, together with additional insights, to marketing departments. Here, too, you can buy this kind of data from those platforms and thus enrich your own first-party and second-party data. Imagine you are operating a retail business selling all kinds of furniture.

You would probably not know much about your web shop visitors, since they remain anonymous until they buy something. With data bought from such data providers, it would now be possible for you to figure out whether an anonymous visitor is an outdoor enthusiast. You might adjust your offers to match his or her interests best. Or you might learn that the person visiting your shop recently bought a countryside house with a garden, and you might now adjust your offers to present garden furniture or barbecue accessories. With this kind of third-party data, you can achieve a lot and better understand your customers and your company.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. If you look for open data, I would recommend browsing some open data catalogs like the open data catalog from the U.S. government.

In the previous tutorial, we learned about data cleaning in Spark. Today, we will look at different options to work with columns and rows in Spark. First, we will start with renaming columns. We have done this several times so far, and it is a frequent task in data engineering. In the following sample, we will rename a column:

thirties = clean.select(clean.name, clean.age.between(30, 39)).withColumnRenamed("((age >= 30) AND (age <= 39))", "goodage")
thirties.show()

As you can see, we took the old name – which was very complicated – and renamed it to “goodage”. The output should be the following:

+-----+-------+
| name|goodage|
+-----+-------+
|  Max|  false|
|  Tom|   true|
|  Sue|  false|
|Mario|   true|
+-----+-------+

In the next sample, we want to filter columns on a string expression. This can be done with the “endswith” method applied to the column that should be filtered. In the following sample, we want to filter all contacts that are from Austria:

austrian = clean.filter(clean.lang.endswith("at"))
austrian.show()

As you can see, only one result is returned (as expected):

+---+-----+---+-----+
|nid| name|age| lang|
+---+-----+---+-----+
|  1|Mario| 35|DE-at|
+---+-----+---+-----+

Removing Null-Values in Spark

In our next sample, we want to filter all rows that contain null values in a specific column. This is useful to get a glimpse of null values in datasets. This can easily be done by applying the “isNull” function on a column:

nullvalues = dirtyset.filter(dirtyset.age.isNull())
nullvalues.show()

Here, we get the two results containing these null values:

+---+----+----+-----+
|nid|name| age| lang|
+---+----+----+-----+
|  4| Tom|null|AT-ch|
|  5| Tom|null|AT-ch|
+---+----+----+-----+

Another useful function in Spark is the “like” function. If you are familiar with SQL, it should be easy to apply. If not – basically, it scans text in a column for one or more specific literals. You can use different expressions to filter for patterns. The following one filters all people whose language starts with “DE”, independent of what follows afterwards (“%”):

langde = clean.filter(clean.lang.like("DE%"))
langde.show()

Here, we get all matching items:

+---+-----+---+-----+
|nid| name|age| lang|
+---+-----+---+-----+
|  2|  Max| 46|DE-de|
|  4|  Tom| 34|DE-ch|
|  1|Mario| 35|DE-at|
+---+-----+---+-----+

Shorten Strings in a Column in Spark

Several times, we want to shorten string values. The following sample takes the first 2 letters with the “substr” function on the column. We afterwards apply the “alias” function, which renames the resulting column (similar to the “withColumnRenamed” function above).

shortnames = clean.select(clean.name.substr(0,2).alias("sn")).collect()
shortnames

Also here, we get the expected output; please note that it isn’t unique anymore (names!):

[Row(sn='Ma'), Row(sn='To'), Row(sn='Su'), Row(sn='Ma')]

Spark offers much more functionality to manipulate columns, so just play with the API :). In the next tutorial, we will have a look at how to build cubes and rollups in Spark.

If you enjoyed this tutorial, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. Your learning journey can still continue. For full details about Apache Spark, make sure to visit the official page.