
Honestly, data scientists are doing a great job. By all accounts, they are saving entire industries from a steep decline. And these heroes are doing all of that alone. Alone? Not quite.

The Data Scientist needs the Data Engineer

There are some poor guys who support their success: the Data Engineers. The vast majority of the work is carried out by these guys (and girls) that hardly anyone talks about. All the fame seems to go to the data scientists, while the data engineers don't receive any credit.

I remember one of the many meetings I had with C-level executives. When I explained the structure of a team dealing with data, everyone in the boardroom agreed: “We need data scientists.” Then one of the executives asked: “But what are these data engineers about? Do we really need them, or could we have more data scientists instead?”

I kept on explaining, and they accepted it. But I had the feeling that they still wanted to end up with more data scientists than engineers. This comes from the hype we see around data scientists. Everyone knows that they are important. But data-driven projects only succeed when a team with mixed skills and know-how comes together.

A Data Science team needs at least the same number of Data Engineers

None of the data-driven projects I have seen so far would have worked without data engineers. They are relevant for many different things but, in an ideal world, they mainly work in close cooperation with data scientists. If a company's data maturity is high, the data engineer prepares the data for the data scientist and then works with the data scientist again on putting the algorithm into production. I have seen a lot of projects where the latter did not work: the first step (data preparation) was successful, but the later step (automation) was never done.

But there are more roles involved: one role, which is rather a specialization of the data engineer, is the data system engineer. This is rarely a dedicated role and is usually carried out by data engineers. Here, we are talking about preparing and setting up the infrastructure for the data scientists or engineers. Another role is the data architect, who ensures a company-wide approach to data, and then there are of course the data owners and data stewards.

I have stated it several times, but it is worth stating over and over again: data science isn't a one-(wo)man show, it is ALWAYS a team effort.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. Another interesting article about the data science team setup can be found here.

In the last tutorial about Hive, we had a look at how to insert data into Hive. Now that we have the data in Hive, we look at how to access it. Querying data takes only a few simple steps with the Hive SELECT statement. In its simplest form, it looks like the following:

The Hive select statement

SELECT fieldnames FROM tablename; 
  • fieldnames: Name of the fields to query – e.g. ID, firstname, lastname
  • tablename: Name of the table to query the fields from – e.g. students

Of course, there is much more than that. After the tablename, you can specify a “WHERE” clause. This clause filters the data on specific criteria. An example would be to limit the results to students who are younger than 18. The following shows the WHERE clause:

SELECT fieldnames FROM tablename WHERE wherestatement; 
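
To try the WHERE clause without a Hive cluster, here is a minimal sketch in Python that runs the same kind of query against an in-memory SQLite database. SQLite is only a stand-in for Hive, and the age column and the four rows are made up for illustration. For a simple filter like this, the SELECT statement should read the same in HiveQL, although that is not verified here.

```python
import sqlite3

# SQLite stands in for Hive so that the query can run locally (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical table: the "age" column and all rows are invented for this sketch.
conn.execute("CREATE TABLE students (id INTEGER, firstname TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Ann", 17), (2, "Ben", 19), (3, "Cleo", 16), (4, "Dan", 18)],
)

# WHERE keeps only the rows for which the condition is true.
minors = conn.execute(
    "SELECT firstname, age FROM students WHERE age < 18"
).fetchall()

for firstname, age in minors:
    print(firstname, age)
```

Conditions can also be combined with AND and OR, for example `WHERE age < 18 AND firstname = 'Ann'`.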

Often, you want to order data in a certain way. This can be achieved with the “ORDER BY” clause, which sorts the results by the specified fields. An example would be to sort the students by age. The statement is written like this:

SELECT fieldnames FROM tablename ORDER BY orderstatement;
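
As a rough illustration of ORDER BY with more than one sort field, the sketch below uses the student rows from the insert tutorial in an in-memory SQLite database. SQLite is an assumption made so the example runs locally; the table is created without the university database prefix and without the birthday field to keep it short.

```python
import sqlite3

# SQLite stands in for Hive so that the query can run locally (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER, firstname TEXT, lastname TEXT, gender TEXT)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [
        (1, "Mario", "Meir-Huber", "male"),
        (2, "Max", "Musterman", "male"),
        (3, "Anna", "Studihard", "female"),
        (4, "Sara", "Supersmart", "female"),
    ],
)

# Sort by gender ascending first, then by last name descending within each gender.
ordered = conn.execute(
    "SELECT gender, lastname FROM students ORDER BY gender ASC, lastname DESC"
).fetchall()

for gender, lastname in ordered:
    print(gender, lastname)
```

ASC is the default direction, so it is only spelled out here to contrast it with DESC.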

Often, you only want a certain number of results returned. This can be done with the “LIMIT” clause. For example, together with ORDER BY, you can return only the 10 most relevant items:

SELECT fieldnames FROM tablename LIMIT number;
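
LIMIT on its own only cuts off the result; which rows come first depends on the ordering. The following sketch, again using SQLite as a local stand-in for Hive, returns the 10 most relevant rows of a hypothetical items table. The table, its relevance column and the generated values are assumptions made purely for illustration.

```python
import sqlite3

# SQLite stands in for Hive so that the query can run locally (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical table with 25 items; the relevance score is an arbitrary
# permutation of 0..24 so that every item has a distinct value.
conn.execute("CREATE TABLE items (id INTEGER, relevance INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, (i * 7) % 25) for i in range(1, 26)],
)

# ORDER BY decides which rows come first, LIMIT keeps only the first 10 of them.
top_items = conn.execute(
    "SELECT id, relevance FROM items ORDER BY relevance DESC LIMIT 10"
).fetchall()

for item_id, relevance in top_items:
    print(item_id, relevance)
```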

Another common case is to group results by specific fields. This is useful if you want to apply aggregate functions such as COUNT or SUM to each group:

SELECT fieldnames FROM tablename GROUP BY fieldnames;
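
A small sketch of grouping with an aggregate function: it counts how many students are enrolled in each class, using the enrollment values from the insert tutorial. SQLite is used as a local stand-in for Hive, and the column names classid and studentid are assumptions, since the tutorials shown here do not name the columns of that table.

```python
import sqlite3

# SQLite stands in for Hive so that the query can run locally (assumption).
conn = sqlite3.connect(":memory:")

# Column names are hypothetical; the value pairs are the ones from the insert tutorial.
conn.execute("CREATE TABLE enrollment (classid INTEGER, studentid INTEGER)")
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?)",
    [(1, 1), (2, 3), (1, 3), (1, 2), (1, 4), (2, 4)],
)

# GROUP BY builds one group per class, COUNT(*) counts the rows in each group.
per_class = conn.execute(
    "SELECT classid, COUNT(*) AS enrolled "
    "FROM enrollment GROUP BY classid ORDER BY classid"
).fetchall()

for classid, enrolled in per_class:
    print(classid, enrolled)
```

Every field in the SELECT list that is not wrapped in an aggregate function has to appear in the GROUP BY list.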

Let’s now look at some code below:

SELECT * FROM university.students;
SELECT * FROM university.students WHERE gender = "female";
SELECT * FROM university.students WHERE gender = "female" ORDER BY lastname;
SELECT * FROM (
   SELECT lastname, gender FROM university.students) sq;

In the first query, we return all students from the table. The second query only returns female students. The third one additionally orders them by last name. The last query shows that a query can select from the result of another query (a subquery).

There are many more functions and clauses that can be applied, and they can be combined (e.g. WHERE and ORDER BY).
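
When clauses are combined, they are written in a fixed order: WHERE first, then GROUP BY, then ORDER BY, and LIMIT last. The sketch below combines WHERE, ORDER BY and LIMIT on the student rows from the insert tutorial. As before, SQLite is only an assumed local stand-in for Hive, and the table is simplified (no database prefix, no birthday field).

```python
import sqlite3

# SQLite stands in for Hive so that the query can run locally (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER, firstname TEXT, lastname TEXT, gender TEXT)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [
        (1, "Mario", "Meir-Huber", "male"),
        (2, "Max", "Musterman", "male"),
        (3, "Anna", "Studihard", "female"),
        (4, "Sara", "Supersmart", "female"),
    ],
)

# Filter first, then sort the remaining rows, then keep only the first one.
first_female = conn.execute(
    "SELECT firstname, lastname FROM students "
    "WHERE gender = 'female' ORDER BY lastname LIMIT 1"
).fetchall()

print(first_female)
```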

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

A current trend in AI is not so much a technical one; it is rather a societal one. The technologies around AI in Machine Learning and Deep Learning are getting more and more complex. This makes it ever harder for humans to understand what is happening and why a given prediction is made. The current approach of “throwing data in, getting a prediction out” does not necessarily help with that. It is somewhat dangerous to build knowledge and make decisions based on algorithms that we don't understand. To solve this problem, we need explainable AI.

What is explainable AI?

Explainable AI is becoming even more important with new developments in the AI space such as AutoML. With AutoML, the system takes over most of the data scientist's work. We need to ensure that everyone understands what is going on with the algorithms and why a prediction comes out exactly the way it does. So far (and without AutoML), data scientists were basically in charge of the algorithms, so at least there was someone who could explain an algorithm. NOTE: that didn't protect us from bias, nor will AutoML. With AutoML, when tuning and algorithm selection are done more or less automatically, we need to make sure that meaningful documentation of the predictions is available.

And one last note: this isn't an argument against AutoML and the tools that provide it. I believe that the democratisation of AI is an absolute must and a good thing. However, we need to ensure that it stays explainable!

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. A comprehensive article about explainable AI can also be found on Wikipedia.

In the previous tutorial, we learnt how to create tables. Now it is about time to add some data to our tables. Therefore, we will look at how to insert data into Hive tables using the INSERT statement. This is straightforward:

Hive insert data into tables

INSERT INTO TABLE name VALUES [values]
  • name: Name of the table to insert into. This can also be pre-fixed with database.tablename
  • values: The values to insert into the table. All values for the table must be provided; it is not possible to skip values (as is possible in some other SQL systems)

Another possibility is to load data from files. This is done with the following statement:

LOAD DATA INPATH path INTO TABLE name 
  • path: the path of the file to load from, given as a quoted string. Typically, with Hive, this would be a file on HDFS
  • name: Name of the table to insert into. This can also be pre-fixed with database.tablename

It is also possible to insert data from a sub-query. This can be done with this statement:

INSERT INTO TABLE name [select statement]

The only difference to the first statement is that instead of the “values”, we create a select statement. The select statement is described in a later tutorial.
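
To make the sub-query variant more concrete, here is a hedged sketch that copies all female students into a second table. It runs on an in-memory SQLite database as a local stand-in for Hive. The target table female_students is hypothetical, and SQLite does not accept the TABLE keyword after INSERT INTO, so the statement differs slightly from the Hive form shown above.

```python
import sqlite3

# SQLite stands in for Hive so that the statements can run locally (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER, firstname TEXT, lastname TEXT, gender TEXT)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [
        (1, "Mario", "Meir-Huber", "male"),
        (2, "Max", "Musterman", "male"),
        (3, "Anna", "Studihard", "female"),
        (4, "Sara", "Supersmart", "female"),
    ],
)

# Hypothetical target table, invented for this sketch.
conn.execute("CREATE TABLE female_students (id INTEGER, lastname TEXT)")

# The SELECT takes the place of the VALUES list.
# In Hive, this would be written as INSERT INTO TABLE female_students SELECT ...
conn.execute(
    "INSERT INTO female_students "
    "SELECT id, lastname FROM students WHERE gender = 'female'"
)

copied = conn.execute(
    "SELECT id, lastname FROM female_students ORDER BY id"
).fetchall()
print(copied)
```

The SELECT has to deliver the same number of fields, in the same order, as the target table defines.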

Now, let’s use the sample from the last tutorial and insert some data into our tables.

For the students, we enter this code:

INSERT INTO TABLE university.students VALUES (1, "Mario", "Meir-Huber", "01/03/1984", "male"),
(2, "Max", "Musterman", "01/01/1988", "male"), (3, "Anna", "Studihard", "05/05/1989", "female"),
(4, "Sara", "Supersmart", "06/06/1990", "female");

For the classes, we enter the following:

INSERT INTO TABLE university.classes VALUES (1, "Business", "Accounting 1"), (2, "IT", "Software Development 1");

And for the enrolment, we enter the following:

INSERT INTO TABLE university.enrollment VALUES (1, 1), (2, 3), (1, 3), (1, 2), (1, 4), (2, 4);

Now we are all set and can start querying our data. This will happen in the next tutorial.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.