
To become data-driven, don't forget about the human

Data itself, and Data Science in particular, is one of the drivers of digitalisation. Many companies have experimented with Data Science over the past years and gained significant insights and learnings from it. Often, people already dealing with statistics were the first to do this magic thing called data science. Technical units, too, used machine learning and the like to further improve their businesses. However, for many other units within traditional companies, all of this seems like magic – and dangerous. So how do you include those not dealing with the topic in detail and thus de-mystify it? And what does it take to become data-driven?

How to become data-driven

First of all, Machine Learning and Data Science aren’t the revolution. Units started implementing them in order to gain new insights and improve their business results. Often, however, this capability is acquired via business projects from consulting companies. The newer and more complex a topic is, the higher the risk that people will object to it. The reasons for that are fear and misunderstanding – or no understanding at all.

When you are deep into the topic of data and data science, you might be treated with fame by some – mainly by those who think you are a magician. However, you will also be rejected by others. Both are poisonous in my opinion. The first group will try to get very close to you and expects a lot. However, you are often not capable of meeting their expectations, and after a while they get frustrated by far too high expectations.

In corporate environments, it is very important to identify this group at the very beginning. You need to clearly state what they can expect and what they can’t. It is also important to tell them what they won’t get – saying “No” is very important here as well. Being transparent with this group is essential in order to keep them as close supporters in a growing environment. You will depend a lot on these people if you want to succeed, so be clear with them.

People fear digitalisation

The other group – which in digitalisation, I would argue, is by far the larger one – is the group that will meet you with fears and doubts. It is highly important that you cover them well. You can easily recognise people in this group by their lack of openness towards your topics. Some may actively refuse your topic; others might be less active and just poison the climate. But be aware: they usually don’t do it because they hate you for some reason.

They are just acting human and are either afraid, feel they are not included, or have other doubts about you and your unit. It is essential to work on a communication strategy for this group and to pro-actively include them. Bringing clarity and de-mystifying your topic in easy terms is vital. It is important that you draw a lot of comparisons to your traditional business and keep it simple. Once you have gained their trust and interest, you can go much deeper into your topic and provide learning paths and skill development for these people.

If you succeed in that, you will have created strong supporters who come up with great ideas to improve your business even further. Keep in mind: just because you work on a “hot topic” like big data and data science and might be treated like a rock star by some, others are also great at what they do. It all boils down to this: we are just humans.

No digitalisation without a data strategy

Digitalisation needs trust to succeed. If you fail to deliver trust and don’t include the human aspect, your digitalisation and data strategy is doomed to fail – regardless of the budget and C-level support you might have for your initiative. So make sure to work on that, with high focus! Becoming data-driven is the driver for digitalisation in your company!

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. Another article I like about data-driven organisations can be found on Forbes.

Hive Tutorial 4: Create a table in Hive

In the last tutorial, we looked at how to create databases in Hive. This time, we look at how to create a table in Hive. The syntax to create a new table is as follows:

Create a table in Hive

CREATE TABLE [IF NOT EXISTS] [database.]table_name
  • IF NOT EXISTS: Prior to creation, checks if the table already exists. If this option isn’t used and the table exists, an error will be displayed.
  • database: Name of the database in which the table should be created

Sounds very easy as well, right? Sorry, but this time I have to disappoint you. Creating tables has several more options, which I removed from the syntax above for better readability. The additional options are listed below; a sketch that combines several of them follows the list:

  • COLUMN NAMES: Provides the columns and their data types for the table
  • COMMENT: Adds a comment to the table
  • PARTITIONED BY: Provides one or more partition keys for the table, based on the column names
  • CLUSTERED BY: In addition to partitioning, tables can also be clustered into buckets
  • STORED AS: Stores the table in a specific format, e.g. Parquet
  • LOCATION: Provides a user-specific HDFS location for the table
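
To make these options more tangible, here is a minimal sketch that combines several of them. The table, its columns and the HDFS path are made up for illustration:

CREATE TABLE IF NOT EXISTS university.lectures
(lectureid INT, title STRING)
COMMENT 'Lectures held at the university'    -- table comment
PARTITIONED BY (semester STRING)             -- partition key, not part of the column list
CLUSTERED BY (lectureid) INTO 4 BUCKETS      -- bucketing on top of partitioning
STORED AS PARQUET                            -- columnar storage format
LOCATION '/data/university/lectures';        -- user-specific HDFS path

Note that the partition column (semester) must not appear in the regular column list; Hive manages it separately.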

Hive knows several datatypes. For numbers, they are:

  • Integers: tinyint, smallint, int, bigint
  • Floating-point: float, double, double precision (an alias for double), decimal (with an optional precision and scale)

Other basic datatypes are:

  • string, binary, timestamp, date, char, varchar

Non-primitive datatypes are (a short usage sketch follows the list):

  • array, map, struct, uniontype
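
As a brief, hypothetical sketch of how these non-primitive types appear in a table definition (the table and its columns are made up):

CREATE TABLE IF NOT EXISTS university.contacts
(studentid INT,
 phonenumbers ARRAY<STRING>,    -- a list of values
 emails MAP<STRING, STRING>,    -- key/value pairs, e.g. 'private' -> address
 address STRUCT<street:STRING, city:STRING, zip:STRING>);    -- a nested record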

As already mentioned several times during this tutorial series, Hive basically stores everything on HDFS as files. One of the parameters you can add to “CREATE TABLE” is “STORED AS”. Hive supports several file formats, which have different benefits. You can start with a large text file, but for better performance, partitioned files in columnar formats are preferable. The possible file formats are: Avro, Parquet, ORC, RCFile and JSONFile. The ideal file format should be selected based on the relevant use case.
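
As an illustration, only the STORED AS clause changes between formats. The following two made-up tables differ solely in how their files are laid out on HDFS:

CREATE TABLE university.log_text (line STRING) STORED AS TEXTFILE;    -- row-oriented plain text
CREATE TABLE university.log_orc (line STRING) STORED AS ORC;          -- compressed, columnar

Plain text is easy to inspect, while columnar formats like ORC or Parquet are usually faster for analytical scans.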

So far, we have mainly focused on how to create tables. However, there might also be the need to delete tables. This works with the following statement:

DROP TABLE [IF EXISTS] table_name 
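
As a usage sketch (the table name is only an example), the IF EXISTS option again avoids an error if the table is already gone:

DROP TABLE IF EXISTS university.students;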

Now that we know everything we need for this, let’s play with Hive. Start your container again and launch the Data Analytics Studio UI. We will now create several tables that mimic the structure of a university.

First, let’s start with students. Students have some properties like name and birthday.

CREATE TABLE IF NOT EXISTS university.students
(studentid INT, firstname STRING, lastname STRING, birthday STRING, gender STRING)
STORED AS PARQUET;

Next, we create a table for classes.

CREATE TABLE IF NOT EXISTS university.classes
(classid INT, studyname STRING, classname STRING)
STORED AS PARQUET;

Next, we need to create a cross-table that creates relations between students and classes.

CREATE TABLE IF NOT EXISTS university.enrollment
(classid INT, studentid INT)
STORED AS PARQUET;

Last but not least, each student should receive a mark when taking a class. Therefore, we create another cross-table between classid and studentid that also holds the mark.

CREATE TABLE IF NOT EXISTS university.marks
(classid INT, studentid INT, mark INT)
STORED AS PARQUET;

In Data Analytics Studio, this should look like the following:

[Image: HiveQL sample in Data Analytics Studio]
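
To verify that everything worked, you can list the tables of the database and inspect the columns of a single table with two standard HiveQL commands:

SHOW TABLES IN university;       -- lists all tables in the database
DESCRIBE university.students;    -- shows the columns and their datatypes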

Now we’ve finished the tables. In the next tutorial, we will insert some data into them.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

Agile Data Science: Kanban or Scrum in data projects?

Agility is everywhere in the enterprise nowadays. Most companies want to become more agile, and at C-level, too, there are huge expectations regarding agility. However, I’ve seen many analytics (and Big Data) projects that were the complete opposite: neither agile nor successful. The reasons for this varied: the setup of the data lake with expensive hardware took years, not months, and operating and maintaining these systems turned out to be very inefficient. So what can be done for agile data science projects?

The demand for agile data science projects

A lot of companies have also expressed their demand for agile analytics. But in fact, with analytics (and big data), we moved away from agility towards a complex, waterfall-like approach. What was even worse is the approach of committing to agile analytics and then not sticking to it (doing something in between instead).

However, a lot of companies have realised that agility can only be achieved with (Biz)DevOps and the Cloud – there is hardly any way around this – and with close cooperation between data engineering and data science. One important question for agile data science projects is the methodology: is it Kanban or Scrum?

I would say that this question is a “luxury” problem. If a company has to answer it, it is already at a very high maturity level with data. My thoughts on this topic (which, again, is an “it depends” thing) are:

When to select Kanban or Scrum for Data projects

  • Complexity: If the data project is more complex, Scrum might be the better choice. A lot of data science projects are one-person projects (with support from data engineers and DevOps at some stages), run for just a few weeks and are not always full-time. In this case (lower complexity), Kanban is the more suitable approach. Often, the data scientist even works on several projects in parallel, as the load per project isn’t high at all. For other projects with higher complexity, I would recommend Scrum.
  • Integration/Productisation: If the integration effort is high (e.g. into existing processes, systems and the like), I would rather recommend going with Scrum. More people are involved and the complexity is immediately higher. If the focus is on data engineering, or at least that part is substantial, it is often delivered with Scrum.

I guess there could be much more indicators, so I am looking forward to your comments on it 🙂

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. You might also read this discussion about Scrum for Data Scientists.

Hive Tutorial 3: Working with the Database in Hive

Actually, there are no “real” databases in Hive or Hadoop (unless you install HBase or the like). All data is stored in files. However, with HiveQL, you get the feeling that you are actually working with databases. Therefore, we start by creating “databases” as a first step with Hive.

Working with the Database in Hive

The syntax for creating databases is very easy:

CREATE DATABASE [IF NOT EXISTS] database_name [COMMENT] [LOCATION] [WITH DBPROPERTIES] 

The easiest way to write it is “CREATE DATABASE db”. All other options are optional (a combined sketch follows the list):

  • IF NOT EXISTS: The new database is only created if it doesn’t exist already. If you don’t use this option and the database already exists, an error will be displayed.
  • COMMENT: Provides a comment for the new database, in case further explanation is needed.
  • LOCATION: Specifies an HDFS path for the new database.
  • WITH DBPROPERTIES: Specifies additional properties for the database.
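
A minimal sketch combining these options could look like the following; the comment, path and properties are made up for illustration:

CREATE DATABASE IF NOT EXISTS university
COMMENT 'Sample database for the Hive tutorials'    -- database comment
LOCATION '/data/university'                         -- HDFS path for the database
WITH DBPROPERTIES ('creator' = 'tutorial');         -- free-form key/value properties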

Deleting a database is also very similar to this. You can do this with the following syntax:

DROP DATABASE [IF EXISTS] database_name [CASCADE | RESTRICT]

Here too, the statement “DROP DATABASE db” is the easiest one. All other options are optional (an example follows the list):

  • IF EXISTS: Prior to deletion, checks if the database actually exists. If this option isn’t used and the database doesn’t exist, an error will be displayed.
  • CASCADE: Deletes the tables in the database first. Otherwise, an error would be produced if a database marked for deletion still contains tables.
  • RESTRICT: Standard behaviour for deletion. Runs into an error if tables still exist in the database.
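
For example, to delete a database together with any tables it might still contain (using the university database from this tutorial as an example):

DROP DATABASE IF EXISTS university CASCADE;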

Easy, isn’t it? Now, let’s have some fun with Hive and create some databases. Start the container we created last time with Docker; starting takes some time. Also make sure to start the hdp-proxy container. If you run into a bad gateway error (502), just wait a while and re-try. After that, you should be able to access Ambari again. Scroll down to “Data Analytics Studio” and click on “Data Analytics Studio UI”. You are then re-directed to a UI where you can write queries. The following image shows the welcome screen. Note: you might get an error page, since it might want to re-direct you to a wrong URL. Exchange the URL with “127.0.0.1:30800” and you should be fine.

[Image: Hortonworks Data Analytics Studio]

First, let’s create a new database. We will call our database “university”. Note that we will also use this database over the next tutorials. Simply click on “Compose Query” and you should see the query editor. Enter the following code:

CREATE DATABASE IF NOT EXISTS university;

After clicking “Execute”, your database will be created. The following image shows this:

[Image: Data Analytics Studio query editor]

We also add the “IF NOT EXISTS” clause in order not to run into errors. We can delete the database with the DROP statement:

DROP DATABASE IF EXISTS university;
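
To check whether the database currently exists, you can list all databases, optionally filtered by a name pattern:

SHOW DATABASES;                 -- lists all databases
SHOW DATABASES LIKE 'univ*';    -- filters by a name pattern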

Re-create the database in case you just dropped it. In the next tutorial, we will look at how to work with tables in Hive.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.