
Over the last months, I wrote several articles about data governance. One aspect of data governance is the principle of FAIR data. FAIR, in the context of data, stands for findable, accessible, interoperable and reusable. There are several scientific papers dealing with this topic. Let me explain what it is about.

What is FAIR data?

FAIR builds on the four principles stated at the beginning: findable, accessible, interoperable and reusable. This tackles most of the requirements around data governance and thus should increase the use of data. It doesn’t really deal with the aspect of data quality, but it does deal with the challenge of how to work with data. In my experience, most issues around data governance are very basic, and most companies don’t manage to solve them at this elementary level.

If a company gets started with the principle of FAIR, some elementary groundwork can be done and future quality improvements can be built on top of it. Plus, it is a good and easy starting point for data governance. Let me explain each of the principles in a bit more depth now.

Findable data

Most data projects start with the question of whether data exists for a specific use case. This is often difficult to answer, since data engineers or data scientists often don’t know what kind of data is available in a large enterprise. They know the problem that they want to solve but don’t know where the data is. They have to move from person to person and dig deep into the organisation until they find someone who knows about the data that could potentially serve their business need. This process can take weeks, and data scientists might get frustrated along the way.

A data catalog containing information about the data assets in an enterprise might solve these issues.

Accessible data

Once the first aspect is solved, it is necessary to access the data. This also brings a lot of complexity, since data is often sensitive and data owners simply don’t want to share access to it. Escalations often happen along the way. To solve these problems, it is necessary to have clear data owners defined for all data assets. It is also highly important to have a clear process for data access in place.

Interoperable data

Data often needs to be combined with other data sets in use cases. This means that it must be known what each data asset is about. It is necessary to have metadata available about the data and to share it with data consumers. Nothing is worse for data scientists than having to constantly ask data owners about the content of a data set. The better the description of a data set, the faster people can work with the data.

A frequent case is that data is bought from other companies or shared among companies; this is the concept of decentralised data hubs. In this context, it is highly important to have clearly defined metadata available.

Reusable data

Data should eventually be reusable for other business cases as well. Therefore, it must be documented how the data was created. A description of the source system and the producing entities needs to be available. It is also necessary to include information about potential transformations applied to the data.

In order to make data reusable, the terms of reusability must be provided. This can be a license or other community standards on the data. Data can be either purchased or made available for free. Different software solutions enable this.

What’s next on FAIR data?

I believe it is easy to get started with implementing the tools and processes needed for a FAIR data strategy. It will immediately reduce the time it takes to access data and provide a clear way forward. It will also indirectly increase data quality and enable future data quality initiatives.

My article was inspired by the discussions I had with Prof. Polleres. Thanks for the insights!

In the last tutorial about Hive, we had a look at how to insert data into Hive. Now that we have the data in Hive, we look at how to access it. Querying data is very easy and can be done in a few easy steps with the Hive SELECT statement. In its easiest form, it looks like the following:

The Hive select statement

SELECT fieldnames FROM tablename; 
  • fieldnames: Name of the fields to query – e.g. ID, firstname, lastname
  • tablename: Name of the table to query the fields from – e.g. students

Of course, there is much more than that. After the table name, you can specify a “WHERE” clause. This clause filters data on specific criteria, for example limiting the results to students that are younger than 18. The WHERE clause looks like this:

SELECT fieldnames FROM tablename WHERE wherestatement; 
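As an illustration of such a filter, the sketch below queries students younger than 18. Note that the age column is a hypothetical placeholder: the sample students table created earlier in this tutorial series stores a birthday string instead of a numeric age.

-- hypothetical: assumes the students table has an integer age column
SELECT firstname, lastname FROM university.students WHERE age < 18;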

Often, you want to order data in a certain way. This can be achieved by the “ORDER BY” statement. With this statement, you can order by the specified fields. A sample would be to sort them based on age. The statement is written like this:

SELECT fieldnames FROM tablename ORDER BY orderstatement;
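For instance, using the university.students table from the examples below, you could sort the results by last name and then first name; appending ASC or DESC controls the sort direction.

SELECT firstname, lastname FROM university.students ORDER BY lastname ASC, firstname ASC;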

Often, you only want to have a certain number of results returned. This can be done with the “LIMIT” statement. E.g. you only want to have the 10 most relevant items returned:

 SELECT fieldnames FROM tablename LIMIT number; 
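For example, the following returns at most ten rows from the sample students table:

SELECT * FROM university.students LIMIT 10;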

Another common case is to group results by specific fields. This is useful if you want to apply aggregate functions to each group:

SELECT fieldnames FROM tablename GROUP BY fieldnames;
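As a small illustration, the query below groups the sample students table by gender and counts the rows per group with the COUNT aggregate function:

SELECT gender, COUNT(*) AS student_count FROM university.students GROUP BY gender;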

Let’s now look at some code below:

SELECT * FROM university.students;
SELECT * FROM university.students WHERE gender = "female";
SELECT * FROM university.students WHERE gender = "female" ORDER BY lastname;
SELECT * FROM (
   SELECT lastname, gender FROM university.students) sq;

In the first query, we return all students from the table. The second query only returns female students. The third one additionally orders them by last name. The last query shows that queries can be nested as subqueries.

There are many more clauses and functions that can be applied, and they can be combined (e.g. WHERE together with ORDER BY).
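A combined query might look like the following sketch, which filters, sorts and limits the result in one statement:

SELECT firstname, lastname FROM university.students
WHERE gender = "female"
ORDER BY lastname
LIMIT 10;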

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

In the previous tutorial, we learnt how to create tables. Now it is about time to add some data to them. Therefore, we will look at how Hive inserts data into tables using the INSERT statement. This is straightforward:

Insert data into Hive tables

INSERT INTO TABLE name VALUES [values]
  • name: Name of the table to insert into. This can also be prefixed with database.tablename
  • values: The values to insert into the table. All values for the table must be provided; it is not possible to skip values (as you can in some other SQL systems)

Another possibility is to load data into tables from files. This is done with the following statement:

LOAD DATA INPATH path INTO TABLE name 
  • path: The path of the file to load from. Typically, with Hive, this would be a file on HDFS
  • name: Name of the table to insert into. This can also be prefixed with database.tablename
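A sketch of such a load might look like the following; the HDFS path and the students_raw table are hypothetical, and the file format needs to match the table’s storage format (e.g. a CSV file loaded into a plain-text table):

-- hypothetical path and table; the file must match the table's storage format
LOAD DATA INPATH '/user/hive/import/students.csv' INTO TABLE university.students_raw;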

It is also possible to insert data from a sub-query. This can be done with this statement:

INSERT INTO TABLE name [select statement]

The only difference from the first statement is that instead of the values, we provide a SELECT statement. The SELECT statement is described in a later tutorial.
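A minimal sketch, assuming a hypothetical students_backup table with the same columns as university.students already exists, could look like this:

-- hypothetical target table with the same columns as university.students
INSERT INTO TABLE university.students_backup SELECT * FROM university.students;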

Now, let’s use the sample from the last tutorial and insert some data into our databases.

For the students, we enter this code:

INSERT INTO TABLE university.students VALUES (1, "Mario", "Meir-Huber", "01/03/1984", "male"),
(2, "Max", "Musterman", "01/01/1988", "male"), (3, "Anna", "Studihard", "05/05/1989", "female"),
(4, "Sara", "Supersmart", "06/06/1990", "female");

For the classes, we enter the following:

INSERT INTO TABLE university.classes VALUES (1, "Business", "Accounting 1"), (2, "IT", "Software Development 1");

And for the enrolment, we enter the following:

INSERT INTO TABLE university.enrollment VALUES (1, 1), (2, 3), (1, 3), (1, 2), (1, 4), (2, 4);

Now we are all set and can start querying our data. This will happen in the next tutorial.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

In the last tutorial, we looked at how to create databases in Hive. This time, we look at how to create a table in Hive. The syntax to create a new table is as follows:

Create a table in Hive

CREATE TABLE [IF NOT EXISTS] [database.]table_name
  • IF NOT EXISTS: Before creation, checks whether the table already exists. If this option isn’t used and the table exists, an error will be displayed.
  • database: Name of the database in which the table should be created

Also sounds very easy, right? Sorry, but this time I have to disappoint you: creating tables has several more options, which I removed from the syntax above for better readability. The following additional options are available (a short example follows the list):

  • COLUMN NAMES: Provides columns and their data types for the table
  • COMMENT: Adds a comment to the table
  • PARTITIONED BY: Provides a partition key (or more) for the table. This is based on the column names
  • CLUSTERED BY: In addition to partitioning, tables can also be clustered into buckets.
  • STORED AS: Stores the table in a specific format – e.g. parquet.
  • LOCATION: Provides a user-specific location for the table
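To illustrate how these options fit together, here is a sketch of a table definition using COMMENT, PARTITIONED BY and STORED AS; the exam_results table and its examyear partition column are made up for this example:

CREATE TABLE IF NOT EXISTS university.exam_results
(studentid INT, mark INT)
COMMENT 'Hypothetical table illustrating the optional clauses'
PARTITIONED BY (examyear INT)
STORED AS PARQUET;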

Hive knows several data types. For numbers, they are:

  • Integers: tinyint, smallint, int, bigint
  • Floating-point and decimal: float, double (double precision is an alias for double), decimal (with configurable precision and scale)

Other basic datatypes are:

  • string, binary, timestamp, date, char, varchar

Non-primitive datatypes are:

  • array, map, struct, union

As already mentioned several times during this tutorial series, Hive basically stores everything on HDFS as files. One of the parameters you can add in “CREATE TABLE” is “STORED AS”. Hive supports several file formats, each with different benefits. You can start with a large text file, but for better performance, partitioned files in columnar formats are better. The possible file formats include Avro, Parquet, ORC, RCFile and JSONFile. The ideal file format should be selected based on the relevant use case.

Now, we were mainly focusing on how to create tables. However, there might also be the necessity to delete tables. This works with the following statement:

DROP TABLE [IF EXISTS] table_name 
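For example, dropping a hypothetical temporary table would look like this:

DROP TABLE IF EXISTS university.temp_results;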

Now, since we know everything we need to know for this, let’s play with Hive. Start your container again and launch the Data Analytics Studio UI. We now create several tables that mimic the structure of a university.

First, let’s start with students. Students have some properties like name and age.

CREATE TABLE IF NOT EXISTS university.students
(studentid INT, firstname STRING, lastname STRING, birthday STRING, gender STRING)
STORED AS PARQUET;

Next, we create a table for classes.

CREATE TABLE IF NOT EXISTS university.classes
(classid INT, studyname STRING, classname STRING)
STORED AS PARQUET;

Next, we need to create a cross-table that creates relations between students and classes.

CREATE TABLE IF NOT EXISTS university.enrollment
(classid INT, studentid INT)
STORED AS PARQUET;

Last but not least, each student should receive a mark when taking a class. Therefore, we create another cross-table between classid and studentid that also stores the mark.

CREATE TABLE IF NOT EXISTS university.marks
(classid INT, studentid INT, mark INT)
STORED AS PARQUET;

In Data Analytics Studio, this should look like the following:

HiveQL Sample

Now, we’ve finished the tables. In the next tutorial, we will insert some data into the tables.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

Actually, there are no “real” databases in Hive or Hadoop (unless you install HBase or similar). All data is stored in files. However, with HiveQL, you get the feeling that you are actually working with databases. Therefore, we start by creating “databases” as a first step with Hive.

Working with the Database in Hive

The syntax for creating databases is very easy:

CREATE DATABASE [IF NOT EXISTS] database_name [COMMENT] [LOCATION] [WITH DBPROPERTIES] 

The simplest form is “CREATE DATABASE db”; all other clauses are optional:

  • IF NOT EXISTS: The new database is only created if it doesn’t exist already. If you don’t use this option and the database already exists, an error would be displayed.
  • COMMENT: Provides a comment for a new database, in case this is needed for further explanation.
  • LOCATION: Specifies an HDFS path for the new database.
  • WITH DBPROPERTIES: Specifies some additional properties for the database.
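Putting the optional clauses together, a sketch of a full statement could look like the following; the location path and the property are just placeholders for this example:

CREATE DATABASE IF NOT EXISTS university
COMMENT 'Demo database for the Hive tutorials'
LOCATION '/user/hive/warehouse/university.db'
WITH DBPROPERTIES ('creator' = 'tutorial');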

Deleting a database is also very similar to this. You can do this with the following syntax:

DROP DATABASE [IF EXISTS] database_name [CASCADE or RESTRICT]

Also here, the statement “DROP DATABASE db” is the easiest one. All other options are optional:

  • IF EXISTS: Before deletion, checks whether the database actually exists. If this option isn’t used and the database doesn’t exist, an error will be displayed.
  • CASCADE: Deletes the database’s tables first. Without this option, dropping a database that still contains tables would produce an error
  • RESTRICT: Standard behavior for deletion. Would run into an error if tables exist in a database.
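For instance, the following statement would drop the database together with any tables it still contains; only run something like this if you really want to delete the data:

DROP DATABASE IF EXISTS university CASCADE;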

Easy, isn’t it? Now, let’s have some fun with Hive and create some databases. Start the container we created last time with Docker. Starting takes some time. Also make sure to start the hdp-proxy container. If you run into a bad gateway error (502), just wait some time and retry. After that, you should be able to access Ambari again. Scroll down to “Data Analytics Studio” and click on “Data Analytics Studio UI”. You are then redirected to a UI where you can write queries. The following image shows the welcome screen. Note: you might land on an error page, since it might want to redirect you to a wrong URL. Exchange the URL with “127.0.0.1:30800” and you should be fine.

Hortonworks Data Analytics Studio

First, let’s create a new database. We call our database “university”. Note that we will use this database also over the next tutorials. Simply click on “Compose Query”. You should now see the query editor. Enter the following code:

CREATE DATABASE IF NOT EXISTS university;

After clicking “Execute”, your database will be created. The following image shows this:

Data Analytics Studio Query Editor

We also add the “IF NOT EXISTS” clause in order not to run into errors if the database already exists. We can delete the database with the DROP statement:

DROP DATABASE IF EXISTS university;

Re-create the database in case you did the drop now. In the next tutorial, we will look into how to work with tables in Hive.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

We have learned about the basics of Hive in the previous tutorial. In the following tutorials, we will use the Hortonworks Sandbox to work with Hive. Hortonworks is one of the Hadoop distributions (next to Cloudera and MapR), and its Sandbox is a pre-configured environment, so there is no need for additional setup or installations. The Sandbox is delivered as different VMs or as a Docker container. We use the latter, as it is the easiest way (and you don’t need to install any VM tools). To get started, download the latest Docker environment for your computer/Mac: https://www.docker.com/get-started. Then, we can get started setting up the Hortonworks Sandbox with Docker.

Follow the installation routine throughout; it is easy and straightforward. Once done, download the Hortonworks image from https://hortonworks.com/downloads/#sandbox

As an install type, select “Docker” and make sure that you have the latest version. As of writing this article, the current version of HDP (Hortonworks Data Platform) is 3.0. Once you have finished the download, execute the deploy script (on Linux and Mac: docker-deploy-hdp30.sh). After that, the script pulls several Docker images. This takes some time, as they are several GB in size, so be patient!

The script also installs the HDP proxy tool, which might cause some errors. If you have whitespace in your directories, you need to edit the HDP proxy sh file (e.g. with vim) and wrap all paths in quotation marks. Then, everything should be fine.

The next step is to change the admin password in your image. To do this, you need to SSH into the machine with the following command:

docker exec -it sandbox-hdp /bin/bash

Execute the following command:

ambari-admin-password-reset

Now re-type the passwords and the services will restart. After that, you are ready to use HDP 3.0. To access your HDP instance, use your local IP (127.0.0.1) with port 8080. Now, you should see the Ambari login screen. Enter “admin” and your password (the one you reset in the step before). You are then redirected to your administration interface and should see the following screen:

HDP 3.0 with Ambari (services not yet started in the Hortonworks Sandbox with Docker)

You might see that most of your services are shown in red. In order to get them to work, you need to restart them. This takes some time again, so you need to be patient here. Once your services have turned green, we are ready to go. As you can see, setting up the Hortonworks Sandbox with Docker is really easy and straightforward.

Have fun exploring HDP – we will use it in the next tutorial, where we will look at how Hive abstracts Tables and Databases.

Header image: https://www.flickr.com/photos/karen_roe/32417107542

This is the kick-off to the Apache Hive Tutorial. Over the next weeks, I will post different tutorials on how to use Hive. Hive is a key component of Hadoop and we will today start with a general description of it.

What is Apache Hive?

Basically, what is Hive all about? Hive is a distributed query engine and language (called HiveQL) for Hadoop. Its main purpose is to enable a large number of people to work with data stored in Hadoop; this is why Facebook originally introduced Hive for its analysts. Below you can see the typical data flow in a Hive project.

Hive Data Flow

The above image shows how the workflow goes: first, a Hive client sends a request to the Hive Server. After that, the driver takes over and submits the job to the JobClient. Jobs are then executed on a Hadoop or Spark cluster. In our samples over the next tutorials, we will, however, use the Web UI from Hortonworks. But we will have a look at that later. First, let’s have a look at another component: HCatalog.

HCatalog is a service that makes it easy to use Hive. With this, files on HDFS are abstracted to look like databases and tables. HCatalog is therefore a metadata repository about the files on HDFS. Other tools on Hadoop or Spark take advantage of this and use HCatalog.

With traditional data warehouse or RDBMS systems, one works in databases, and SQL is the language used to access data from these systems. Hive provides HiveQL (which we will look at in more detail in the coming blog posts). HiveQL basically works on Hadoop files, such as plain text files, ORC or Parquet.

One key aspect of Hive is that it is mainly read-oriented. This means that you typically don’t update data, as everything you do in Hadoop is built for analytics. Hive still provides the possibility to update data, but this is done rather as an append (meaning that, in contrast to RDBMS systems, the original data isn’t altered).

Apache Hive Security

One key element of Hive is security. In all enterprise environments, it is very important to secure your tables against different kinds of access. Hive therefore supports different options:

  • Storage-based authorization: Hive doesn’t handle authorization itself; it is handled by the storage layer (ACLs in a cloud bucket/object store or HDFS ACLs)
  • Standards-based authorization via HiveServer2 over databases: Storage-based authorization is all or nothing for a table and thus not fine-grained enough. Hive can also use fine-grained authorization from databases to only show the columns/rows relevant to the user
  • Authorization via Ranger or Sentry: Apache projects that provide advanced authorization in Hadoop and abstract away the authorization concerns; they allow advanced rules and access to data

To work with Hive, you will typically use HiveQL. In the next tutorial, we will have a look on how to setup an environment where you can work with Hive.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

Header image: https://www.flickr.com/photos/karen_roe/32417107542