
In the last tutorial, we looked at how to create databases in Hive. This time, we look at how to create a table in Hive. The syntax to create a new table is as follows:

Create a table in Hive

CREATE TABLE [IF NOT EXISTS] [database.]table_name 
  • IF NOT EXISTS: Before creation, checks whether the table already exists. If this option isn’t used and the table exists, an error will be displayed.
  • database: Name of the database in which the table should be created

Also sounds very easy, right? Sorry, but this time I have to disappoint you. Creating tables has several more options, which I removed from the syntax above for better readability. The additional options are listed below, followed by a combined example:

  • COLUMN NAMES: Provides the columns and their data types for the table
  • COMMENT: Adds a comment to the table
  • PARTITIONED BY: Provides one or more partition keys for the table, based on the column names
  • CLUSTERED BY: In addition to partitioning, tables can also be clustered into buckets.
  • STORED AS: Stores the table in a specific format – e.g. Parquet.
  • LOCATION: Provides a user-specific location for the table
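
To see how these clauses fit together, here is a small sketch of a fuller statement. The table, columns, bucket count, and location are invented for this example:

-- a partitioned, bucketed table stored as Parquet at a custom HDFS location
CREATE TABLE IF NOT EXISTS university.lectures
(lectureid INT, title STRING)
COMMENT 'All lectures held at the university'
PARTITIONED BY (semester STRING)
CLUSTERED BY (lectureid) INTO 4 BUCKETS
STORED AS PARQUET
LOCATION '/data/university/lectures';

Note that the partition column (semester) is not part of the regular column list; Hive derives it from the PARTITIONED BY clause.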

Hive knows several datatypes. For numbers, they are:

  • Integers: tinyint, smallint, int, bigint
  • Floating-point: float, double, double precision (an alias for double), decimal, decimal(precision, scale)

Other basic datatypes are:

  • string, binary, timestamp, date, char, varchar

Non-primitive datatypes are:

  • array, map, struct, uniontype (a small example follows below)
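
As a small, hedged illustration of the non-primitive types (the table and columns are invented for this sketch):

CREATE TABLE IF NOT EXISTS university.student_profiles
(studentid INT,
 hobbies ARRAY<STRING>,                        -- ordered list of values
 marks MAP<STRING, INT>,                       -- key/value pairs
 address STRUCT<street:STRING, city:STRING>);  -- nested record with named fields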

As already mentioned several times during this tutorial series, Hive basically stores everything on HDFS as files. One of the parameters you can add in “CREATE TABLE” is “STORED AS”. HDFS knows several file formats that have different benefits. You can start with a large text file, but for better performance, partitioned files in columnar formats are better. The possible file formats include: Avro, Parquet, ORC, RCFile, JSONFile. The ideal file format should be selected based on the relevant use case.

So far, we have mainly focused on how to create tables. However, you might also need to delete tables. This works with the following statement:

DROP TABLE [IF EXISTS] table_name 
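For example, a no-longer-needed table could be removed like this (the table name is invented for illustration):

DROP TABLE IF EXISTS university.old_results;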

Now, since we know everything we need to know for this, let’s play with Hive. Start your container again and launch the Data Analytics Studio UI. We now create several tables that mimic the structure of a university.

First, let’s start with students. Students have some properties like name, birthday, and gender.

CREATE TABLE IF NOT EXISTS university.students
(studentid INT, firstname STRING, lastname STRING, birthday STRING, gender STRING)
STORED AS PARQUET;

Next, we create a table for classes.

CREATE TABLE IF NOT EXISTS university.classes
(classid INT, studyname STRING, classname STRING)
STORED AS PARQUET;

Next, we need to create a cross-table that creates relations between students and classes.

CREATE TABLE IF NOT EXISTS university.enrollment
(classid INT, studentid INT)
STORED AS PARQUET;

Last, but not least, each student should receive a mark when taking a class. Therefore, we create another cross-table that holds the classid, the studentid, and the mark.

CREATE TABLE IF NOT EXISTS university.marks
(classid INT, studentid INT, mark INT)
STORED AS PARQUET;

In Data Analytics Studio, this should look like the following:

HiveQL Sample
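
If you want to double-check the result, you can list the tables of the database; the statement below should return the four tables we just created:

SHOW TABLES IN university;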

Now, we’ve finished the tables. In the next tutorial, we will insert some data into the tables.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

Actually, there are no “real” databases in Hive or Hadoop (unless you install HBase or similar). All data is stored in files. However, with HiveQL, you get the feeling that you are actually working with databases. Therefore, we start with Hive by creating “databases”.

Working with Databases in Hive

The syntax for creating databases is very easy:

CREATE DATABASE [IF NOT EXISTS] database_name [COMMENT] [LOCATION] [WITH DBPROPERTIES] 

The easiest form is “CREATE DATABASE db”. All other clauses are optional (a combined example follows the list):

  • IF NOT EXISTS: The new database is only created if it doesn’t exist already. If you don’t use this option and the database already exists, an error will be displayed.
  • COMMENT: Provides a comment for the new database, in case this is needed for further explanation.
  • LOCATION: Specifies an HDFS path for the new database.
  • WITH DBPROPERTIES: Specifies additional properties for the database.
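
Putting the optional clauses together, a statement could look like the following sketch (the comment, path, and property are invented for illustration):

-- create a database with a comment, a custom HDFS location, and a property
CREATE DATABASE IF NOT EXISTS university
COMMENT 'Database for the university tutorial series'
LOCATION '/data/university'
WITH DBPROPERTIES ('creator' = 'tutorial');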

Deleting a database is very similar. You can do this with the following syntax:

DROP DATABASE [IF EXISTS] database_name [CASCADE or RESTRICT]

Here, too, the statement “DROP DATABASE db” is the easiest form. All other options are optional (see the example after the list):

  • IF EXISTS: Before deletion, checks whether the database actually exists. If this option isn’t used and the database doesn’t exist, an error will be displayed.
  • CASCADE: Deletes the tables in the database first. Otherwise, an error would be produced if a database marked for deletion still contains tables.
  • RESTRICT: Standard behavior for deletion. Runs into an error if tables exist in the database.
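
For example, a database that still contains tables can be removed in one go like this; be careful, as this also deletes all tables in it:

DROP DATABASE IF EXISTS university CASCADE;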

Easy, isn’t it? Now, let’s have some fun with Hive and create some databases. Start the container we created last time with Docker. Starting takes some time. Also make sure to start the hdp-proxy container. If you run into a bad gateway error (502), just wait some time and retry. After that, you should be able to access Ambari again. Scroll down to “Data Analytics Studio” and click on “Data Analytics Studio UI”. You are then redirected to a UI where you can write queries. The following image shows the welcome screen. Note: you might get an error page, since it might want to redirect you to a wrong URL. Exchange the URL with “127.0.0.1:30800” and you should be fine.

Hortonworks Data Analytics Studio

First, let’s create a new database. We call our database “university”. Note that we will also use this database over the next tutorials. Simply click on “Compose Query”. You should now see the query editor. Enter the following code:

CREATE DATABASE IF NOT EXISTS university;

After clicking “Execute”, your database will be created. The following image shows this:

Data Analytics Studio Query Editor

We also add the “IF NOT EXISTS” clause in order not to run into errors. We can delete the database with the DROP statement:

DROP DATABASE IF EXISTS university;

Re-create the database in case you just dropped it. In the next tutorial, we will look into how to work with tables in Hive.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.


This is the kick-off to the Apache Hive Tutorial. Over the next weeks, I will post different tutorials on how to use Hive. Hive is a key component of Hadoop, and today we will start with a general description of it.

What is Apache Hive?

Basically, what is Hive all about? Hive is a distributed query engine and language (called HiveQL) for Hadoop. Its main purpose is to enable a large number of people to work with data stored in Hadoop. For that reason, Facebook originally introduced Hive for its analysts. Below you can see the typical dataflow in a Hive project.

Hive Data Flow

The above image shows the workflow: first, a Hive client sends a request to the Hive Server. After that, the driver takes over and submits the job to the JobClient. Jobs are then executed on a Hadoop or Spark cluster. In our samples over the next tutorials, however, we will use the Web UI from Hortonworks. But we will have a look at that later. First, let’s look at another component: HCatalog.

HCatalog is a service that makes it easy to use Hive. With this, files on HDFS are abstracted to look like databases and tables. HCatalog is therefore a metadata repository about the files on HDFS. Other tools on Hadoop or Spark take advantage of this and use HCatalog.

With traditional data warehouse or RDBMS systems, one worked in databases, and SQL was the language used to access data from these systems. Hive provides HiveQL (which we will look at in more detail in the coming blog posts). HiveQL basically works on Hadoop files, such as plain text files, ORC, or Parquet.
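
To give a first impression of what HiveQL looks like, here is a sketch of a simple query; the table and columns are hypothetical, and it reads just like familiar SQL:

-- counts students per last name
SELECT lastname, COUNT(*) AS cnt
FROM students
GROUP BY lastname;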

One key aspect of Hive is that it is mainly read-oriented. This means that you don’t update data, as everything you do in Hadoop is built for analytics. Hive still provides the possibility to update data, but this is rather done as an append (meaning that, in contrast to RDBMS systems, the original data isn’t altered).
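
As a small, hedged sketch of this append-oriented style (INSERT … VALUES is supported since Hive 0.14; the table and values are hypothetical):

-- appends a new row; existing rows are not rewritten in place
INSERT INTO TABLE students VALUES (1, 'Jane', 'Doe');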

Apache Hive Security

One key element of Hive is security. In enterprise environments, it is very important to secure your tables against different kinds of access. Hive therefore supports different options:

  • Storage-based authorization: Hive doesn’t care about the authorization. Auth is handled via the storage layer (ACLs in a cloud bucket/object store or HDFS ACLs).
  • Standard-based authorization via HiveServer2, as in databases: storage-based authorization is all or nothing per table – not fine-grained enough. Hive can also work with fine-grained auth, as known from databases, to only show the columns/rows relevant to the user (see the sketch after this list).
  • Authorization via Ranger or Sentry: Apache projects that do advanced authorization in Hadoop and abstract away the authorization issues, allowing advanced rules and access control for data.
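
As a hedged sketch of the standard-based option: with SQL-standard authorization enabled on HiveServer2, access can be granted per table, much like in a classical database (the table and user name are invented for illustration):

-- allows the user analyst1 to read the table, but not to change it
GRANT SELECT ON TABLE students TO USER analyst1;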

To work with Hive, you will typically use HiveQL. In the next tutorial, we will have a look on how to setup an environment where you can work with Hive.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

Header image: https://www.flickr.com/photos/karen_roe/32417107542