
In the last tutorial about Hive, we had a look at how to insert data into Hive. Now that we have the data in Hive, we look at how to access it. Querying data is very easy and can be done in a few simple steps with the Hive SELECT statement. In its easiest form, it looks like the following:

The Hive SELECT statement

SELECT fieldnames FROM tablename; 
  • fieldnames: Name of the fields to query – e.g. ID, firstname, lastname
  • tablename: Name of the table to query the fields from – e.g. students

Of course, there is much more to it than that. After the table name, you can specify a “WHERE” statement, which filters data on specific criteria. A sample would be to limit the results to students that are younger than 18. The WHERE statement is written like this:

SELECT fieldnames FROM tablename WHERE wherestatement; 

Often, you want to order data in a certain way. This can be achieved by the “ORDER BY” statement. With this statement, you can order by the specified fields. A sample would be to sort them based on age. The statement is written like this:

SELECT fieldnames FROM tablename ORDER BY orderstatement;

Often, you only want to have a certain number of results returned. This can be done with the “LIMIT” statement. E.g. you only want to have the 10 most relevant items returned:

 SELECT fieldnames FROM tablename LIMIT number; 
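For example, assuming the university.students table used throughout this tutorial series, returning only the first ten students could look like this:

SELECT * FROM university.students LIMIT 10;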

Another common case is to group results by specific fields. This is useful when you want to apply aggregate functions to each group, as the sketch below shows:

SELECT fieldnames FROM tablename GROUP BY fieldnames;
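GROUP BY is typically combined with an aggregate function such as COUNT. As a small sketch, again assuming the university.students table from this series, the following counts the students per gender:

SELECT gender, COUNT(*) FROM university.students GROUP BY gender;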

Let’s now look at some code below:

SELECT * FROM university.students;
SELECT * FROM university.students WHERE gender = "female";
SELECT * FROM university.students WHERE gender = "female" ORDER BY lastname;
SELECT * FROM (
   SELECT lastname, gender FROM university.students) sq;

In the first query, we return all students from the table. The second query only returns female students. The third one orders them by last name. The last query shows that queries can be nested as sub-queries.

There are many more functions that can be applied, and they can be chained (e.g. ORDER BY and WHERE).

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

In the previous tutorial, we learnt how to create tables. Now, it is about time to add some data to our tables. Therefore, we will look at how to insert data into Hive tables using the INSERT statement. This is straightforward:

Insert data into Hive tables

INSERT INTO TABLE name VALUES [values]
  • name: Name of the table to insert into. This can also be prefixed with database.tablename
  • values: The values to insert into the table. All values for the table must be provided; it is not possible to skip values (as it is in some other SQL systems)

Another possibility is to load data into tables from files. This is done with the following statement; a short example follows the parameter list:

LOAD DATA INPATH path INTO TABLE name 
  • path: The path of the file to load from. Typically, with Hive, this would be a file on HDFS
  • name: Name of the table to insert into. This can also be prefixed with database.tablename
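As a sketch (the HDFS path here is a made-up assumption), loading a file into the students table could look like this:

-- hypothetical HDFS path; the file content must already match the table's storage format
LOAD DATA INPATH '/user/hive/students_import' INTO TABLE university.students;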

It is also possible to insert data from a sub-query. This can be done with this statement:

INSERT INTO TABLE name [select statement]

The only difference from the first statement is that instead of the values, we provide a SELECT statement. The SELECT statement is described in a later tutorial.
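As a sketch, assuming a hypothetical table university.students_backup with the same schema as university.students, copying all female students could look like this:

-- university.students_backup is an assumed table, not one created in this series
INSERT INTO TABLE university.students_backup
SELECT * FROM university.students WHERE gender = "female";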

Now, let’s use the sample from the last tutorial and insert some data into our databases.

For the students, we enter this code:

INSERT INTO TABLE university.students VALUES (1, "Mario", "Meir-Huber", "01/03/1984", "male"),
(2, "Max", "Musterman", "01/01/1988", "male"), (3, "Anna", "Studihard", "05/05/1989", "female"),
(4, "Sara", "Supersmart", "06/06/1990", "female");

For the classes, we enter the following:

INSERT INTO TABLE university.classes VALUES (1, "Business", "Accounting 1"), (2, "IT", "Software Development 1");

And for the enrolment, we enter the following:

INSERT INTO TABLE university.enrollment VALUES (1, 1), (2, 3), (1, 3), (1, 2), (1, 4), (2, 4);

Now we are all set and can start querying our data. This will happen in the next tutorial.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

In the last tutorial, we looked at how to create databases in Hive. This time, we look at how to create a table in Hive. The syntax to create a new table is as follows:

Create a table in Hive

CREATE TABLE [IF NOT EXISTS] [database.]table_name 
  • IF NOT EXISTS: Before creation, checks whether the table already exists. If this option isn’t used and the table exists, an error will be displayed.
  • database: Name of the database in which the table should be created

Also sounds very easy, right? Sorry, but this time I have to disappoint you. Creating tables has several more options, which I removed from the syntax above for better readability. The following are additional options (a combined example follows the list):

  • COLUMN NAMES: Provides columns and their data types for the table
  • COMMENT: Adds a comment to the table
  • PARTITIONED BY: Provides a partition key (or more) for the table. This is based on the column names
  • CLUSTERED BY: In addition to partitioning, tables can also be clustered into buckets.
  • STORED AS: Stores the table in a specific format – e.g. parquet.
  • LOCATION: Provides a user-specific location for the table
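To illustrate, here is a sketch combining several of these options; the table name, comment, and partition column are made up for this example:

-- hypothetical table combining COMMENT, PARTITIONED BY, and STORED AS
CREATE TABLE IF NOT EXISTS university.exams
(studentid INT, mark INT)
COMMENT 'Exam results per student'
PARTITIONED BY (examyear INT)
STORED AS PARQUET;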

Hive knows several data types. For numbers, these are:

  • Integers: tinyint, smallint, int, bigint
  • Floating-point: float, double, double precision, decimal, decimal(precision, scale)

Other basic datatypes are:

  • string, binary, timestamp, date, char, varchar

Non-primitive datatypes are:

  • array, map, struct, union

As already mentioned several times during this tutorial series, Hive basically stores everything on HDFS as files. One of the parameters you can add in “CREATE TABLE” is “STORED AS”. Hive knows several file formats that have different benefits. You can start with a large text file, but for better performance, partitioned files in column formats are better. The possible file formats are: Avro, Parquet, ORC, RCFile, JSONFile. The ideal file format should be selected based on the relevant use case.

Now, we were mainly focusing on how to create tables. However, there might also be the necessity to delete tables. This works with the following statement:

DROP TABLE [IF EXISTS] table_name 
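For example, dropping the marks table that we create below, without raising an error if it is missing, would look like this:

DROP TABLE IF EXISTS university.marks;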

Now, since we know everything we need to know for this, let’s play with Hive. Start your container again and launch the Data Analytics Studio UI. We now create several tables that mimic the structure of a university.

First, let’s start with students. Students have some properties like name and age.

CREATE TABLE IF NOT EXISTS university.students
(studentid INT, firstname STRING, lastname STRING, birthday STRING, gender STRING)
STORED AS PARQUET;

Next, we create a table for classes.

CREATE TABLE IF NOT EXISTS university.classes
(classid INT, studyname STRING, classname STRING)
STORED AS PARQUET;

Next, we need to create a cross-table that creates relations between students and classes.

CREATE TABLE IF NOT EXISTS university.enrollment
(classid INT, studentid INT)
STORED AS PARQUET;

Last, but not least, each student should have a mark when going for a class. Therefore, we create another cross-table between the classid and studentid.

CREATE TABLE IF NOT EXISTS university.marks
(classid INT, studentid INT, mark INT)
STORED AS PARQUET;

In Data Analytics Studio, this should look like the following:

HiveQL Sample

Now, we’ve finished the tables. In the next tutorial, we will insert some data into the tables.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

Actually, there are no “real” databases in Hive or Hadoop (unless you install HBase or so). All data is stored in files. However, with HiveQL, you get the feeling that you are actually working with databases. Therefore, we start by creating “databases” as a first step with Hive.

Working with the Database in Hive

The syntax for creating databases is very easy:

CREATE DATABASE [IF NOT EXISTS] database_name [COMMENT] [LOCATION] [WITH DBPROPERTIES] 

The easiest way to write it is “CREATE DATABASE db”. All other options are optional (a combined example follows the list):

  • IF NOT EXISTS: The new database is only created if it doesn’t exist already. If you don’t use this option and the database already exists, an error would be displayed.
  • COMMENT: Provides a comment for a new database, in case this is needed for further explanation.
  • LOCATION: Specifies an HDFS path for the new database.
  • WITH DBPROPERTIES: Specifies some additional properties for the database.
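As a sketch combining these options (the comment, path, and property values are made up for illustration):

-- comment, location, and properties are hypothetical values
CREATE DATABASE IF NOT EXISTS university
COMMENT 'Database for the university tutorial'
LOCATION '/user/hive/university.db'
WITH DBPROPERTIES ('creator' = 'tutorial');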

Deleting a database is very similar. You can do this with the following syntax:

DROP DATABASE [IF EXISTS] database_name [CASCADE or RESTRICT]

Also here, the statement “DROP DATABASE db” is the easiest form. All other options are optional (an example follows the list):

  • IF EXISTS: Before deletion, checks whether the database actually exists. If this option isn’t used and the database doesn’t exist, an error will be displayed.
  • CASCADE: Deletes the tables in the database first. Without it, an error would be produced if a database marked for deletion still contains tables.
  • RESTRICT: Standard behavior for deletion. Runs into an error if tables exist in the database.
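For example, deleting the university database even if it still contains tables would look like this:

DROP DATABASE IF EXISTS university CASCADE;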

Easy, isn’t it? Now, let’s have some fun with Hive and create some databases. Start the container we created last time with Docker. Starting takes some time. Also make sure to start the hdp-proxy container. If you run into a bad gateway error (502), just wait some time and re-try. After that, you should be able to access Ambari again. Scroll down to “Data Analytics Studio” and click on “Data Analytics Studio UI”. You are then re-directed to a UI where you can write queries. The following image shows the welcome screen. Note: you might get to an error page, since it might want to re-direct you to a wrong URL. Exchange the URL with “127.0.0.1:30800” and you should be fine.

Hortonworks Data Analytics Studio

First, let’s create a new database. We call our database “university”. Note that we will use this database also over the next tutorials. Simply click on “Compose Query”. You should now see the query editor. Enter the following code:

CREATE DATABASE IF NOT EXISTS university;

After clicking “Execute”, your database will be created. The following image shows this:

Data Analytics Studio Query Editor

We also add the “IF NOT EXISTS” statement in order not to run into errors. We can delete the database with the DROP statement:

DROP DATABASE IF EXISTS university;

Re-create the database in case you did the drop now. In the next tutorial, we will look into how to work with tables in Hive.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

We learned about the basics of Hive in the previous tutorial. In the following tutorials, we will use the Hortonworks Sandbox to work with Hive. Hortonworks is one of the Hadoop distributions (next to Cloudera and MapR), and the Sandbox is a pre-configured environment. There is no need for additional setup or installations. The Sandbox is delivered as different VMs or as a Docker container. We use the Docker container, as it is the easiest way (and you don’t need to install any VM tools). To get started, download the latest Docker environment for your computer/Mac: https://www.docker.com/get-started. Then, we can get started setting up the Hortonworks Sandbox with Docker.

Follow the installation routine throughout; it is easy and straightforward. Once done, download the Hortonworks image from https://hortonworks.com/downloads/#sandbox

As an install type, select “Docker” and make sure that you have the latest version. As of writing this article, the current version of HDP (Hortonworks Data Platform) is 3.0. Once you have finished the download, execute the Docker script (on Linux and Mac: docker-deploy-hdp30.sh). After that, the script pulls several images. This takes some time, as they are several GB in size – so be patient!

The script also installs the HDP proxy tool, which might cause some errors. If you have whitespaces in your directories, you need to edit the HDP proxy sh file (e.g. with vim) and wrap all paths in quotation marks. Then, everything should be fine.

The next step is to change the admin password in your image. To do this, you need to SSH into the machine with the following command:

docker exec -it sandbox-hdp /bin/bash

Execute the following command:

ambari-admin-password-reset

Now re-type the passwords and the services will re-start. After that, you are ready to use HDP 3.0. To access HDP, use your local IP (127.0.0.1) with port 8080. Now, you should see the Ambari login screen. Enter “admin” and your password (the one you reset in the step before). You are then re-directed to your administration interface and should see the following screen:

HDP 3.0 with Ambari

You might see that most of your services are somewhat red. In order to get them to work, you need to restart them. This takes some time again, so you need to be patient here. Once your services turn green, we are ready to go. As you can see, setting up the Hortonworks Sandbox with Docker is really easy and straightforward.

Have fun exploring HDP – we will use it in the next tutorial, where we will look at how Hive abstracts Tables and Databases.

One of the easiest tools to use in Hadoop is Hive. Hive is very similar to SQL and is easy to learn for those that have a strong SQL background. Apache Hive is a data-warehousing tool for Hadoop, focusing on large datasets and how to create a structure on them.

Hive queries are written in HiveQL. HiveQL is very similar to SQL, but not the same. As already mentioned, HiveQL translates to MapReduce and therefore comes with minor performance trade-offs. HiveQL can be extended by custom code and MapReduce queries. This is useful when additional performance is required.

The following listings show some Hive queries. The first listing shows how to query two columns from a dataset.

hive> SELECT column1, column2 FROM dataset;
2 5
4 9
5 7
5 9

Listing 2: simple Hive query

The next sample shows how to include a where-clause.

hive> SELECT DISTINCT column1 FROM dataset WHERE column2 = 9;
1

Listing 3: where in Hive

HCatalog is an abstract table manager for Hadoop. The goal of HCatalog is to make it easier for users to work with data. Users see everything as if it were a relational database. To access HCatalog, it is possible to use a REST API.