In the last tutorial, we looked at how to create databases in Hive. This time, we look into tables. The basic syntax to create a new table is as follows:
CREATE TABLE [IF NOT EXISTS] [database_name.]table_name
- IF NOT EXISTS: Before creating the table, checks whether it already exists. If this option is omitted and the table exists, an error is raised.
- database_name: Name of the database in which the table should be created
Also sounds very easy, right? Sorry, but this time I have to disappoint you: creating tables comes with several more options, which I removed from the syntax above for better readability. The additional options are:
- COLUMN NAMES: Provides columns and their data types for the table
- COMMENT: Adds a comment to the table
- PARTITIONED BY: Provides one or more partition keys for the table, based on the column names
- CLUSTERED BY: In addition to partitioning, tables can also be clustered into buckets.
- STORED AS: Stores the table in a specific format – e.g. parquet.
- LOCATION: Provides a user-specific storage location for the table
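Putting these options together, a fuller (simplified) sketch of the statement looks roughly like this — the placeholders in lower case are to be filled in by you, and optional parts are in square brackets:

CREATE TABLE [IF NOT EXISTS] [database_name.]table_name
  (column_name data_type, ...)
  [COMMENT 'table comment']
  [PARTITIONED BY (partition_column data_type, ...)]
  [CLUSTERED BY (column_name, ...) INTO n BUCKETS]
  [STORED AS file_format]
  [LOCATION 'hdfs_path']

Note that a partition column must not also appear in the regular column list — Hive treats it as a separate, virtual column.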
Hive knows several datatypes. For numbers, they are:
- Integers: tinyint, smallint, int, bigint
- Floating-point: float, double, double precision (an alias for double), decimal, decimal(precision, scale)
Other basic datatypes are:
- string, binary, timestamp, date, char, varchar
Non-primitive datatypes are:
- array, map, struct, uniontype
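To illustrate, a hypothetical table combining primitive and non-primitive types could be declared like this (the table and column names are made up for this example):

CREATE TABLE IF NOT EXISTS university.staff (
  staffid INT,
  name STRING,
  salary DECIMAL(10,2),
  hired DATE,
  phones ARRAY<STRING>,
  office STRUCT<building:STRING, room:INT>
) STORED AS PARQUET;

The complex types are declared with angle brackets; a STRUCT, for instance, bundles named fields of possibly different types into a single column.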
As already mentioned several times during this tutorial series, Hive basically stores everything on HDFS as files. One of the parameters you can add to “CREATE TABLE” is “STORED AS”. HDFS knows several file formats that have different benefits. You can start with a large text file, but for better performance, partitioned files in columnar formats are better. The possible file formats are: Avro, Parquet, ORC, RCFile, JSONFile. The ideal file format should be selected based on the relevant use case.
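For example, the same (hypothetical) table could be stored as plain text or in the columnar ORC format simply by changing the STORED AS clause — everything else stays identical:

CREATE TABLE university.events_txt (eventid INT, title STRING) STORED AS TEXTFILE;
CREATE TABLE university.events_orc (eventid INT, title STRING) STORED AS ORC;

TEXTFILE is easy to inspect and exchange, while columnar formats such as ORC or Parquet typically compress better and speed up analytical queries that read only a few columns.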
Now, we were mainly focusing on how to create tables. However, there might also be the necessity to delete tables. This works with the following statement:
DROP TABLE [IF EXISTS] table_name
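For instance, dropping a no-longer-needed table (the table name here is just an illustration) would look like this:

DROP TABLE IF EXISTS university.oldtable;

The IF EXISTS clause works like its counterpart in CREATE TABLE: without it, dropping a table that does not exist raises an error.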
Now, since we know everything we need to know for this, let’s play with Hive. Start your container again and launch the Data Analytics Studio UI. We now create several tables that should mimic the structure of a university.
First, let’s start with students. Students have some properties like name and age.
CREATE TABLE IF NOT EXISTS university.students (studentid INT, firstname STRING, lastname STRING, birthday STRING, gender STRING) STORED AS PARQUET;
Next, we create a table for classes.
CREATE TABLE IF NOT EXISTS university.classes (classid INT, studyname STRING, classname STRING) STORED AS PARQUET;
Next, we need to create a cross-table that creates relations between students and classes.
CREATE TABLE IF NOT EXISTS university.enrollment (classid INT, studentid INT) STORED AS PARQUET;
Last but not least, each student should receive a mark when taking a class. Therefore, we create another cross-table that combines classid and studentid with the mark.
CREATE TABLE IF NOT EXISTS university.marks (classid INT, studentid INT, mark INT) STORED AS PARQUET;
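To verify that everything was created, you can list the tables in the database and inspect the columns of one of them — the exact output will depend on your setup:

SHOW TABLES IN university;
DESCRIBE university.students;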
In Data Analytics Studio, this should look like the following:
Now, we’ve finished the tables. In the next tutorial, we will insert some data into the tables.