
In the last tutorial about Hive, we had a look at how to insert data into Hive. Now that we have the data in Hive, we look at how to access it. Querying data is easy and can be done in a few simple steps with the Hive SELECT statement. In its simplest form, it looks like the following:

The Hive SELECT statement

SELECT fieldnames FROM tablename; 
  • fieldnames: Name of the fields to query – e.g. ID, firstname, lastname
  • tablename: Name of the table to query the fields from – e.g. students

Of course, there is much more to it than that. After the table name, you can specify a “WHERE” clause, which filters data on specific criteria. An example would be to limit the results to students that are younger than 18. The WHERE clause is written like this:

SELECT fieldnames FROM tablename WHERE wherestatement; 

Often, you want to order data in a certain way. This can be achieved with the “ORDER BY” clause, which orders the results by the specified fields. An example would be to sort students by age. The statement is written like this:

SELECT fieldnames FROM tablename ORDER BY orderstatement;

Often, you only want a certain number of results returned. This can be done with the “LIMIT” clause, e.g. when you only want the first 10 rows returned:

 SELECT fieldnames FROM tablename LIMIT number; 

Another common case is to group results by specific fields. This is useful if you want to apply aggregate functions, such as COUNT or AVG, to each group:

SELECT fieldnames FROM tablename GROUP BY fieldnames;
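
Grouping is typically combined with such an aggregate function. A minimal sketch, counting the students per gender in the students table used throughout this series:

SELECT gender, COUNT(*) AS num_students
FROM university.students
GROUP BY gender;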

Let’s now look at some code below:

SELECT * FROM university.students;
SELECT * FROM university.students WHERE gender = "female";
SELECT * FROM university.students WHERE gender = "female" ORDER BY lastname;
SELECT * FROM (
   SELECT lastname, gender FROM university.students) sq;

In the first query, we want to return all students from the table. The second query only returns female students. The third one orders them by last name. The last query shows that queries can be nested: the outer query selects from the result of the inner one.

There are many more functions that can be applied, and the clauses can be chained (e.g. WHERE combined with ORDER BY), as in the sketch below.
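
A minimal sketch chaining WHERE, ORDER BY and LIMIT on the same students table:

SELECT firstname, lastname
FROM university.students
WHERE gender = "female"
ORDER BY lastname
LIMIT 10;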

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

In the previous tutorial, we learned how to create tables. Now it is about time to add some data to them. Therefore, we will look at how Hive inserts data into tables using the INSERT statement. This is straightforward:

Insert data into Hive tables

INSERT INTO TABLE name VALUES [values]
  • name: Name of the table to insert into. This can also be prefixed with database.tablename
  • values: The values to insert into the table. All values for the table must be provided; it is not possible to skip values (as in some other SQL systems)

Another possibility is to load data into a table from a file. This is done with the following statement (an example follows the parameter list):

LOAD DATA INPATH path INTO TABLE name 
  • path: The path of the file to load from. Typically, with Hive, this would be a file on HDFS
  • name: Name of the table to insert into. This can also be prefixed with database.tablename
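
A minimal sketch (the HDFS path is a hypothetical placeholder, and the file must match the table's storage format):

-- the path below is a placeholder; the file format must match the table definition
LOAD DATA INPATH '/user/hive/import/students' INTO TABLE university.students;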

It is also possible to insert data from a sub-query. This can be done with this statement:

INSERT INTO TABLE name [select statement]

The only difference from the first statement is that instead of the “values”, we provide a SELECT statement. The SELECT statement is described in a separate tutorial.
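
A minimal sketch, assuming a hypothetical table university.female_students with the same schema as university.students:

-- university.female_students is a hypothetical target table with a matching schema
INSERT INTO TABLE university.female_students
SELECT * FROM university.students WHERE gender = "female";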

Now, let’s use the sample from the last tutorial and insert some data into our databases.

For the students, we enter this code:

INSERT INTO TABLE university.students VALUES (1, "Mario", "Meir-Huber", "01/03/1984", "male"),
(2, "Max", "Musterman", "01/01/1988", "male"), (3, "Anna", "Studihard", "05/05/1989", "female"),
(4, "Sara", "Supersmart", "06/06/1990", "female");

For the classes, we enter the following:

INSERT INTO TABLE university.classes VALUES (1, "Business", "Accounting 1"), (2, "IT", "Software Development 1");

And for the enrolment, we enter the following:

INSERT INTO TABLE university.enrollment VALUES (1, 1), (2, 3), (1, 3), (1, 2), (1, 4), (2, 4);

Now we are all set and can start querying our data. This will happen in the next tutorial.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

In the last tutorial, we looked at how to create databases in Hive. This time, we look at how to create a table in Hive. The syntax to create a new table is as follows:

Create a table in Hive

CREATE TABLE [IF NOT EXISTS] [database.]table_name 
  • IF NOT EXISTS: Prior to creation, checks if the table already exists. If this option isn’t used and the table exists, an error will be displayed.
  • database: Name of the database in which the table should be created

Also sounds very easy, right? Sorry, but this time I have to disappoint you. Creating tables has several more options, which I removed from the syntax above for better readability. The additional options are:

  • COLUMN NAMES: Provides columns and their data types for the table
  • COMMENT: Adds a comment to the table
  • PARTITIONED BY: Provides a partition key (or more) for the table. This is based on the column names
  • CLUSTERED BY: In addition to partitioning, tables can also be clustered into buckets.
  • STORED AS: Stores the table in a specific format – e.g. parquet.
  • LOCATION: Provides a user-specific location for the table

Hive knows several data types (a short sketch after the lists below shows some of them in use). For numbers, they are:

  • Integers: tinyint, smallint, int, bigint
  • Floating-point: float, double, double precision, decimal (with optional precision and scale)

Other basic datatypes are:

  • string, binary, timestamp, date, char, varchar

Non-primitive datatypes are:

  • array, map, struct, union
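
A minimal sketch of a hypothetical table that uses some of these data types, including the non-primitive ones:

-- hypothetical table illustrating primitive and non-primitive data types
CREATE TABLE IF NOT EXISTS university.professors
(profid INT, name STRING, hired DATE, salary DECIMAL(10,2),
phones ARRAY<STRING>,
office MAP<STRING, STRING>,
address STRUCT<street:STRING, city:STRING, zip:STRING>);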

As already mentioned several times during this tutorial series, Hive basically stores everything as files on HDFS. One of the parameters you can add to “CREATE TABLE” is “STORED AS”. Hive supports several file formats that have different benefits. You can start with a large text file, but for better performance, partitioned files in columnar formats are better. The possible file formats are: Avro, Parquet, ORC, RCFile and JSONFile. The ideal file format should be selected based on the relevant use case.
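
A minimal sketch of a hypothetical partitioned table stored in a columnar format (the table name and partition key are placeholders):

-- hypothetical table: partitioned by year and stored as ORC
CREATE TABLE IF NOT EXISTS university.exam_results
(studentid INT, classid INT, mark INT)
PARTITIONED BY (examyear INT)
STORED AS ORC;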

Now, we were mainly focusing on how to create tables. However, there might also be the necessity to delete tables. This works with the following statement:

DROP TABLE [IF EXISTS] table_name 
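
For example, dropping a hypothetical table without failing if it does not exist:

-- university.old_results is a hypothetical table used only for illustration
DROP TABLE IF EXISTS university.old_results;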

Now, since we know everything we need to know for this, let’s play with Hive. Start your container again and launch the Data Analytics Studio UI. We now create several tables that mimic the structure of a university.

First, let’s start with students. Students have some properties like name and age.

CREATE TABLE IF NOT EXISTS university.students
(studentid INT, firstname STRING, lastname STRING, birthday STRING, gender STRING)
STORED AS PARQUET;

Next, we create a table for classes.

CREATE TABLE IF NOT EXISTS university.classes
(classid INT, studyname STRING, classname STRING)
STORED AS PARQUET;

Next, we need to create a cross-table that creates relations between students and classes.

CREATE TABLE IF NOT EXISTS university.enrollment
(classid INT, studentid INT)
STORED AS PARQUET;

Last, but not least, each student should receive a mark when taking a class. Therefore, we create another cross-table between classid and studentid that also stores the mark.

CREATE TABLE IF NOT EXISTS university.marks
(classid INT, studentid INT, mark INT)
STORED AS PARQUET;

In Data Analytics Studio, this should look like the following:

HiveQL Sample

Now, we’ve finished the tables. In the next tutorial, we will insert some data into the tables.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

Actually, there are no “real” databases in Hive or Hadoop (unless you install HBase or similar). All data is stored in files. However, with HiveQL, you get the feeling that you are actually working with databases. Therefore, we start by creating “databases” as a first step with Hive.

Working with the Database in Hive

The syntax for creating databases is very easy:

CREATE DATABASE [IF NOT EXISTS] database_name [COMMENT] [LOCATION] [WITH DBPROPERTIES] 

The simplest form is “CREATE DATABASE db”. All other clauses are optional (an example using them follows the list):

  • IF NOT EXISTS: The new database is only created if it doesn’t exist already. If you don’t use this option and the database already exists, an error would be displayed.
  • COMMENT: Provides a comment for a new database, in case this is needed for further explanation.
  • LOCATION: Specifies a hdfs path for the new database.
  • WITH DBPROPERTIES: Specifies some additional properties for the database.
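
A minimal sketch using these clauses (the database name, the location path and the property are placeholders):

-- testdb, the location and the property below are placeholders
CREATE DATABASE IF NOT EXISTS testdb
COMMENT 'A test database for trying out the options'
LOCATION '/user/hive/warehouse/testdb.db'
WITH DBPROPERTIES ('creator' = 'tutorial');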

Deleting a database is also very similar to this. You can do this with the following syntax:

DROP DATABASE [IF EXISTS] database_name [CASCADE or RESTRICT]

Here, too, the statement “DROP DATABASE db” is the simplest form. All other options are optional (see the example after the list):

  • IF EXISTS: Prior deletion, checks if the database actually exists. If this command isn’t used and the database doesn’t exist, an error will be displayed.
  • CASCADE: Deletes tables first. If a database is marked for deletion but contains tables, an error would be produced otherwise
  • RESTRICT: Standard behavior for deletion. Would run into an error if tables exist in a database.
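
A minimal sketch that deletes a hypothetical database together with all tables it contains:

-- CASCADE drops the contained tables first; testdb is a hypothetical database
DROP DATABASE IF EXISTS testdb CASCADE;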

Easy, isn’t it? Now, let’s have some fun with Hive and create some databases. Start the container we created last time with Docker. Starting takes some time. Also make sure to start the hdp-proxy container. If you run into a bad gateway error (502), just wait some time and retry. After that, you should be able to access Ambari again. Scroll down to “Data Analytics Studio” and click on “Data Analytics Studio UI”. You are then redirected to a UI where you can write queries. The following image shows the welcome screen. Note: you might land on an error page, since it might want to redirect you to a wrong URL. Exchange the URL with “127.0.0.1:30800” and you should be fine.

Hortonworks Data Analytics Studio

First, let’s create a new database. We call our database “university”. Note that we will use this database over the next tutorials as well. Simply click on “Compose Query”. You should now see the query editor. Enter the following code:

CREATE DATABASE IF NOT EXISTS university;

After clicking “Execute”, your database will be created. The following image shows this:

Data Analytics Studio Query Editor

We also add the “IF NOT EXISTS” statement in order to not run into errors. We can delete the database with the DROP statement:

DROP DATABASE IF EXISTS university;

Re-create the database in case you did the drop now. In the next tutorial, we will look into how to work with tables in Hive.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

We have learned about the basics of Hive in the previous tutorial. In the following tutorials, we will use the Hortonworks Sandbox to work with Hive. Hortonworks is one of the Hadoop distributions (next to Cloudera and MapR), and the Sandbox is a pre-configured environment. There is no need for additional setup or installations. The Sandbox is delivered as different VMs or as a Docker container. We use Docker, as it is the easiest way (and you don’t need to install any VM tools). To get started, download the latest Docker environment for your computer/Mac: https://www.docker.com/get-started. Then, we can start to set up the Hortonworks Sandbox with Docker.

Follow the installation routine throughout; it is easy and straightforward. Once done, download the Hortonworks image from https://hortonworks.com/downloads/#sandbox

As an install type, select “Docker” and make sure that you have the latest version. As of writing this article, the current version of HDP (Hortonworks Data Platform) is 3.0. Once you have finished the download, execute the deploy script (on Linux and Mac: docker-deploy-hdp30.sh). After that, the script pulls several Docker images. This takes some time, as they are several GB in size – so be patient!

The script also installs the HDP proxy tool, which might cause some errors. If you have whitespace in your directory names, you need to edit the HDP proxy .sh file (e.g. with vim) and wrap all paths in quotes (""). Then, everything should be fine.

The next step is to change the admin password in your image. To do this, you need to SSH into the machine with the following command:

docker exec -it sandbox-hdp /bin/bash

Execute the following command:

ambari-admin-password-reset

Now re-type the passwords and the services will restart. After that, you are ready to use HDP 3.0. To access HDP, use your local IP (127.0.0.1) with port 8080. Now, you should see the Ambari login screen. Enter “admin” and your password (the one you reset in the step before). You are now redirected to your administration interface and should see the following screen:

The Hortonworks Ambari Environment shows services that aren't started yet in the Hortonworks Sandbox with Docker
HDP 3.0 with Ambari

You might see that most of your services are marked red. In order to get them to work, you need to restart them. This takes some time again, so you need to be patient here. Once your services have turned green, we are ready to go. As you can see, setting up the Hortonworks Sandbox with Docker is really easy and straightforward.

Have fun exploring HDP – we will use it in the next tutorial, where we will look at how Hive abstracts Tables and Databases.

Header image: https://www.flickr.com/photos/karen_roe/32417107542

This is the kick-off to the Apache Hive Tutorial. Over the next weeks, I will post different tutorials on how to use Hive. Hive is a key component of Hadoop, and today we will start with a general description of it.

What is Apache Hive?

Basically, what is Hive all about? Hive is a distributed query engine and language (called HiveQL) for Hadoop. Its main purpose is to enable a large number of people to work with data stored in Hadoop. For this reason, Facebook originally introduced Hive for its analysts. Below you can see the typical data flow in a Hive project.

Hive Data Flow

The above image shows how the workflow goes: first, a Hive client sends a request to the Hive Server. After that, the driver takes over and submits the job to the JobClient. Jobs are then executed on a Hadoop or Spark cluster. In our samples over the next tutorials, however, we will use the Web UI from Hortonworks. But we will have a look at that later. First, let’s have a look at another component: HCatalog.

HCatalog is a service that makes it easy to use Hive. With this, files on HDFS are abstracted to look like databases and tables. HCatalog is therefore a metadata repository about the files on HDFS. Other tools on Hadoop or Spark take advantage of this and use HCatalog.
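
Because of this abstraction, the metadata can be browsed directly from HiveQL. A minimal sketch (assuming the university tables created in the other tutorials on this page):

-- list databases and tables, then show the stored metadata for one table
SHOW DATABASES;
SHOW TABLES IN university;
DESCRIBE FORMATTED university.students;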

With traditional data warehouse or RDBMS systems, one worked with databases, and SQL was the language used to access data from these systems. Hive provides HiveQL (which we will look at in more detail in the coming blog posts). HiveQL basically works on Hadoop files, such as plain text files, ORC or Parquet.

One key aspect of Hive is that it is mainly read-oriented. This means that you don’t update data, as everything you do in Hadoop is built for analytics. Hive still provides the possibility to update data, but this is rather done as an append or rewrite (meaning that the original data isn’t altered in place, in contrast to RDBMS systems).
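
To illustrate this rewrite style with a minimal sketch (raw_students and cleaned_students are hypothetical tables): instead of updating rows in place, a typical Hive workflow rewrites the whole target table with INSERT OVERWRITE:

-- hypothetical tables: the target is rewritten completely instead of updated row by row
INSERT OVERWRITE TABLE cleaned_students
SELECT * FROM raw_students WHERE lastname IS NOT NULL;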

Apache Hive Security

One key element of Hive is security. In all enterprise environments, it is very important to secure your tables against different kinds of access. Hive therefore supports different options:

  • Storage-based authorization: Hive doesn’t care about the authorization. Authorization is handled via the storage layer (ACLs in a cloud bucket/object store or HDFS ACLs)
  • Standards-based authorization via HiveServer2 over databases: Storage-based authorization is all or nothing per table – not fine-grained enough. Hive can also work with fine-grained authorization from databases to only show the columns/rows relevant to the user
  • Authorization via Ranger or Sentry: Apache projects that do advanced authorization in Hadoop and abstract the authorization issues. This allows advanced rules for data access

To work with Hive, you will typically use HiveQL. In the next tutorial, we will have a look on how to setup an environment where you can work with Hive.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

Header image: https://www.flickr.com/photos/karen_roe/32417107542

Apache Pig is an abstraction language that puts data at the center. Apache Pig is a “data-flow” language. In contrast to SQL (and Hive), Pig works iteratively and lets data flow from one statement to another. This gives more powerful options when it comes to data. The language used for Apache Pig is called “PigLatin”. A key benefit of Apache Pig is that it abstracts complex MapReduce tasks, such as joins, into very easy functions. Apache Pig makes it much easier for developers to write complex queries in Hadoop. Pig itself consists of two major components: PigLatin and a runtime environment.

When running Apache Pig, there are two possibilities: the first one is the standalone mode, which is intended for rather small datasets processed within a single Java virtual machine. For processing Big Data, it is necessary to run Pig in the MapReduce mode on top of HDFS. Pig applications are usually script files (with the extension .pig) that consist of a series of operations and transformations that create output data from input data. Pig itself translates these operations and transformations into MapReduce functions. The set of operations and transformations available in the language can easily be extended via custom code. Compared to the performance of “pure” MapReduce, Pig is a bit slower, but still very close to native MapReduce performance. Especially for those not experienced in MapReduce, Pig is a great tool (and much easier to learn than MapReduce).

When writing a Pig application, it can easily be executed as a script in the Hadoop environment. Especially when using the previously demonstrated Hadoop VMs, it is easy to get started. Another possibility is to work with Grunt, which allows us to execute Pig commands in the console. The third possibility to run Pig is to embed it in a Java application.

The question is: what differentiates Pig from SQL/Hive? First, Pig is a data-flow language. It is oriented on the data and how it is transformed from one statement to another. It works in step-by-step iterations and transforms data. Another difference is that SQL needs a schema, but Pig doesn’t. The only requirement is that the data can be processed in parallel.

The listing below shows a sample program. We will look at the possibilities within the next blog posts.

-- load the student file and assign a schema to its columns
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
-- keep only the name and the third column (the gpa)
X = FOREACH A GENERATE name, $2;
-- print the result to the console
DUMP X;
(John,4.0F)
(Mary,3.8F)
(Bill,3.9F)
(Joe,3.8F)

Hadoop is one of the most popular Big Data technologies, or maybe the key Big Data technology. Due to large demand for Hadoop, I’ve decided to write a short Hadoop tutorial series here. In the next weeks, I will write several articles on the Hadoop platform and key technologies.

When we talk about Hadoop, we don’t talk about one specific software or service. The Hadoop project features several sub-projects, each of them serving different topics in the Big Data ecosystem. When handling data, Hadoop is very different from traditional RDBMS systems. Key differences are:

  • Hadoop is about large amounts of data. Traditional database systems were only about a few gigabytes or terabytes of data; Hadoop can handle much more. Petabytes are not a problem for Hadoop
  • RDBMS work with an interactive access to data, whereas Hadoop is batch-oriented.
  • With traditional database systems, the approach was “read many, write many”. That means that data gets written often but also modified often. With Hadoop, this is different: the approach now is “write once, read many”. This means that data is written once and then never changed. The only purpose is to read the data for analytics.
  • RDBMS systems have schemas. When you design an application, you first need to create the schema of the database. With Hadoop, this is different: the schema is very flexible; it is actually schema-less
  • Last but not least, Hadoop scales linearly. If you add 10% more compute capacity, you will get about the same amount of additional performance. RDBMS are different; at a certain point, scaling them gets really difficult.

Central to Hadoop is the Map/Reduce algorithm. This algorithm was originally introduced by Google to power its search engine. However, the algorithm turned out to be very efficient for distributed systems, so it is nowadays used in many technologies. When you run queries in Hadoop with languages such as Hive or Pig (I will explain them later), these queries are translated to Map/Reduce algorithms by Hadoop. The following figure shows the Map/Reduce algorithm:

Map Reduce function

The Map/Reduce function has some steps:

  1. All input data is distributed to the Map functions
  2. The Map functions are running in parallel. The distribution and failover is handled entirely by Hadoop.
  3. The Map functions emit data to a temporary storage
  4. The Reduce function now processes the temporarily stored data and computes the result

A typical example is word count. With word count, text input data is passed to the Map functions. Each Map function emits every word it sees to the temporary store, where words of the same kind are grouped into a list. The Reduce function then counts the words in each list and builds the sum.
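
To give a flavour of how a higher-level query maps to this pattern, here is the classic word count expressed in HiveQL (a sketch assuming a hypothetical table docs with a single string column called line); Hive translates it into exactly this Map (split and emit words) and Reduce (count per word) flow:

-- docs is a hypothetical table with one STRING column called line
SELECT word, COUNT(*) AS cnt
FROM docs
LATERAL VIEW explode(split(line, ' ')) t AS word
GROUP BY word;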

Next week I will blog about the different Hadoop projects. As already mentioned earlier, Hadoop consists of several other projects.