This is the kick-off to the Apache Hive Tutorial. Over the next weeks, I will post different tutorials on how to use Hive. Hive is a key component of the Hadoop ecosystem, and today we will start with a general description of it.
So, what is Hive all about? Hive is a distributed query engine with its own query language (called HiveQL) for Hadoop. Its main purpose is to enable a large number of people to work with data stored in Hadoop. Facebook introduced Hive for exactly this reason: to give its analysts access to that data. Below you can see the typical data flow in a Hive project.
The above image shows how the workflow goes: first, a Hive client sends a request to the Hive Server. After that, the driver takes over and submits the job to the JobClient. Jobs are then executed on a Hadoop or Spark cluster. In our samples over the next tutorials, however, we will use the Web UI from Hortonworks. But we will have a look at that later. First, let’s have a look at another component: HCatalog.
HCatalog is a service that makes it easier to use Hive. With it, files on HDFS are abstracted to look like databases and tables. HCatalog is therefore a metadata repository describing the files on HDFS. Other tools on Hadoop or Spark take advantage of this and use HCatalog as well.
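To make the idea of a metadata repository a bit more tangible, here is a minimal sketch in Python. It is only a toy model and not the HCatalog API: `ToyCatalog`, `TableMeta`, the `web.visits` table and the CSV file are all made up for illustration, and a local temporary file stands in for a file on HDFS. The point is that the file stays plain text, while the catalog only stores where it lives and how its columns should be read.

```python
import csv
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TableMeta:
    """Metadata only: where the file lives and how to interpret its columns."""
    location: Path
    columns: list  # list of (column name, cast function) pairs


class ToyCatalog:
    """Hypothetical stand-in for a metadata repository (not the HCatalog API)."""

    def __init__(self):
        self._tables = {}

    def register(self, database, table, meta):
        # Registering a table stores metadata; the data file is not touched.
        self._tables[(database, table)] = meta

    def read(self, database, table):
        meta = self._tables[(database, table)]
        with meta.location.open(newline="") as handle:
            for raw in csv.reader(handle):
                # The schema is applied while reading the plain-text file.
                yield {
                    name: cast(value)
                    for (name, cast), value in zip(meta.columns, raw)
                }


with tempfile.TemporaryDirectory() as tmp:
    # A local temp file stands in for a file on HDFS (assumption for the demo).
    data_file = Path(tmp) / "visits.csv"
    data_file.write_text("alice,3\nbob,5\n")

    catalog = ToyCatalog()
    catalog.register(
        "web", "visits", TableMeta(data_file, [("user", str), ("count", int)])
    )
    rows = list(catalog.read("web", "visits"))

print(rows)
```

Any tool that knows the catalog can ask for `web.visits` without caring about the file path or format, which is roughly the convenience HCatalog offers to other tools.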
With traditional data warehouse or RDBMS systems, one worked in databases, and SQL was the language used to access data in these systems. Hive provides HiveQL for this purpose (which we will look at in more detail in the coming blog posts). HiveQL basically works on files in Hadoop, such as plain text files, ORC or Parquet.
One key aspect of Hive is that it is mainly read-oriented. This means that you typically don’t update data, as everything you do in Hadoop is built for analytics. Hive still provides the possibility to update data, but this is done as an append (meaning that the original data isn’t altered in place, in contrast to an RDBMS).
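The append idea can be illustrated with a small Python sketch. This is a toy model under my own assumptions and not how Hive implements updates internally: `AppendOnlyTable` and `Record` are hypothetical names. An "update" only adds a new record to the log, and readers resolve the current state by letting the latest record per key win.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str
    value: int
    version: int


class AppendOnlyTable:
    """Toy model: existing records are never modified, only new ones appended."""

    def __init__(self):
        self._log = []

    def upsert(self, key, value):
        # An update is written as a new record; older records stay untouched.
        version = len(self._log) + 1
        self._log.append(Record(key, value, version))

    def history(self):
        # Return a copy so callers cannot alter the log.
        return list(self._log)

    def current(self):
        # Replay the log in order; later records overwrite earlier ones per key.
        latest = {}
        for record in self._log:
            latest[record.key] = record.value
        return latest


table = AppendOnlyTable()
table.upsert("alice", 3)
table.upsert("bob", 5)
table.upsert("alice", 7)  # "updates" alice by appending a newer record

print(len(table.history()))  # 3
print(table.current())       # {'alice': 7, 'bob': 5}
```

Note that the original value for `alice` is still in the history, which is the key difference from an in-place update in an RDBMS.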
Another key element of Hive is security. In all enterprise environments, it is very important to secure your tables against different kinds of access. Hive therefore supports different options:
- Storage-based authorization: Hive itself doesn’t handle the authorization. It is handled by the storage layer (ACLs in a cloud bucket/object store or HDFS ACLs)
- SQL standards-based authorization via HiveServer2: Storage-based authorization is all or nothing for a table, which is often not fine-grained enough. With this option, Hive can apply fine-grained authorization on database objects and only show the columns/rows relevant to the user
- Authorization via Ranger or Sentry: Apache projects that provide advanced authorization in Hadoop and abstract the authorization issues. They allow advanced rules for access to data
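To show what fine-grained authorization means compared to the all-or-nothing approach, here is a short Python sketch. It is purely illustrative and does not use any Hive, Ranger or Sentry API: the `Policy` class, the user names and the sample rows are all assumptions made up for this example. Each user gets a set of visible columns plus a row filter.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    """Hypothetical policy: which columns and which rows a user may see."""
    allowed_columns: set
    row_filter: Callable[[dict], bool]


# Made-up sample data standing in for a table.
ROWS = [
    {"name": "alice", "country": "AT", "salary": 5000},
    {"name": "bob", "country": "DE", "salary": 4000},
]

# Made-up users and rules.
POLICIES = {
    "analyst_at": Policy({"name", "country"}, lambda row: row["country"] == "AT"),
    "hr_admin": Policy({"name", "country", "salary"}, lambda row: True),
}


def query(user, rows):
    policy = POLICIES.get(user)
    if policy is None:
        # Unknown users get nothing at all.
        raise PermissionError(f"no policy for user {user!r}")
    # Keep only permitted rows, then only permitted columns within each row.
    return [
        {col: val for col, val in row.items() if col in policy.allowed_columns}
        for row in rows
        if policy.row_filter(row)
    ]


analyst_view = query("analyst_at", ROWS)
admin_view = query("hr_admin", ROWS)

print(analyst_view)  # [{'name': 'alice', 'country': 'AT'}]
```

With storage-based authorization, the analyst would either see the whole underlying file or nothing. With column and row rules, the same table can serve both users.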
To work with Hive, you will typically use HiveQL. In the next tutorial, we will have a look at how to set up an environment where you can work with Hive.
Header image: https://www.flickr.com/photos/karen_roe/32417107542