One of the key infrastructure services for Hadoop is Apache ZooKeeper. ZooKeeper coordinates the nodes in a Hadoop cluster. Its main tasks in that role are to provide high availability for Hadoop and to handle distributed coordination.

As part of these tasks, ZooKeeper manages the cluster configuration for Hadoop. Another key challenge in a Hadoop cluster is naming, which has to apply to all nodes within the cluster. Apache ZooKeeper handles this by giving each node a unique name based on naming conventions.

The hierarchy in ZooKeeper

As shown in Figure 7, naming is hierarchical, which means that every name is a path. The root starts with a “/”, each successor has its own unique name, and its successors follow the same naming scheme. This allows nodes in the cluster to have child nodes, which in turn makes the cluster easier to maintain.
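
The path-based naming described above can be sketched with a small in-memory model. This is only an illustration of the naming scheme, not a ZooKeeper client: the class `NameTree`, its methods, and the example paths under `/hadoop` are all hypothetical names chosen for this sketch.

```python
class NameTree:
    """Toy in-memory model of a hierarchical namespace (not a ZooKeeper client)."""

    def __init__(self):
        # The root "/" always exists; each key is a full path, each value its data.
        self._nodes = {"/": b""}

    @staticmethod
    def parent_of(path):
        # "/hadoop/workers" -> "/hadoop", "/hadoop" -> "/"
        head = path.rsplit("/", 1)[0]
        return head or "/"

    def create(self, path, data=b""):
        # Paths must be absolute and must not end with a slash (except the root).
        if not path.startswith("/") or (path != "/" and path.endswith("/")):
            raise ValueError(f"invalid path: {path!r}")
        # Every full path may exist only once, which makes names unique.
        if path in self._nodes:
            raise KeyError(f"name already taken: {path}")
        # A node can only be created below an existing parent.
        if self.parent_of(path) not in self._nodes:
            raise KeyError(f"parent does not exist: {self.parent_of(path)}")
        self._nodes[path] = data

    def children(self, path):
        # Direct children only: strip the parent prefix and skip deeper levels.
        prefix = "/" if path == "/" else path + "/"
        return sorted(
            p[len(prefix):]
            for p in self._nodes
            if p != path and p.startswith(prefix) and "/" not in p[len(prefix):]
        )


tree = NameTree()
tree.create("/hadoop")
tree.create("/hadoop/workers")
tree.create("/hadoop/workers/node-1")
tree.create("/hadoop/workers/node-2")
print(tree.children("/hadoop/workers"))  # prints ['node-1', 'node-2']
```

Because the full path is the key, two nodes can share a short name such as `node-1` as long as they live under different parents.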

ZooKeeper also handles synchronization within the distributed environment and provides group services to the Hadoop cluster. For synchronization, one server in the ZooKeeper service acts as the “Leader” of all servers running in that service. The following illustration shows this.

Synchronization in the ZooKeeper Service
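
To make the role of the “Leader” more concrete, here is a heavily simplified sketch in which every update passes through one leading server, so that all servers apply updates in the same order. The names `Server` and `build_ensemble` are hypothetical, and picking the lowest server id as leader is an assumption made purely for illustration. The real ZooKeeper service elects its leader and replicates updates with its own protocol, which this sketch does not attempt to reproduce.

```python
class Server:
    """Toy model of one server in the service (not ZooKeeper's real protocol)."""

    def __init__(self, server_id):
        self.server_id = server_id
        self.leader = None     # set once the ensemble is assembled
        self.followers = []    # only filled in on the leader
        self.data = {}         # path -> value
        self.log = []          # order in which updates were applied

    def write(self, path, value):
        if self.leader is not self:
            # A follower does not apply the update itself; it hands it to the leader.
            return self.leader.write(path, value)
        # The leader applies the update to itself and to every follower,
        # so all servers see the same updates in the same order.
        for server in [self] + self.followers:
            server.data[path] = value
            server.log.append((path, value))

    def read(self, path):
        # Reads are answered from the local copy.
        return self.data.get(path)


def build_ensemble(server_ids):
    servers = [Server(i) for i in server_ids]
    # Assumption for this sketch only: the server with the lowest id leads.
    leader = min(servers, key=lambda s: s.server_id)
    leader.followers = [s for s in servers if s is not leader]
    for server in servers:
        server.leader = leader
    return servers


servers = build_ensemble([3, 1, 2])
servers[0].write("/config/replication", "3")   # sent to a follower
servers[1].write("/config/blocksize", "128m")  # sent to the leader directly
print(all(s.log == servers[0].log for s in servers))  # prints True
```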

To ensure high uptime and availability, the individual servers in the ZooKeeper service are replicated. Each server in the service knows every other server. If one server fails and is no longer available, clients connect to one of the other servers. The ZooKeeper service itself is built for failover and is also highly scalable.
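
The client-side failover described above can be sketched as follows. ZooKeeper clients are given a comma-separated list of `host:port` entries; the host names below, the helper functions `parse_connect_string` and `connect_with_failover`, and the stand-in `fake_connect` are hypothetical and only simulate the behaviour without any network access. A real client library handles reconnection internally and may pick servers in a different order.

```python
def parse_connect_string(connect_string):
    """Split "host1:2181,host2:2181" into a list of (host, port) tuples."""
    servers = []
    for entry in connect_string.split(","):
        host, port = entry.strip().rsplit(":", 1)
        servers.append((host, int(port)))
    return servers


def connect_with_failover(servers, try_connect):
    """Return the first session that try_connect(host, port) can open.

    try_connect is expected to raise ConnectionError when a server is down.
    """
    failures = []
    for host, port in servers:
        try:
            return try_connect(host, port)
        except ConnectionError as exc:
            # This server is unavailable, so move on to the next one.
            failures.append(f"{host}:{port} ({exc})")
    raise ConnectionError("no server reachable: " + ", ".join(failures))


# Simulated outage: the first server in the list is down.
DOWN = {"zk1.example.org"}


def fake_connect(host, port):
    if host in DOWN:
        raise ConnectionError("unreachable")
    return f"session@{host}:{port}"


servers = parse_connect_string(
    "zk1.example.org:2181,zk2.example.org:2181,zk3.example.org:2181"
)
print(connect_with_failover(servers, fake_connect))  # prints session@zk2.example.org:2181
```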
