It is recommended to use commodity general-purpose server hardware. Special hardware is not needed to run Hadoop clusters and in some cases can present problems.
It is recommended to run a Hadoop cluster on homogeneous hardware: all worker nodes should have the same hardware characteristics (number of cores, RAM, disk space, etc.).
The more data disks a worker node has, the better.
The faster the data disks of a worker node are, the better.
It is recommended to have separate disks for OS and for data storage.
It is recommended to use disks for data storage configured as JBOD (not recommended to use RAID). On the other hand, it is recommended to use RAID for disks with OS/Software.
It is recommended to limit the storage capacity of a worker node; 36 TB per node is a good estimate. Otherwise, when a worker node fails, re-replicating its data to the other nodes can saturate the network.
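To illustrate why per-node capacity matters, the sketch below estimates how long it takes to copy the data of one failed worker back onto the surviving nodes. It is a back-of-the-envelope model, not a description of how HDFS actually schedules recovery: the function name, the 10 Gbit/s links and the 50% share of bandwidth available for recovery are all illustrative assumptions.

```python
def rereplication_hours(node_tb, surviving_nodes, link_gbps=10.0, usable_fraction=0.5):
    """Rough time (in hours) to re-copy the data of one failed worker node.

    Assumption: the lost blocks are spread evenly, so every surviving node
    receives node_tb / surviving_nodes and is limited by its own network link.
    """
    if surviving_nodes < 1:
        raise ValueError("need at least one surviving node")
    bytes_per_node = node_tb * 1e12 / surviving_nodes        # data each node must receive
    bytes_per_second = link_gbps * 1e9 / 8 * usable_fraction  # usable inbound throughput
    return bytes_per_node / bytes_per_second / 3600


# Hypothetical 10-node cluster losing one node (9 survivors).
print(round(rereplication_hours(36, 9), 2))  # 36 TB per node
print(round(rereplication_hours(72, 9), 2))  # 72 TB per node
```

Under these assumptions the recovery window grows linearly with the capacity of the failed node, which is the reason for capping it.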
It is recommended to enable HyperThreading.
A good estimate of the number of CPUs per worker node: vcores = HDDs x 2.
- The amount of RAM per node depends on the Hadoop services one intends to run. A formula to estimate the necessary amount of RAM: RAM = (CPUs * RAM_PER_YARN_CONTAINER) + IMPALA_RAM + HBASE_RAM + OS_RAM.
- The more network bandwidth, the better: together with disk speed, network bandwidth is a common bottleneck in Hadoop clusters.
- It is recommended to run Hadoop on Linux. The most widely used distribution is RHEL/CentOS.
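The CPU and RAM sizing rules above can be turned into a small calculator. This is only a sketch: the function names are made up for illustration, the vcore count is reused as the CPU count in the RAM formula, and the sample figures (12 data disks, 4 GB per YARN container, 32 GB for Impala, 16 GB for HBase, 8 GB for the operating system) are assumptions rather than recommendations.

```python
def worker_vcores(data_disks):
    """Rule of thumb from the text: vcores = HDDs x 2."""
    return data_disks * 2


def worker_ram_gb(cpus, ram_per_yarn_container_gb, impala_gb=0, hbase_gb=0, os_gb=0):
    """RAM = (CPUs * RAM per YARN container) + Impala + HBase + OS reserve."""
    return cpus * ram_per_yarn_container_gb + impala_gb + hbase_gb + os_gb


# Hypothetical worker node with 12 data disks.
vcores = worker_vcores(data_disks=12)
ram = worker_ram_gb(vcores, ram_per_yarn_container_gb=4,
                    impala_gb=32, hbase_gb=16, os_gb=8)
print(vcores, ram)  # 24 152
```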
It is recommended to set the vm.swappiness parameter to a value in the range 1-10 (in the /etc/sysctl.conf file). The default value in most Linux distributions is higher (60).
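A configuration audit could check this setting with a few lines of Python. The sketch below parses sysctl.conf-style text and tests the value against the 1-10 range; the function names and the sample content are hypothetical, and it only looks at the text it is given, not at the value currently active in the kernel.

```python
def swappiness_from_sysctl(conf_text):
    """Return the last vm.swappiness value set in sysctl.conf-style text, or None."""
    value = None
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments ('#' and ';' both start a comment).
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, raw = line.partition("=")
        if sep and key.strip() == "vm.swappiness":
            value = int(raw.strip())
    return value


def swappiness_ok(value, low=1, high=10):
    """True if the value falls inside the recommended range."""
    return value is not None and low <= value <= high


sample = """
# Hadoop tuning (sample content, not a real host)
vm.swappiness = 1
"""
print(swappiness_ok(swappiness_from_sysctl(sample)))  # True
```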
It is recommended to disable the transparent huge pages (THP) function. RHEL:
echo "never" > /sys/kernel/mm/redhat_transparent_hugepage/defrag
Ubuntu/Debian, OEL, SLES:
echo "never" > /sys/kernel/mm/transparent_hugepage/defrag
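To verify that the change took effect, the same sysfs file can be read back: the kernel lists all options and marks the active one with square brackets. The helper below is a minimal sketch that parses such a line; the function names are illustrative, and the sample string stands in for the real file content.

```python
import re


def thp_active_setting(sysfs_text):
    """Return the option shown in [brackets], which is the active one."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


def thp_defrag_disabled(sysfs_text):
    """True if the active option is 'never'."""
    return thp_active_setting(sysfs_text) == "never"


# Sample content; on a real host, read it from the defrag file shown above.
print(thp_defrag_disabled("always madvise [never]"))  # True
print(thp_defrag_disabled("[always] madvise never"))  # False
```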
It is recommended to disable SELinux (even though some Hadoop distributions support running Hadoop with SELinux enabled).
It is recommended to set higher limits for open files and running processes for Hadoop users in /etc/security/limits.conf:
hdfs - nofile 32768
hdfs - nproc 32768
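The entries above can be checked programmatically. The following sketch parses limits.conf-style text and reports whether a user reaches the 32768 threshold; the function names and the `yarn` entry in the sample are hypothetical, and for simplicity the parser ignores the soft/hard distinction and keeps the last entry it sees.

```python
def parse_limits(conf_text):
    """Map (domain, item) -> value for entries in limits.conf-style text.

    Simplification: the limit type (soft, hard or '-') is ignored, so a later
    entry for the same domain and item overrides an earlier one.
    """
    limits = {}
    for line in conf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split columns
        if len(fields) != 4:
            continue
        domain, _limit_type, item, value = fields
        limits[(domain, item)] = value
    return limits


def meets_minimum(limits, user, item, minimum=32768):
    """True if the user has an entry for item that is at least minimum."""
    value = limits.get((user, item))
    if value is None:
        return False
    if value in ("unlimited", "infinity", "-1"):
        return True
    return int(value) >= minimum


sample = """
hdfs - nofile 32768
hdfs - nproc 32768
yarn - nofile 1024   # hypothetical entry that is too low
"""
limits = parse_limits(sample)
print(meets_minimum(limits, "hdfs", "nofile"), meets_minimum(limits, "yarn", "nofile"))
```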
It is recommended to configure a DNS server for hostname resolution (not to use the /etc/hosts file).
It is required to have reverse resolution configured for all cluster hosts. A quick test can be run with Python:
python -c 'import socket; print(socket.getfqdn(), socket.gethostbyname(socket.getfqdn()))'
The hostnames have to be in FQDN form. The command 'hostname -f' should return both the hostname and the domain name.
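The one-liner above only exercises forward resolution. A slightly longer sketch can also check the reverse lookup and the FQDN form. The function name and the returned keys are illustrative; the resolver functions are passed in as parameters only so that the logic can be tried without a live DNS server.

```python
import socket


def check_resolution(hostname=None,
                     forward=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Check that a host name is an FQDN and that forward and reverse DNS agree."""
    fqdn = hostname or socket.getfqdn()
    address = forward(fqdn)             # name -> IP address
    reverse_name = reverse(address)[0]  # IP address -> primary host name
    return {
        "fqdn": fqdn,
        "address": address,
        "reverse_name": reverse_name,
        "is_fqdn": "." in fqdn,
        "reverse_matches": reverse_name.lower() == fqdn.lower(),
    }
```

Calling `check_resolution()` without arguments on a cluster host checks the local machine against the configured resolver; both `is_fqdn` and `reverse_matches` should be true.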
It is recommended to install and configure the nscd service on all hosts to cache DNS lookups.
- It is required to keep the date and time in sync across all cluster nodes (via NTP or another source).
- It is recommended to use Oracle JDK.
There are three types of nodes in a Hadoop cluster: master, worker and gateway. It is recommended to maintain separate nodes for each role (e.g. not to mix master nodes with gateway nodes).
For Hadoop services that require a database it is recommended to set up an external database and configure the services to use it (it is not recommended to use embedded databases such as Derby).
For small/medium-sized clusters it is recommended to deploy the services that require a database on the database nodes. On large clusters it is recommended to set up the database on a dedicated host.
It is recommended to set up a ZooKeeper ensemble on at least three nodes.
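The three-node minimum follows from majority voting: an ensemble stays available only while more than half of its servers are up. The short sketch below (function names are illustrative) prints the majority size and the number of tolerated failures for small ensembles.

```python
def quorum_size(ensemble_size):
    """Smallest majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """Servers that can fail while a majority still remains."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Three servers is the smallest ensemble that survives one failure, and a fourth server does not raise that number, which is why odd ensemble sizes are the usual choice.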
It is recommended to deploy HDFS and YARN in high availability mode.
Recommended distribution of services across master nodes in a medium-sized cluster:
Master1: HS2, HM, Oozie, Hue, RDBMS, (Manager)
Master2: ZK, NN, JN, HBM
Master3: ZK, RM, JH, SH, ISS, ICS
Master4: ZK, NN, JN, RM, HBM
ZK - ZooKeeper
NN - NameNode
JN - JournalNode
RM - YARN ResourceManager
JH - MapReduce JobHistory Service
SH - Spark JobHistory Service
ISS - Impala StateStore
ICS - Impala Catalog Service
HS2 - HiveServer2
HM - Hive Metastore
HBM - HBase Master
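The layout above can also be written down as data and inverted to see on how many hosts each service runs, for example to confirm that ZooKeeper has three members and that NameNode and ResourceManager each have two instances for high availability. This is a sketch only: the dictionary copies the table as given, leaves out the parenthesized "(Manager)" entry on the assumption that it is optional, and the function name is illustrative.

```python
from collections import defaultdict

# Service placement copied from the table above.
LAYOUT = {
    "Master1": ["HS2", "HM", "Oozie", "Hue", "RDBMS"],
    "Master2": ["ZK", "NN", "JN", "HBM"],
    "Master3": ["ZK", "RM", "JH", "SH", "ISS", "ICS"],
    "Master4": ["ZK", "NN", "JN", "RM", "HBM"],
}


def hosts_by_service(layout):
    """Invert host -> services into service -> hosts."""
    placement = defaultdict(list)
    for host, services in layout.items():
        for service in services:
            placement[service].append(host)
    return dict(placement)


placement = hosts_by_service(LAYOUT)
print(len(placement["ZK"]), len(placement["NN"]), len(placement["RM"]))  # 3 2 2
```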
CDH from Cloudera is recommended as a general-purpose Hadoop distribution. The main advantages over its rivals are:
- superior administration/data governance tools
- superior stability
When using CDH, it is recommended to use Cloudera Manager for cluster deployment and administration.
When using CDH, it is recommended to use parcels (not RPMs) for cluster deployment.