Hadoop Deployment Cheat Sheet

Introduction

If you are using, or planning to use, the Hadoop framework for big data and Business Intelligence (BI), this document can help you navigate some of the technology and terminology, and guide you in setting up and configuring the system.

In this document we provide some background information about the framework, the key distributions, modules, components, and related products. We also provide you with single and multi-node Hadoop installation commands and configuration parameters.

The final section includes some tips and tricks to help you get started, and provides guidance in setting up a Hadoop project.

Key Hadoop Distributions

Vendor Strength
Apache Hadoop The open source distribution from Apache
Hortonworks A leading vendor committed to a 100% open source package
Cloudera Open source Hadoop core with proprietary management components for enterprise needs
MapR Uses its own proprietary file system
IBM Integration w/ IBM analytics products
Pivotal Integration w/ Greenplum and Cloud Foundry (CF)

Hadoop Modules

Module Description
Common Common utilities. Supports other Hadoop modules
HDFS Hadoop Distributed File System: provides high-throughput access to application data based on commodity hardware
YARN Yet Another Resource Negotiator: a framework for cluster resource management including job scheduling
MapReduce Software framework for parallel processing of large data sets based on YARN

Hadoop Components

Component Module Description
NameNode HDFS Maintains the directory tree and metadata of the HDFS file system (the equivalent of inodes in a traditional file system)
Secondary NameNode HDFS Checkpointing helper for the NameNode (not a failover node): it periodically merges the edits log into the fsimage file to keep the namespace checkpoint current
JournalNode HDFS Stores the shared edit log used by the active and standby NameNodes in an HA deployment (automatic failover additionally relies on ZooKeeper)
DataNode HDFS Nodes (or servers) that store the actual data
NFS3 Gateway HDFS Daemons that enable NFS3 support
ResourceManager YARN Global daemon that arbitrates resources among all the applications in the Hadoop cluster
ApplicationMaster YARN Takes care of a single application: gets resources for it from the ResourceManager and works with the NodeManager to consume them and monitor the tasks
NodeManager YARN Per-machine agent responsible for containers: it allocates and monitors their resource usage (CPU, memory, disk) and reports back to the ResourceManager
Container YARN An allocation of resources (CPU, memory) on a specific machine, in which a specific application runs its tasks

Hadoop Ecosystem – Related Products

Product Description
Ambari A completely open-source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters
Apex Stream ("data in motion") processing platform based on YARN
Azkaban Workflow job scheduling and management system for Hadoop
Flume Reliable, distributed, and highly available service for collecting and streaming log data into HDFS
Knox Authentication and Access gateway service for Hadoop
HBase Distributed non-relational database that runs on top of HDFS
Hive Data warehouse system based on Hadoop
Mahout Machine learning library (clustering, classification and batch-based collaborative filtering) implemented on top of MapReduce
Impala Enables low-latency SQL queries on HBase and HDFS
Oozie Workflow job scheduling and management system for Hadoop
Ranger Access policy manager for HDFS files, folders, databases, tables and columns
Spark Cluster computing framework that can run on YARN and read from HDFS. Supports streaming and batch jobs, provides a SQL interface (Spark SQL), and includes a machine learning library (MLlib)
Sqoop Command-line tool for bulk data transfer between relational databases (RDBMS) and Hadoop
Tez Application framework for running complex Directed Acyclic Graphs (DAGs) of tasks, based on YARN
Pig High-level platform (with its own scripting language, Pig Latin) for creating and running programs on MapReduce, Tez and Spark
ZooKeeper Distributed name registry, synchronization service and configuration service that is used as a sub-system in Hadoop

Major Hadoop Cloud Providers

Cloud operator Service name
Amazon Web Services EMR (Elastic MapReduce)
IBM SoftLayer IBM BigInsights
Microsoft Azure HDInsight

Common Data Formats

Format Description
Avro JSON-based format that includes RPC and serialization support. Designed for systems that exchange data.
Parquet Columnar storage format
ORC Optimized Row Columnar format: a fast columnar storage format
RCFile Record Columnar File: a data placement format for relational tables
SequenceFile Flat binary file format that stores data as key-value records
Unstructured Hadoop also supports various unstructured data formats

Single Node Installation

Requirement Task Command
Java Installation Check version
>java -version
Install
>sudo apt-get -y update && sudo apt-get -y install default-jdk
Create User and Permissions Create User
>useradd hadoop

>passwd hadoop

>mkdir /home/hadoop

>chown -R hadoop:hadoop /home/hadoop
Create keys
>su - hadoop

>ssh-keygen -t rsa

>cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

>chmod 0600 ~/.ssh/authorized_keys
Install from source
>wget http://apache.spd.co.il/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz

>tar xzf hadoop-2.7.2.tar.gz

>mv hadoop-2.7.2 hadoop
Environment Env Vars
>vi ~/.bashrc
# add the following lines:
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

# then reload the environment:
>source ~/.bashrc
 Set JAVA_HOME
>vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# set JAVA_HOME to your JDK installation path, for example:
export JAVA_HOME=/opt/jdk1.8.0_05/
Configuration files Edit if required (all under $HADOOP_HOME/etc/hadoop/; see the sketch below)
core-site.xml

hdfs-site.xml

mapred-site.xml

yarn-site.xml
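For a pseudo-distributed single-node setup the defaults usually need only two changes; the following is a minimal sketch (host, port and replication factor are illustrative assumptions, adjust them to your environment):
>vi $HADOOP_HOME/etc/hadoop/core-site.xml
# add inside <configuration>:
<property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>

>vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
# add inside <configuration>:
<property><name>dfs.replication</name><value>1</value></property>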
Format NameNode
>hdfs namenode -format
Start System
>cd $HADOOP_HOME/sbin/

>start-dfs.sh

>start-yarn.sh
Test System
>hdfs dfs -mkdir /user

>hdfs dfs -mkdir /user/hadoop

>hdfs dfs -put /var/log/httpd logs
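To confirm the installation, one option is to check the running Java daemons and submit the bundled example job (the jar path assumes the default Hadoop 2.7.2 tarball layout):
>jps    # expect NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
>hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 2 5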

Multi-node Installation

Task Command
Configure hosts on each node
>vi /etc/hosts

 192.168.1.11 hadoop-master

 192.168.1.12 hadoop-slave-1

 192.168.1.13 hadoop-slave-2
Enable cross node authentication
>su - hadoop

>ssh-keygen -t rsa

>ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master

>ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1

>ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2

>chmod 0600 ~/.ssh/authorized_keys

>exit
Copy system
>su - hadoop

>cd /opt/hadoop

>scp -r hadoop hadoop-slave-1:/opt/hadoop

>scp -r hadoop hadoop-slave-2:/opt/hadoop
Configure Master
>su - hadoop

>cd /opt/hadoop/hadoop

>vi etc/hadoop/masters
# add your master node to the file:
hadoop-master

>vi etc/hadoop/slaves
# add your slave nodes to the file, one hostname per line:
hadoop-slave-1
hadoop-slave-2

Format the NameNode on the master
>su - hadoop

>cd /opt/hadoop/hadoop

>bin/hdfs namenode -format
Start system
>sbin/start-all.sh
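A quick sanity check that all nodes joined the cluster (hostnames follow the example above):
>jps    # on hadoop-master: NameNode, SecondaryNameNode, ResourceManager
>jps    # on each slave: DataNode, NodeManager
>bin/hdfs dfsadmin -report    # should list hadoop-slave-1 and hadoop-slave-2 as live datanodes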

Backup HDFS Metadata

Task Command
Stop the cluster
>stop-all.sh
Perform cold backup to metadata directories
>cd /data/dfs/nn

>tar -czvf /tmp/backup.tar.gz .
Start the cluster
>start-all.sh
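If stopping the cluster is not practical, the latest fsimage can also be pulled from a running NameNode as a complementary (not replacement) backup; the target directory below is illustrative:
>hdfs dfsadmin -fetchImage /tmp/nn-backup/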

HDFS Basic Commands

Task Command
List the contents of the /data directory
>hdfs dfs -ls /data/
Upload a file from the local file system to HDFS
>hdfs dfs -put logs.csv /data/
Read the content of the file from HDFS
>hdfs dfs -cat /data/logs.csv
Change the permission of a file
>hdfs dfs -chmod 744 /data/logs.csv
Set the replication factor of a file to 3
>hdfs dfs -setrep -w 3 /data/logs.csv
Check the size of the file
>hdfs dfs -du -h /data/logs.csv
Create a subdirectory and move the file into it
>hdfs dfs -mkdir /data/logs
>hdfs dfs -mv /data/logs.csv /data/logs/
Remove a directory from HDFS
>hdfs dfs -rm -r /data/logs
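Two related commands that often come up alongside the above (paths are illustrative):
>hdfs dfs -get /data/logs.csv /tmp/logs.csv    # copy a file from HDFS back to the local file system
>hdfs dfs -df -h    # show configured, used and remaining HDFS capacity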

HDFS Administration

Task Command
Balance the cluster storage
>hdfs balancer -threshold 10    # threshold is the allowed utilization deviation in percent
Run the NameNode
>hdfs namenode
Run the secondary NameNode
>hdfs secondarynamenode
Run a datanode
>hdfs datanode
Run the NFS3 gateway
>hdfs nfs3
Run the RPC portmap for the NFS3 gateway
>hdfs portmap
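Once the portmap and NFS3 gateway daemons are running, the HDFS namespace can be mounted from a client; a sketch, assuming the gateway host placeholder and an existing /mnt/hdfs mount point (run as root):
>mount -t nfs -o vers=3,proto=tcp,nolock <nfs3-gateway-host>:/ /mnt/hdfs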

YARN

Task Command
Show yarn help
>yarn
Define configuration file
>yarn [--config confdir]
Define log level
>yarn [--loglevel loglevel] where loglevel is FATAL, ERROR, WARN, INFO, DEBUG or TRACE
User commands
Show Hadoop classpath
>yarn classpath
Show and kill application
>yarn application
Show application attempt
>yarn applicationattempt
Show container information
>yarn container
Show node information
>yarn node
Show queue information
>yarn queue
Administration commands
Start NodeManager
>yarn nodemanager
Start Proxy web server
>yarn proxyserver
Start ResourceManager
>yarn resourcemanager
Run ResourceManager admin client
>yarn rmadmin
Start Shared Cache Manager
>yarn sharedcachemanager
Start TimeLineServer
>yarn timelineserver
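A typical sequence when investigating a running application (the application ID is a placeholder; the last command requires log aggregation to be enabled):
>yarn application -list    # list running applications and their IDs
>yarn application -status <application_id>    # show details for one application
>yarn logs -applicationId <application_id>    # dump the aggregated container logs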

MapReduce

Task Command
Submit the WordCount MapReduce job to the cluster (input directory preparation is shown after this table)
>hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount input logs-output
Check the output of this job in HDFS
>hadoop fs -cat logs-output/*
Submit a scalding job
>hadoop jar scalding.jar com.twitter.scalding.Tool Scalding
Kill a MapReduce job
>yarn application -kill <application_id>
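The WordCount job above reads from an input directory in HDFS; a minimal way to prepare one (the sample file is an arbitrary choice):
>hdfs dfs -mkdir -p input
>hdfs dfs -put /etc/hosts input/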

Default Web UIs

Resource Default URI
NameNode http://<namenode-host>:50070/
DataNode http://<datanode-host>:50075/
Secondary NameNode http://<secondary-namenode-host>:50090/
Resource Manager http://<resourcemanager-host>:8088/
HBase Master http://<hbase-master-host>:60010/

Secure Hadoop

Aspect Best Practice
Authentication
  • Define users
  • Enable Kerberos in Hadoop
  • Set up a Knox gateway to control access and authentication to the HDFS cluster
  • Integrate with the organization’s SSO and LDAP
Authorization
  • Define groups
  • Define HDFS Permissions
  • Define HDFS ACLs (see the example after this table)
  • Enable Ranger policies to control access to HDFS folders, directories, databases, tables and columns
Audit
  • Enable process execution audit trail
Data Protection
  • Enable wire encryption, at the perimeter (Knox) and/or natively within Hadoop
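As an illustration of the HDFS ACLs mentioned above (user name and path are hypothetical; requires dfs.namenode.acls.enabled=true in hdfs-site.xml):
>hdfs dfs -setfacl -m user:analyst1:r-x /data/logs
>hdfs dfs -getfacl /data/logs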

Hadoop Tips and Tricks

Project Concept
Iterate cluster sizing to optimize performance and meet actual load patterns
Hardware
Clusters with more nodes recover faster
The higher the storage per node, the longer the recovery time
Use commodity hardware:

  • Use large slow disks (SATA) without RAID (3-6TB disks)
  • Use as much RAM as is cost-effective (96-192GB RAM)
  • Use mainstream CPU with as many cores as possible (8-12 cores)
Invest in reliable hardware for the NameNodes
NameNode RAM should be 2GB + 1GB for every 100TB raw disk space
Networking cost should be 20% of hardware budget
40 nodes is the critical mass to achieve best performance/cost ratio
Plan for net storage capacity of about 25% of raw capacity: with three replicas, 75% of raw storage holds data and 25% remains spare (for example, 1PB of raw disk yields roughly 250TB of net capacity)
Operating System and JVM
Must be 64-bit
Set the file descriptor limit to 64K (ulimit; see the example after this list)
Enable time synchronization using NTP
Speed up reads by mounting data disks with the noatime option
Disable Transparent Huge Pages (THP)
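One way to apply the file descriptor and noatime recommendations above (values, device and mount point are illustrative assumptions):
>vi /etc/security/limits.conf
hadoop soft nofile 65536
hadoop hard nofile 65536

>vi /etc/fstab
/dev/sdb1  /data/1  ext4  defaults,noatime  0 0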
System
Enable monitoring using Ambari
Monitor the NameNode checkpoints to verify that they occur on schedule; valid checkpoints are what let you recover your cluster when needed
Avoid reaching 90% cluster disk utilization
Balance the cluster periodically using balancer
Edit metadata files using Hadoop utilities only, to avoid corruption
Keep replication >= 3
Place quotas and limits on users and project directories, as well as on tasks to avoid cluster starvation
Clean /tmp regularly – it tends to fill up with junk files
Optimize the number of reducers to avoid system starvation
Verify that the file system you selected is supported by your Hadoop vendor
Data and System Recovery
A single disk failure is not an issue; HDFS replication absorbs it
DataNode failures are not a major issue; data is re-replicated automatically
NameNode failure is an issue even in a clustered environment
Make regular backups of namenode metadata
Enable NameNode clustering using ZooKeeper
Provide sufficient disk space for NameNode logging
Enable trash in core-site.xml (fs.trash.interval) to avoid accidental permanent deletion (rm -r); see the sketch below
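A sketch of the trash setting referenced above (the 1440-minute, i.e. one-day, retention value is an illustrative choice):
>vi $HADOOP_HOME/etc/hadoop/core-site.xml
# add inside <configuration>:
<property><name>fs.trash.interval</name><value>1440</value></property>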