Hadoop
Basic
The Hadoop open-source software framework is widely used for reliable and scalable distributed computing on a cluster of machines.
This competency area includes running a single-node cluster in standalone mode and in pseudo-distributed mode, running shell commands to interface with HDFS, and performing parallel processing tasks, among others.
Key Competencies:
- Single-node cluster in standalone mode - Install a Java environment, download the Hadoop distribution, and run Hadoop as a single standalone Java process. Applicable for Operations, Developer.
- Single-node cluster in pseudo-distributed mode - Configure passwordless SSH, format the NameNode, start the NameNode and DataNode daemons, and configure and run the YARN processes (a minimal configuration is sketched after this list). Applicable for Operations, Developer.
- Monitor cluster and jobs - Monitor the NameNode and DataNodes using the NameNode web interface (port 9870 by default in Hadoop 3.x), and monitor resource usage using the ResourceManager web interface (port 8088 by default). Applicable for Administration, Developer.
- Run shell commands to interface with HDFS - Run basic list and view commands on HDFS (such as `hdfs dfs -ls` and `hdfs dfs -cat`), run commands to store and read data, and copy files to and from HDFS with `hdfs dfs -put` and `hdfs dfs -get`; the same operations are also available programmatically, as sketched after this list. Applicable for Operations, Developer.
- Perform parallel processing tasks using MapReduce - Set up Mapper and Reducer classes to process data stored in HDFS, derive them from the right base classes, and set up the basic configuration to run MapReduce jobs (see the WordCount sketch after this list). Applicable for Developer.
- Schedule and manage tasks with YARN - Use the FIFO scheduler, the Capacity Scheduler, and the Fair Scheduler; configure task queues; and submit MapReduce jobs to a specific queue (see the queue-submission sketch after this list). Applicable for Administration, Developer.
- Set up and configure a Hadoop cluster on a cloud platform - Configure a simple Hadoop cluster using Amazon EMR, Azure HDInsight, or Google Cloud Dataproc. Applicable for Administration, Developer.
- Run MapReduce jobs on Hadoop on a cloud platform - Run MapReduce jobs on Amazon EMR, Azure HDInsight, or Google Cloud Dataproc, and configure bucket storage rather than HDFS storage on the cloud (see the bucket-storage sketch after this list). Applicable for Operations, Developer.
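
A minimal sketch of the pseudo-distributed configuration, following the Apache Hadoop single-node setup guide: the two files below (under `etc/hadoop`) point clients at a local NameNode and, since there is only one DataNode, keep a single replica per block. The localhost address and port are the guide's defaults, not requirements.

```xml
<!-- etc/hadoop/core-site.xml: point clients at the local NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

```xml
<!-- etc/hadoop/hdfs-site.xml: one DataNode, so keep a single replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```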
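
The HDFS shell commands have a programmatic counterpart in the `org.apache.hadoop.fs.FileSystem` API. A minimal sketch, assuming a running cluster with `core-site.xml` on the classpath; the paths `/user/demo` and `/tmp/input.txt` are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasics {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);     // the filesystem named by fs.defaultFS

        // Copy a local file into HDFS (equivalent to `hdfs dfs -put`).
        fs.copyFromLocalFile(new Path("file:///tmp/input.txt"),
                             new Path("/user/demo/input.txt"));

        // List a directory (equivalent to `hdfs dfs -ls /user/demo`).
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }

        fs.close();
    }
}
```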
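
For the MapReduce competency, the canonical example is WordCount, essentially as it appears in the Apache Hadoop MapReduce tutorial: the Mapper and Reducer derive from the framework's base classes, and the driver supplies the basic job configuration.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Derives from the Mapper base class; emits (word, 1) for each token in a line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Derives from the Reducer base class; sums the counts for each word.
    // Its input and output types match, so it also serves as the combiner.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: the basic configuration needed to run the job on HDFS input/output paths.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```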
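
Submitting a job to a specific YARN queue is a configuration property rather than new code. A minimal sketch; `analytics` is a hypothetical queue name that must already be defined in the scheduler configuration (for the Capacity Scheduler, in `capacity-scheduler.xml`):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class QueuedJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route the job to a specific YARN queue; "analytics" is a hypothetical
        // queue name and must exist in the scheduler configuration.
        conf.set("mapreduce.job.queuename", "analytics");

        Job job = Job.getInstance(conf, "queued pass-through job");
        job.setJarByClass(QueuedJob.class);
        // No mapper/reducer set: Hadoop's identity base classes pass records
        // through, which is enough to observe queue placement in the
        // ResourceManager web UI.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```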
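
On the cloud platforms, the driver code stays the same; pointing a job at bucket storage rather than HDFS is a matter of the input and output URIs. A minimal sketch with a hypothetical bucket name; the scheme depends on the platform (`s3://` on EMR, `wasbs://` or `abfs://` on HDInsight, `gs://` on Dataproc) and assumes the matching storage connector and credentials are configured on the cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CloudBucketJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "bucket-backed job");
        job.setJarByClass(CloudBucketJob.class);
        // Read from and write to object storage instead of HDFS.
        // "my-demo-bucket" is a hypothetical bucket name; swap the scheme
        // for the platform you are running on.
        FileInputFormat.addInputPath(job, new Path("s3://my-demo-bucket/input"));
        FileOutputFormat.setOutputPath(job, new Path("s3://my-demo-bucket/output"));
        // With no mapper/reducer set, the identity classes copy input to output.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```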