Hadoop

Advanced

Hadoop is an open-source software framework widely used for reliable and scalable distributed computing on clusters of machines.

This competency area includes performing ETL operations with Apache Pig, modeling data in the Apache Cassandra NoSQL database, and optimizing task execution on Hadoop with Tez, among other skills.

Key Competencies:

  1. Performing ETL operations using Apache Pig - Use Pig Latin scripts for extract-transform-load operations on big data (see the Pig sketch after this list). Applicable for Analyst, Developer.
  2. Store data in Hadoop using the Avro serialization system - Define Avro schemas in JSON for fast, compact serialization (see the Avro sketch below). Applicable for Operations, Developer.
  3. Apache Cassandra NoSQL database - Model and store data using keyspaces, tables, partitions, rows, and columns, and query it with CQL (see the CQL sketch below). Applicable for Architect, Developer.
  4. Store and query data at scale using HBase - Model data using tables, rows, column families, columns, and cells; query it at scale (see the HBase sketch below). Applicable for Architect, Developer.
  5. Perform data transformations at scale using Spark - Transform data using PySpark DataFrames and Spark SQL (see the PySpark sketch below). Applicable for Developer.
  6. Optimize task execution on Hadoop using Tez - Define and run tasks using Tez's dataflow APIs (see the Tez note below). Applicable for Architect, Developer.
  7. Ingest data into Hadoop using Flume and Sqoop - Use Flume to ingest log data and Sqoop to ingest structured, relational data into Hadoop (see the ingestion sketch below). Applicable for Operations, Developer.
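For competency 1, a minimal sketch of a Pig Latin ETL script driven from Python, assuming the `pig` CLI is installed and configured for the cluster; the input path, field layout, and output path are all hypothetical.

```python
import subprocess
import tempfile

# A minimal Pig Latin ETL script: load raw tab-separated logs,
# filter out non-error records, and store the result back to HDFS.
PIG_SCRIPT = """
raw    = LOAD '/data/access_logs' USING PigStorage('\\t')
             AS (ip:chararray, ts:long, status:int);
errors = FILTER raw BY status >= 500;
STORE errors INTO '/data/server_errors' USING PigStorage('\\t');
"""

with tempfile.NamedTemporaryFile("w", suffix=".pig", delete=False) as f:
    f.write(PIG_SCRIPT)
    script_path = f.name

# Run the script with the Pig CLI (assumes `pig` is on PATH and the
# Hadoop configuration is visible to it).
subprocess.run(["pig", "-f", script_path], check=True)
```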
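For competency 2, one way to work with Avro from Python is the `fastavro` library. An Avro schema is ordinary JSON, shown here as the equivalent Python dict; the record layout and file name are hypothetical.

```python
from fastavro import parse_schema, reader, writer

# The Avro schema, expressed as the Python equivalent of its JSON form.
schema = parse_schema({
    "namespace": "example.hadoop",
    "type": "record",
    "name": "Click",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "url",     "type": "string"},
        {"name": "ts",      "type": "long"},
    ],
})

records = [{"user_id": 1, "url": "/home", "ts": 1700000000}]

# Avro files are compact and self-describing: the schema travels with the data.
with open("clicks.avro", "wb") as out:
    writer(out, schema, records)

with open("clicks.avro", "rb") as inp:
    for rec in reader(inp):
        print(rec)
```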
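For competency 3, a CQL sketch using the DataStax `cassandra-driver` package; the contact point, keyspace, table, and data are hypothetical. The partition key (`sensor_id`) groups rows into partitions, and reads are driven by it.

```python
from datetime import datetime
from cassandra.cluster import Cluster

# Connect to a local node (contact point is an assumption).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Keyspace -> table; rows live in partitions keyed by sensor_id and are
# ordered within each partition by the clustering column ts.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS metrics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics.readings (
        sensor_id text,
        ts        timestamp,
        value     double,
        PRIMARY KEY ((sensor_id), ts)
    )
""")

# Writes and reads are expressed in CQL and routed by the partition key.
session.execute(
    "INSERT INTO metrics.readings (sensor_id, ts, value) VALUES (%s, %s, %s)",
    ("s-42", datetime(2024, 1, 1), 21.5),
)
for row in session.execute(
    "SELECT ts, value FROM metrics.readings WHERE sensor_id = %s", ("s-42",)
):
    print(row.ts, row.value)
```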
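For competency 4, a sketch using the `happybase` client, which reaches HBase through its Thrift gateway (assumed to be running on the default port); the table name and column family are hypothetical.

```python
import happybase

# Connect via the HBase Thrift gateway.
connection = happybase.Connection("localhost")
table = connection.table("pages")

# A cell is addressed by (row key, column family:qualifier); values are bytes.
table.put(b"row-001", {b"cf:title": b"Hadoop", b"cf:views": b"42"})

# Point lookup by row key...
row = table.row(b"row-001")
print(row[b"cf:title"])

# ...or scan a range of row keys at scale.
for key, data in table.scan(row_prefix=b"row-"):
    print(key, data)
```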
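For competency 5, the same aggregation expressed twice, once with the PySpark DataFrame API and once with Spark SQL; the input path and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Hypothetical input: one JSON record per line on HDFS.
df = spark.read.json("hdfs:///data/events")

# DataFrame API: filter, derive a date column, aggregate per day.
daily = (
    df.filter(F.col("status") == "ok")
      .withColumn("day", F.to_date("ts"))
      .groupBy("day")
      .agg(F.count("*").alias("events"))
)

# The same transformation via Spark SQL.
df.createOrReplaceTempView("events")
daily_sql = spark.sql("""
    SELECT to_date(ts) AS day, COUNT(*) AS events
    FROM events
    WHERE status = 'ok'
    GROUP BY to_date(ts)
""")

daily.write.mode("overwrite").parquet("hdfs:///data/daily_counts")
```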
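A note on competency 6: Tez's DAG (dataflow) API is a Java API, so it cannot be shown faithfully in a Python sketch. What a script can do is switch an engine such as Pig or Hive onto Tez, which compiles the whole job into a single Tez DAG instead of a chain of MapReduce jobs; the script and query file names below are hypothetical.

```python
import subprocess

# Run a Pig script on the Tez execution engine (assumes Tez is installed
# on the cluster and `pig` is on PATH).
subprocess.run(["pig", "-x", "tez", "-f", "etl.pig"], check=True)

# The equivalent idea for Hive: set the execution engine to Tez.
subprocess.run(
    ["hive", "--hiveconf", "hive.execution.engine=tez", "-f", "report.sql"],
    check=True,
)
```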
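For competency 7, an ingestion sketch that shells out to the two CLIs; the JDBC URL, table, target directory, and agent/config names are all hypothetical.

```python
import subprocess

# Sqoop imports structured, relational data into HDFS in parallel.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host/shop",
    "--table", "orders",
    "--target-dir", "/data/orders",
    "--num-mappers", "4",
], check=True)

# Flume is driven by a configuration file rather than a one-shot command;
# this starts an agent whose config wires a log source to an HDFS sink.
subprocess.run([
    "flume-ng", "agent",
    "--name", "a1",
    "--conf-file", "flume-hdfs.conf",
], check=True)
```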