Hadoop
Advanced
The Hadoop open-source software framework is widely used for reliable and scalable distributed computing on a cluster of machines.
This competency area includes performing ETL operations using Apache Pig, modeling and querying data in the Apache Cassandra NoSQL database, and optimizing task execution on Hadoop using Tez, among others.
Key Competencies:
- Performing ETL operations using Apache Pig - Use Pig Latin for extract-transform-load operations on big data (see the Pig sketch after this list). Applicable for Analyst, Developer.
- Store data in Hadoop using the Avro serialization system - Define Avro schemas in JSON for serialization that is fast and compact (see the Avro sketch below). Applicable for Operations, Developer.
- Apache Cassandra NoSQL database - Model and store data using keyspaces, tables, partitions, rows, and columns. Use CQL to query data (see the CQL sketch below). Applicable for Architect, Developer.
- Store and query data at scale using HBase - Model data using tables, rows, column families, columns, and cells. Query data at scale (see the HBase sketch below). Applicable for Architect, Developer.
- Perform data transformations at scale using Spark - Transform data using PySpark DataFrames and Spark SQL (see the PySpark sketch below). Applicable for Developer.
- Optimize task execution on Hadoop using Tez - Define and run tasks as dataflow graphs (DAGs) using the Tez APIs. Applicable for Architect, Developer.
- Ingest data into Hadoop using Flume and Sqoop - Use Flume to ingest log data and Sqoop to ingest structured data from relational databases into Hadoop (see the Sqoop sketch below). Applicable for Operations, Developer.
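
A minimal Pig sketch, using Pig's embedded-Jython scripting API and run with `pig -x local etl_sketch.py`; the script name, paths, and field names are illustrative assumptions:

```python
# Embedded Pig Latin via Pig's Jython scripting API.
# Run with: pig -x local etl_sketch.py  (the paths below are hypothetical)
from org.apache.pig.scripting import Pig

script = """
raw   = LOAD '$input' USING PigStorage('\\t') AS (user:chararray, url:chararray, ts:long);
valid = FILTER raw BY user IS NOT NULL AND ts > 0;  -- transform: drop malformed rows
grp   = GROUP valid BY user;
hits  = FOREACH grp GENERATE group AS user, COUNT(valid) AS hits;
STORE hits INTO '$output' USING PigStorage('\\t');
"""

# Compile once, bind the $input/$output parameters, and run the ETL job.
result = Pig.compile(script).bind({
    'input': '/data/raw/events.tsv',
    'output': '/data/clean/hits_by_user',
}).runSingle()

if not result.isSuccessful():
    raise RuntimeError('Pig ETL job failed')
```

Running the same script with `pig -x tez` executes the dataflow on Tez instead of MapReduce, which connects this competency to the Tez optimization item above.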
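
A minimal Avro sketch, assuming the third-party fastavro package (`pip install fastavro`); the schema, record values, and file name are illustrative:

```python
from fastavro import writer, reader, parse_schema

# Avro schemas are plain JSON; here the same JSON is expressed as a Python dict.
schema = parse_schema({
    "namespace": "example.clickstream",  # hypothetical namespace
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "user", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
})

records = [{"user": "alice", "url": "/home", "ts": 1}]

# Write a compact binary Avro container file, then read it back.
with open("events.avro", "wb") as out:
    writer(out, schema, records)

with open("events.avro", "rb") as inp:
    for rec in reader(inp):
        print(rec)
```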
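
A minimal CQL sketch, assuming the DataStax cassandra-driver package (`pip install cassandra-driver`) and a Cassandra node on 127.0.0.1; the keyspace, table, and data are illustrative:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # assumed local node
session = cluster.connect()

# Keyspace -> table -> partition (user) -> rows clustered by ts.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS clickstream
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS clickstream.events (
        user text, ts bigint, url text,
        PRIMARY KEY ((user), ts)
    )
""")
session.execute(
    "INSERT INTO clickstream.events (user, ts, url) VALUES (%s, %s, %s)",
    ("alice", 1, "/home"),
)

# Efficient CQL reads target the partition key.
for row in session.execute(
        "SELECT ts, url FROM clickstream.events WHERE user = %s", ("alice",)):
    print(row.ts, row.url)

cluster.shutdown()
```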
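
A minimal HBase sketch, assuming the third-party happybase package (`pip install happybase`) and an HBase Thrift server on localhost:9090; the table, column family, and row keys are illustrative:

```python
import happybase

connection = happybase.Connection("localhost", port=9090)  # assumed Thrift server

# One column family 'cf'; cells are addressed by (row key, family:qualifier).
if b"events" not in connection.tables():
    connection.create_table("events", {"cf": dict()})

table = connection.table("events")
table.put(b"alice#0001", {b"cf:url": b"/home", b"cf:ts": b"1"})

# Point get by row key, then a range scan over a row-key prefix.
print(table.row(b"alice#0001"))
for key, data in table.scan(row_prefix=b"alice#"):
    print(key, data)

connection.close()
```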
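
A minimal PySpark sketch (`pip install pyspark`) showing the same aggregation through the DataFrame API and through Spark SQL; the input path and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# DataFrame API: read, filter, aggregate.
events = spark.read.csv("/data/raw/events.csv", header=True)  # hypothetical path
hits = (events
        .filter(F.col("user").isNotNull())
        .groupBy("user")
        .agg(F.count("*").alias("hits")))

# The same transformation expressed in Spark SQL.
events.createOrReplaceTempView("events")
hits_sql = spark.sql(
    "SELECT user, COUNT(*) AS hits FROM events WHERE user IS NOT NULL GROUP BY user")

hits.show()
spark.stop()
```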
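
A minimal Sqoop sketch; since Sqoop is a command-line tool, the sketch shells out from Python, assuming `sqoop` is on the PATH and that the MySQL connection details and paths (all illustrative) exist:

```python
import subprocess

# Import a relational table into HDFS with four parallel map tasks.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/shop",      # hypothetical database
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",  # avoids a password on the command line
    "--table", "orders",                          # structured, relational source
    "--target-dir", "/data/raw/orders",           # HDFS destination
    "--num-mappers", "4",
]
subprocess.run(cmd, check=True)
```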