PySpark

Basic

Apache Spark is an open-source distributed data-processing framework. It can run standalone or on top of cluster managers such as Hadoop YARN and integrates closely with the Hadoop ecosystem.

This competency area includes installation of Spark standalone, executing commands on the Spark interactive shell, reading and writing data using Data Frames, data transformation, and running Spark on the Cloud, among others.

Key Competencies, each illustrated with a short sketch after the list:

  1. Install and set up Spark - Install Spark standalone on a machine, configure environment variables, and install PySpark using pip. Applicable for Administrator and Developer.
  2. Execute commands on the Spark interactive shell - Performing basic read, write, and transform operations on the Spark shell. Applicable for Operations and Developer.
  3. Use RDDs in Spark 2 - Performing in-memory transformations using lambdas and converting RDDs to Data Frames. Applicable for Developer.
  4. Use Data Frames in Spark 2 - Reading and writing data using Data Frames (Datasets in Scala). Applicable for Developer.
  5. Perform transformations and actions on data - Performing grouping, aggregation, and ordering of data. Applicable for Developer.
  6. Submit and run a job on a Spark cluster - Using spark-submit to run long-running jobs on a Spark cluster. Applicable for Operations and Developer.
  7. Create and use shared variables in Spark - Use broadcast variables and accumulators. Applicable for Developer.
  8. Monitor Spark jobs - View scheduler stages, tasks, and executor information. Applicable for Administrator and Developer.
  9. Run Spark on the Cloud - Set up Spark on Amazon EMR, Azure HDInsight, and Google Cloud Dataproc, and run Spark jobs. Applicable for Administrator and Developer.
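
A minimal check for competency 1, assuming PySpark was installed with pip (pip install pyspark) and run locally; the application name is arbitrary:

    # Verify a pip-based install by starting a local SparkSession.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")        # run locally on all available cores
        .appName("install-check")  # arbitrary application name
        .getOrCreate()
    )
    print(spark.version)           # prints the installed Spark version
    spark.stop()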
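
For competency 2, the interactive shell started with the pyspark command already provides a SparkSession as spark and a SparkContext as sc. A sketch of a read-transform-write sequence; the file people.csv and its age column are made up:

    # Inside the pyspark shell; `spark` is predefined, no imports needed.
    df = spark.read.csv("people.csv", header=True, inferSchema=True)
    adults = df.filter(df.age >= 18)   # simple row filter
    adults.show(5)                     # inspect a few rows
    adults.write.mode("overwrite").parquet("adults_parquet")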
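
For competency 3, a sketch of in-memory RDD transformations with lambdas, followed by conversion to a Data Frame; the data is invented:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize([("alice", 3), ("bob", 5), ("alice", 2)])
    doubled = rdd.map(lambda kv: (kv[0], kv[1] * 2))   # transform with a lambda
    totals = doubled.reduceByKey(lambda a, b: a + b)   # aggregate per key

    df = totals.toDF(["name", "total"])                # RDD -> Data Frame
    df.show()
    spark.stop()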
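
For competency 4, reading and writing through the Data Frame API; the paths, the JSON source, and the country column are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dataframe-io").getOrCreate()

    orders = spark.read.json("orders.json")   # schema is inferred from the JSON
    orders.printSchema()

    # Write back out as Parquet, partitioned by a column assumed to exist.
    orders.write.mode("overwrite").partitionBy("country").parquet("orders_parquet")
    spark.stop()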
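
For competency 5, grouping, aggregating, and ordering are lazy transformations; show() is the action that triggers execution. The sample rows and column names are invented:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("agg-demo").getOrCreate()

    sales = spark.createDataFrame(
        [("US", "book", 12.0), ("US", "pen", 1.5), ("DE", "book", 9.0)],
        ["country", "item", "amount"],
    )

    summary = (
        sales.groupBy("country")                 # transformation
        .agg(F.sum("amount").alias("revenue"),
             F.count("*").alias("orders"))
        .orderBy(F.desc("revenue"))              # still lazy
    )
    summary.show()                               # action: runs the job
    spark.stop()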
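
For competency 6, a self-contained script laid out for spark-submit; the script name, master URL, and input path are placeholders:

    # word_count.py -- submitted with, for example:
    #   spark-submit --master spark://master-host:7077 word_count.py input.txt
    import sys
    from operator import add
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("word-count").getOrCreate()
        lines = spark.read.text(sys.argv[1]).rdd.map(lambda row: row[0])
        counts = (
            lines.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(add)
        )
        for word, count in counts.collect():
            print(word, count)
        spark.stop()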
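
For competency 7, a broadcast variable ships a read-only value to every executor once, while an accumulator collects a counter back to the driver; the lookup table and codes are made up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("shared-vars").getOrCreate()
    sc = spark.sparkContext

    names = sc.broadcast({"US": "United States", "DE": "Germany"})  # read-only on executors
    unknown = sc.accumulator(0)                                     # incremented on executors

    def expand(code):
        if code not in names.value:
            unknown.add(1)
            return code
        return names.value[code]

    codes = sc.parallelize(["US", "DE", "FR", "US"])
    print(codes.map(expand).collect())
    print("unmapped codes:", unknown.value)   # accumulator read back on the driver
    spark.stop()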
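
For competency 8, most monitoring is done in the web UI (port 4040 by default for a running application), which shows scheduler stages, tasks, and executors. The status tracker gives a programmatic view of jobs and stages; a sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("monitor-demo").getOrCreate()
    sc = spark.sparkContext

    sc.parallelize(range(10000)).map(lambda x: x * x).count()   # run something to observe

    tracker = sc.statusTracker()
    for job_id in tracker.getJobIdsForGroup():    # jobs in the default group
        info = tracker.getJobInfo(job_id)
        print(job_id, info.status, info.stageIds)
    spark.stop()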
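
For competency 9, the same Data Frame code runs on Amazon EMR, Azure HDInsight, or Google Cloud Dataproc once a cluster is provisioned; typically only the storage paths differ. A sketch with placeholder bucket names and an assumed event_type column:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cloud-demo").getOrCreate()

    # Each platform's connector resolves its own URI scheme; paths are placeholders.
    events = spark.read.parquet("s3://my-bucket/events/")      # Amazon EMR (S3)
    # events = spark.read.parquet("gs://my-bucket/events/")    # Dataproc (Cloud Storage)
    # events = spark.read.parquet("abfss://data@myaccount.dfs.core.windows.net/events/")  # HDInsight (ADLS Gen2)

    events.groupBy("event_type").count().show()
    spark.stop()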