Apache Spark

Intermediate

Apache Spark is an open-source framework for large-scale, distributed data processing. It integrates with the Hadoop ecosystem and can run on Hadoop YARN and read from HDFS, though it also runs standalone or on Kubernetes.

This competency area includes combining and analyzing data, performing data aggregations, configuring data sources and sinks, tuning performance, monitoring Spark jobs, and performing transformations and SQL queries on streaming data, among other skills.

Key Competencies:

  1. Combine data using DataFrames - Performing join operations on DataFrames (sketch 1 below). Applicable for Developer.
  2. Analyze data using Spark SQL - Using SQL queries to analyze, aggregate, and group data (sketch 2 below). Applicable for Developer.
  3. Use windowing and partition operations - Performing data aggregations using windowing and partitioning techniques (sketch 3 below). Applicable for Developer.
  4. Configure different sources and sinks for data - Using Spark SQL with Parquet and JSON files, Hive tables, and JDBC connections (sketch 4 below). Applicable for Operations, Developer.
  5. Tune the performance of Spark SQL operations - Caching data and using broadcast hints in SQL queries (sketch 5 below). Applicable for Operations, Developer.
  6. Configure Apache Arrow - Installing and using Apache Arrow with Spark to speed up data transfer between the JVM and Python, improving Spark SQL performance (sketch 6 below). Applicable for Administration, Developer.
  7. Monitor Spark jobs - Using the web UI and the monitoring REST API to track the status of applications, jobs, and stages (sketch 7 below). Applicable for Administration, Developer.
  8. Configure triggers and output modes for streaming data - Using append mode, complete mode, and update mode with file sinks, Kafka sinks, and the foreach sink (sketch 8 below). Applicable for Developer.
  9. Perform transformations and run SQL queries on streaming data - Implementing UDFs to run on streams and performing selection, projection, and aggregations on streams (sketch 8 below). Applicable for Developer.
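
Illustrative Sketches (all PySpark; the table names, data, paths, and endpoints are examples, not prescriptions):

Sketch 1 - Joining DataFrames. A minimal sketch of inner and left joins; the employee and department data is made up for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("joins").getOrCreate()

    employees = spark.createDataFrame(
        [(1, "Ana", 10), (2, "Ben", 20), (3, "Chen", 30)],
        ["emp_id", "name", "dept_id"])
    departments = spark.createDataFrame(
        [(10, "Sales"), (20, "Engineering")],
        ["dept_id", "dept_name"])

    # Inner join keeps only employees whose dept_id has a match;
    # left join keeps every employee, with NULLs where there is none.
    inner = employees.join(departments, on="dept_id", how="inner")
    left = employees.join(departments, on="dept_id", how="left")
    inner.show()
    left.show()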
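
Sketch 2 - Analyzing data with Spark SQL. Registering a DataFrame as a temporary view and querying it with plain SQL; the sales figures are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql").getOrCreate()

    sales = spark.createDataFrame(
        [("Sales", 100.0), ("Sales", 250.0), ("Engineering", 400.0)],
        ["dept", "amount"])
    sales.createOrReplaceTempView("sales")

    # Aggregate and group with SQL against the registered view.
    summary = spark.sql("""
        SELECT dept,
               COUNT(*)    AS num_orders,
               SUM(amount) AS total,
               AVG(amount) AS average
        FROM sales
        GROUP BY dept
        ORDER BY total DESC
    """)
    summary.show()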
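
Sketch 3 - Window and partition operations. Ranking rows within each partition and comparing each row to its partition's average; the scores are made up.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("windows").getOrCreate()

    scores = spark.createDataFrame(
        [("Sales", "Ana", 90), ("Sales", "Ben", 75),
         ("Engineering", "Chen", 88), ("Engineering", "Dee", 95)],
        ["dept", "name", "score"])

    # rank() orders rows within each department, while avg() over an
    # unordered window yields the per-department average.
    by_score = Window.partitionBy("dept").orderBy(F.desc("score"))
    by_dept = Window.partitionBy("dept")
    ranked = (scores
              .withColumn("rank", F.rank().over(by_score))
              .withColumn("dept_avg", F.avg("score").over(by_dept)))
    ranked.show()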
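
Sketch 4 - Sources and sinks. Writing and reading Parquet and JSON files, plus a JDBC source. The paths, URL, table name, and credentials are placeholders for your environment; Hive tables would additionally require a session built with .enableHiveSupport().

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("io").getOrCreate()

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # File sinks, then the same files read back as sources.
    df.write.mode("overwrite").parquet("/tmp/demo_parquet")
    df.write.mode("overwrite").json("/tmp/demo_json")
    parquet_df = spark.read.parquet("/tmp/demo_parquet")
    json_df = spark.read.json("/tmp/demo_json")

    # JDBC source; the connection details are placeholders, and the
    # matching JDBC driver jar must be on the Spark classpath.
    jdbc_df = (spark.read.format("jdbc")
               .option("url", "jdbc:postgresql://dbhost:5432/shop")
               .option("dbtable", "orders")
               .option("user", "reporting")
               .option("password", "secret")
               .load())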
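
Sketch 5 - Performance tuning. Caching a reused DataFrame and hinting a broadcast join, in both the DataFrame and SQL APIs; the data is synthetic.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("tuning").getOrCreate()

    facts = spark.range(1_000_000).withColumn("key", F.col("id") % 100)
    dims = spark.createDataFrame(
        [(i, f"dim_{i}") for i in range(100)], ["key", "label"])

    # Cache a DataFrame that several queries will reuse; the first
    # action (count) materializes the cache.
    facts.cache()
    facts.count()

    # Broadcast hint: ship the small dimension table to every executor
    # instead of shuffling the large fact table.
    joined = facts.join(F.broadcast(dims), on="key")
    joined.explain()  # the plan should show a BroadcastHashJoin

    # The same hint expressed in SQL.
    facts.createOrReplaceTempView("facts")
    dims.createOrReplaceTempView("dims")
    spark.sql(
        "SELECT /*+ BROADCAST(dims) */ * FROM facts JOIN dims USING (key)"
    ).explain()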
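
Sketch 6 - Apache Arrow. Arrow is installed on the Python side (pip install pyarrow) and enabled through a Spark SQL setting; it speeds up JVM-to-Python transfers such as toPandas() and underpins pandas UDFs. The config key shown is the Spark 3.x name.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = (SparkSession.builder.appName("arrow")
             .config("spark.sql.execution.arrow.pyspark.enabled", "true")
             .getOrCreate())

    df = spark.range(1_000_000)

    # With Arrow enabled, toPandas() moves data in columnar batches
    # rather than row by row.
    pdf = df.toPandas()

    # Pandas UDFs use Arrow to exchange data between the JVM and Python.
    @pandas_udf("long")
    def plus_one(s: pd.Series) -> pd.Series:
        return s + 1

    df.select(plus_one("id")).show(5)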
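
Sketch 7 - Monitoring with the REST API. The driver's web UI (default port 4040) exposes a read-only monitoring REST API; the host and port below are placeholders, and the history server serves the same endpoints for completed applications.

    import requests

    base = "http://localhost:4040/api/v1"  # placeholder driver host/port

    # List applications, then the jobs inside each one.
    for app in requests.get(f"{base}/applications").json():
        app_id = app["id"]
        jobs = requests.get(f"{base}/applications/{app_id}/jobs").json()
        for job in jobs:
            print(app_id, job["jobId"], job["status"],
                  job["numCompletedTasks"])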
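
Sketch 8 - Structured Streaming. A streaming query that applies a UDF, selects and projects columns, aggregates, and writes in the complete output mode. The rate source and console sink keep the sketch self-contained; a production job might read from Kafka and write to a file, Kafka, or foreach sink instead.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("streaming").getOrCreate()

    # The built-in rate source generates (timestamp, value) rows for testing.
    stream = (spark.readStream.format("rate")
              .option("rowsPerSecond", 10).load())

    # A UDF applied to the stream, plus selection, projection, aggregation.
    @udf(StringType())
    def parity(v):
        return "even" if v % 2 == 0 else "odd"

    counts = (stream
              .select(F.col("value"), parity("value").alias("parity"))
              .groupBy("parity")
              .count())

    # Aggregations need the complete (or update) output mode; append mode
    # suits non-aggregated streams going to file or Kafka sinks.
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .trigger(processingTime="5 seconds")
             .start())
    query.awaitTermination(20)  # run briefly for the demo, then stop
    query.stop()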