Apache Spark
Advanced
Apache Spark is an open-source distributed data processing engine. It can run on a Hadoop cluster (YARN), on Kubernetes, or in standalone mode.
This competency area covers extracting event time in streams, window operations on streams, handling late data, managing and monitoring streaming queries, and implementing ML and graph algorithms, among others.
Key Competencies:
- Extract event time in streams - Extract event time from streaming data. Applicable for Architect, Developer.
- Stateful window operations on streams - Use sliding and tumbling windows. Applicable for Architect, Developer.
- Implement join operations on streaming data - Perform stream-stream and batch-stream joins. Applicable for Architect, Developer.
- Set watermarks and handle late data - Handle windowed grouping aggregations with watermarks in append and update mode. Applicable for Architect, Developer.
- Manage and monitor streaming queries - Push metrics to external systems using Dropwizard Metrics and access them programmatically. Applicable for Administrator, Developer.
- Use ML Pipelines in Spark - Use transformers and estimators in pipelines; save and load pipelines. Applicable for ML Engineer, Developer.
- Implement ML algorithms using Spark ML - Implement classification, regression, clustering, and collaborative filtering algorithms. Applicable for ML Engineer, Developer.
- Model selection and tuning - Perform hyperparameter tuning using Spark ML. Applicable for ML Engineer, Developer.
- Graph operations in Spark with GraphX - Represent graphs; perform property, structural, and join operations on graphs. Applicable for Architect, Developer.
- Implement graph algorithms - PageRank and triangle counting. Applicable for Architect, Developer.
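
The windowing and watermark competencies above can be illustrated with a small sketch in plain Python. It is not Spark code and does not use Spark's API: the function names (tumbling_window, sliding_windows, count_with_watermark) are hypothetical, event times are plain integers (for example, seconds), and the watermark is advanced after every event, whereas Spark advances it between micro-batches. It only aims to show how an event time maps to windows and why a late event may be dropped.

```python
from collections import defaultdict


def tumbling_window(event_time, size):
    """Return the single [start, end) tumbling window containing event_time."""
    start = event_time - (event_time % size)
    return (start, start + size)


def sliding_windows(event_time, size, slide):
    """Return every [start, end) sliding window containing event_time."""
    # Latest window start that is <= event_time and aligned to the slide.
    start = event_time - (event_time % slide)
    windows = []
    # Walk backwards while the window still covers event_time.
    while start > event_time - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def count_with_watermark(events, size, delay):
    """Count events per tumbling window, dropping events that arrive too late.

    Simplified model (an assumption of this sketch): the watermark is
    (largest event time seen so far) - delay, and an event is dropped
    when the end of its window is at or before the watermark.
    """
    counts = defaultdict(int)
    dropped = []
    max_seen = None
    for event_time in events:
        window = tumbling_window(event_time, size)
        if max_seen is not None and window[1] <= max_seen - delay:
            dropped.append(event_time)  # window already finalized
            continue
        counts[window] += 1
        max_seen = event_time if max_seen is None else max(max_seen, event_time)
    return dict(counts), dropped


if __name__ == "__main__":
    print(tumbling_window(12, 10))
    print(sliding_windows(12, 10, 5))
    print(count_with_watermark([1, 4, 12, 25, 3, 18, 27], size=10, delay=10))
```

In this sketch the event with time 3 arrives after time 25 has been seen, so the watermark is 15 and the window [0, 10) is treated as closed. The event with time 18 arrives equally out of order but is still counted, because its window [10, 20) ends after the watermark.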