Spark Syllabus | Zealous-Tech

Apache Spark

Introduction

Spark Architecture (components)

Hadoop (YARN)

Spark cluster management (YARN, Standalone, Mesos)

Spark vs MapReduce

Spark Core and basic abstractions

Spark Data structures (RDD, DataFrame, Dataset)

Spark Shell

RDD creation (external source, parallelized collection, from an existing RDD)

Transformations and actions (lazy evaluation)

Spark meets Hive

What is Hive?

Hive architecture

RDBMS vs Hive

Partitioning and bucketing in Hive

Integrate Hive with Spark

Playing with RDDs

Ways to create an RDD

Types of RDD

RDD *ByKey operations (reduceByKey, combineByKey, aggregateByKey)

Pair RDDs, parallel RDDs, shuffled RDDs

Spark SQL (DataFrame, Dataset)

Ways to create a DataFrame

Spark SQL concepts and overview

Hive queries through Spark

DataFrame transformations

Spark SQL functions

Spark storage formats: Parquet, ORC

Saving a DataFrame

Dataset (type safety)

Spark (polyglot) language support (Scala/Python)

IDEs for Scala and Python

Scala and Python APIs and dependencies

Maven overview for Scala

Spark performance techniques

Partitions (Scala and Python partitioners)

Distributed execution (shuffling)

Caching mechanisms available in Spark

Serialization in Python and Scala

Shared Variables: Broadcast Variables

Shared Variables: Accumulators

Monitoring applications

Spark web UI

Spark history server

Log4j using Python

DAG (Directed Acyclic Graph)

Deploying Spark applications

spark-submit command options

Spark cluster mode vs client mode

Packaging and deploying Scala code (word count)

Packaging and deploying Python code (storing an RDD)

Spark on cloud (AWS)

Overview of the AWS cloud

IAM users

IAM policies

AWS pricing

EMR and EC2

Deploying Spark code to AWS