Apache Spark
Introduction
Spark Architecture (components)
Hadoop (YARN)
Spark cluster management (YARN, Standalone, Mesos)
Spark vs MapReduce
Spark core and basic abstraction
Spark Data structures (RDD, DataFrame, Dataset)
Spark Shell
RDD creation (external source, parallelized collection, from an existing RDD)
Transformations and actions (lazy evaluation)
Spark meets Hive
What is Hive?
Hive architecture
RDBMS vs Hive
Partitioning and bucketing in Hive
Integrate Hive with Spark
Playing with RDDs
Ways to create an RDD
Types of RDD
RDD *ByKey operations (reduceByKey, combineByKey, aggregateByKey)
Pair RDDs, parallel RDDs, shuffled RDDs
Spark SQL (DataFrame, Dataset)
Ways to create a DataFrame
Spark SQL concepts and overview
Hive Queries through Spark
DataFrame transformations
Spark SQL functions
Spark storage formats (Parquet, ORC)
Saving a DataFrame
Dataset (type safety)
Spark (polyglot) language support (Scala/Python)
IDEs for Scala and Python
Scala and Python APIs and dependencies
Maven overview for Scala
Spark performance techniques
Partitions (Scala and Python partitioners)
Distributed execution (shuffling)
Caching mechanisms available in Spark
Serialization in Python and Scala
Shared Variables: Broadcast Variables
Shared Variables: Accumulators
Monitoring applications
Spark web UI
Spark history server
Log4j using Python
DAG (Directed Acyclic Graph)
Deploying Spark applications
spark-submit command options
Spark cluster vs client mode
Packaging and deploying Scala code (word count)
Packaging and deploying Python code (storing an RDD)
Spark on cloud (AWS)
AWS cloud overview
IAM users
IAM policies
AWS pricing
EMR and EC2
Deploying Spark code to AWS