Apache Spark

Introduction

Spark architecture (components)

Hadoop (YARN)

Spark cluster management (YARN, Standalone, Mesos)

Spark vs MapReduce

Spark Core and basic abstractions

Spark Data structures (RDD, DataFrame, Dataset)

Spark Shell

RDD creation (external source, parallelized collection, from an existing RDD)

Transformations and actions (lazy evaluation)

Spark meets Hive

What is Hive?

Hive architecture

RDBMS vs Hive

Partitioning and bucketing in Hive

Integrate Hive with Spark

Playing with RDDs

Ways to create an RDD

Types of RDD

RDD *ByKey operations (reduceByKey, combineByKey, aggregateByKey)

Pair RDDs, parallel RDDs, shuffled RDDs

Spark SQL (DataFrame, Dataset)

Ways to create a DataFrame

Spark SQL concepts and overview

Hive Queries through Spark

DataFrame transformations

Spark SQL functions

Spark storage formats: Parquet, ORC

Save a DataFrame

Dataset (type safety)

Spark (polyglot) language support (Scala/Python)

IDEs for Scala and Python

Scala and Python APIs and dependencies

Maven overview for Scala

Spark performance techniques

Partitions (Scala and Python partitioners)

Distributed execution (shuffling)

Caching mechanisms available in Spark

Serialization in Python and Scala

Shared Variables: Broadcast Variables

Shared Variables: Accumulators

Monitoring an application

Spark web UI

Spark History Server

Log4j using Python

DAG (Directed Acyclic Graph)

Deploying a Spark application

spark-submit command options

Spark cluster mode vs client mode

Deploying and packaging Scala code (word count)

Deploying and packaging Python code (storing an RDD)

Spark on cloud (AWS)

Overview of the AWS cloud

IAM users

IAM policies

AWS pricing

EMR and EC2

Deploying Spark code to AWS
