Spark has become popular among data scientists and big data enthusiasts. If you are looking for the best collection of Apache Spark Interview Questions for your data analyst, big data or machine learning job, you have come to the right place.

In this Spark Tutorial, we shall go through some of the frequently asked Spark Interview Questions.

Entry Level Spark Interview Questions

Q1 - What is Apache Spark?

Apache Spark is an open-source project from the Apache Software Foundation. It is a data processing engine used for data processing and data analytics, with built-in libraries for Machine Learning, Graph Processing, and SQL Querying. Spark is horizontally scalable and, because it processes data in memory, is much faster than Hadoop MapReduce for many workloads.

Q2 - When should you choose Apache Spark?

Choose Apache Spark when :

  • The application needs to scale.
  • The application needs both batch and real-time processing of records.
  • The application needs to connect to multiple data stores, such as Apache Cassandra, Apache HBase, and SQL databases.
  • The application should be able to query structured datasets spread across different database platforms.

Q3 - Which built-in libraries does Spark have?

Spark has four built-in libraries. They are :

  • SQL and DataFrames
  • Spark Streaming
  • MLlib (Machine Learning Library)
  • GraphX

Q4 - How fast is Apache Spark when compared to Hadoop ? Give us an example.

Apache Spark can be up to about 100 times faster than Apache Hadoop for in-memory workloads. For a typical Logistic Regression problem, if Hadoop MapReduce takes around 110 seconds to complete, Spark finishes in roughly 1 second.

Q5 - Why is Spark faster than Hadoop?

Spark is so fast because it uses a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine. Just as importantly, Spark keeps intermediate results in memory, whereas Hadoop MapReduce writes them to disk between stages.

Q6 - Which programming languages could be used for Spark Application Development?

One can use any of the following programming languages :

  • Java
  • Scala
  • Python
  • R
  • SQL

Q7 - On which platforms can Spark run?

Spark can run in its own standalone cluster mode, or on the following platforms and cluster managers :

  • Hadoop YARN
  • EC2
  • Mesos
  • Kubernetes

Q8 - Which data sources can Spark access?

Spark can access data from hundreds of sources. Some of them are listed below, followed by a short example of reading from two of them :

  • HDFS
  • Apache Cassandra
  • Apache HBase
  • Apache Hive
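
As a quick illustration, reading from two of these sources might look like this. This is a minimal sketch: spark is assumed to be an existing SparkSession built with enableHiveSupport(), and the HDFS path and Hive table name are hypothetical.

// imports: org.apache.spark.sql.SparkSession, Dataset, Row
// Read a text file stored in HDFS
Dataset<String> logs = spark.read().textFile("hdfs://namenode:8020/data/logs.txt");

// Read an existing Hive table
Dataset<Row> sales = spark.read().table("sales");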

Q9 - Can structured data be queried in Spark? If so, how?

Yes, structured data can be queried using either of the following; an example is shown after the list :

  • SQL
  • DataFrame API
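
For example, the same query can be expressed both ways. This is a minimal sketch: spark is assumed to be an existing SparkSession, and the input file people.json is hypothetical.

// imports: org.apache.spark.sql.Dataset, Row; static org.apache.spark.sql.functions.col
Dataset<Row> people = spark.read().json("people.json");
people.createOrReplaceTempView("people");

// Query with SQL
Dataset<Row> adultsSql = spark.sql("SELECT name, age FROM people WHERE age >= 18");

// Equivalent query with the DataFrame API
Dataset<Row> adultsDf = people.filter(col("age").geq(18)).select("name", "age");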

Q10 - What is Spark MLlib?

MLlib is Spark’s built-in, scalable Machine Learning library. It contains many machine learning algorithms and utilities to transform data and extract useful information or inferences from it.

Q11 - How is MLlib scalable?

MLlib scales because its algorithms run on Spark’s distributed execution engine: the DataFrame-based API lets the same pipeline operate on data partitioned across the nodes of a cluster.

Q12 - What kinds of machine learning use cases does MLlib solve?

MLlib contains common learning algorithms that can solve problems like the following; a small clustering sketch follows the list :

  • Clustering
  • Classification
  • Regression
  • Recommendation
  • Topic Modelling
  • Frequent itemsets
  • Association rules
  • Sequential pattern mining
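
As one illustration, a clustering job with MLlib’s KMeans might look like this. This is a minimal sketch: spark is assumed to be an existing SparkSession, and the libsvm-formatted input file is hypothetical.

// imports: org.apache.spark.ml.clustering.KMeans, KMeansModel; org.apache.spark.sql.Dataset, Row
Dataset<Row> features = spark.read().format("libsvm").load("sample_kmeans_data.txt");

// Cluster the feature vectors into 2 groups
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(features);

// Assign each input row to one of the clusters
Dataset<Row> predictions = model.transform(features);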

Q13 - What is Spark Context?

A SparkContext instance sets up internal services for a Spark Application and establishes a connection to a Spark execution environment (the cluster). The SparkContext is created by the Spark Driver Application.

Q14 - What is an RDD?

RDD, short for Resilient Distributed Dataset, is a collection of elements partitioned across the nodes of a cluster. An RDD is fault tolerant and can be operated on in parallel. RDDs provide the abstraction for distributed computing across the nodes of a Spark cluster.
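
An RDD can be created either by parallelizing a collection in the driver or by referencing an external dataset. A minimal sketch, assuming jsc is an existing JavaSparkContext and the HDFS path is hypothetical:

// imports: org.apache.spark.api.java.JavaRDD; java.util.Arrays
// From a local collection in the driver
JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

// From an external file
JavaRDD<String> lines = jsc.textFile("hdfs://namenode:8020/data/input.txt");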

Q15 - What are RDD transformations?

Transformations are operations which create a new RDD from an existing RDD.

Some of the RDD transformations are listed below (an example follows the list) :

  • map
  • filter
  • union
  • intersection
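
Continuing with the numbers RDD from the sketch under Q14, map and filter could be used like this (a minimal sketch):

// map : apply a function to every element, producing a new RDD
JavaRDD<Integer> squares = numbers.map(n -> n * n);

// filter : keep only the elements that satisfy a predicate
JavaRDD<Integer> evenSquares = squares.filter(n -> n % 2 == 0);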

Q16 - What are RDD actions?

Actions are operations which run a computation on an RDD and return a result to the driver program (or write it to storage).

Some of the RDD actions are listed below; a continuation of the previous sketch follows the list :

  • reduce
  • collect
  • count
  • countByKey
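
Continuing with the evenSquares RDD from the sketch under Q15 (a minimal sketch; note that nothing is computed until these actions are called):

// imports: java.util.List
// count : number of elements in the RDD
long howMany = evenSquares.count();

// reduce : aggregate the elements with an associative function
int sum = evenSquares.reduce((a, b) -> a + b);

// collect : bring all the elements back to the driver as a local list
List<Integer> local = evenSquares.collect();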

Q17 - What do you mean by ‘RDD Transformations are lazy’?

Transformations on an RDD are not actually executed until an action is encountered. Hence, RDD Transformations are called lazy.

Q18 - What do you know about RDD Persistence?

Persistence means caching. When an RDD is persisted, each node in the cluster stores the partitions of the RDD that it computes in memory (RAM) and reuses them in subsequent operations on that RDD. When an RDD is used in multiple transformations or actions, persistence cuts down the latency by avoiding reloading the data from file storage into memory each time.

Q19 - What is the difference between cache() and persist() for an RDD?

cache() uses default storage level, i.e., MEMORY_ONLY.

persist() can be provided with any of the possible storage levels.
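
A brief illustration, assuming lines and errors are existing JavaRDDs (the names are hypothetical):

// imports: org.apache.spark.storage.StorageLevel
// cache() uses the default storage level, MEMORY_ONLY
lines.cache();

// persist() lets you choose the storage level explicitly (a level can be set only once per RDD)
errors.persist(StorageLevel.MEMORY_AND_DISK());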

Q20 - What do you mean by the default storage level: MEMORY_ONLY?

The default storage level, MEMORY_ONLY, means the RDD is stored as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they’re needed.

Q21 - What is a Sliding Window Operation?

Spark Streaming provides windowed computations, in which a transformation is applied over a sliding window of data rather than over a single batch. A window is defined by two parameters: the window length (the duration of the window) and the sliding interval (how often the windowed computation is performed), both of which must be multiples of the batch interval. As the window slides over the source DStream, the RDDs that fall within the window are combined and operated upon.
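
A minimal sketch of a windowed word count, assuming a 10-second batch interval and a hypothetical socket source on localhost:9999 (the window is 30 seconds long and slides every 10 seconds):

// imports: org.apache.spark.SparkConf; org.apache.spark.streaming.Durations;
//          org.apache.spark.streaming.api.java.*; scala.Tuple2; java.util.Arrays
SparkConf conf = new SparkConf().setAppName("WindowExample").setMaster("local[2]");
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

JavaDStream<String> words = ssc.socketTextStream("localhost", 9999)
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator());

JavaPairDStream<String, Integer> windowedCounts = words
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKeyAndWindow((a, b) -> a + b, Durations.seconds(30), Durations.seconds(10));

windowedCounts.print();
ssc.start();
ssc.awaitTermination();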

Medium Level Spark Interview Questions

Q22 - How does Spark Context in Spark Application pick the value for Spark Master?

That can be done in two ways.

1. Create a new SparkConf object and set the master using its setMaster() method. This Spark Configuration object is passed as an argument while creating the new Spark Context.

// Configure the application name, the master URL, and memory settings,
// then create the JavaSparkContext from this configuration.
SparkConf conf = new SparkConf().setAppName("JavaKMeansExample")
        .setMaster("local[2]")                 // run locally using 2 threads
        .set("spark.executor.memory", "3g")
        .set("spark.driver.memory", "3g");

JavaSparkContext jsc = new JavaSparkContext(conf);

2. The <apache-installation-directory>/conf/spark-env.sh file, located locally on the machine, contains the Spark Environment configuration. The Spark Master is one of the parameters that can be provided in this configuration file.

Q23 - How do you configure Spark Application?

A Spark Application can be configured using properties that are set directly on the SparkConf object passed to the SparkContext during initialization.

Following are the properties that can be configured for a Spark Application; a configuration sketch follows the list.

  • Spark Application Name
  • Number of Spark Driver Cores
  • Spark Driver’s Maximum Result Size
  • Spark Driver’s Memory
  • Spark Executors’ Memory
  • Spark Extra Listeners
  • Spark Local Directory
  • Log Spark Configuration
  • Spark Master
  • Deploy Mode of Spark Driver
  • Log Application Information
  • Spark Driver Supervise Action
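
For example, a few of these properties map onto SparkConf keys as follows. This is a minimal sketch: the application name, master URL, and values shown are arbitrary.

// imports: org.apache.spark.SparkConf
SparkConf conf = new SparkConf()
        .setAppName("MySparkApp")                      // Spark Application Name
        .setMaster("spark://master-host:7077")         // Spark Master (hypothetical URL)
        .set("spark.driver.cores", "2")                // Number of Spark Driver Cores
        .set("spark.driver.maxResultSize", "1g")       // Spark Driver's Maximum Result Size
        .set("spark.driver.memory", "2g")              // Spark Driver's Memory
        .set("spark.executor.memory", "4g")            // Spark Executors' Memory
        .set("spark.local.dir", "/tmp/spark-scratch")  // Spark Local Directory
        .set("spark.logConf", "true");                 // Log Spark Configuration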

Reference : Configure Spark Application

Q24 - What is the use of Spark Environment Parameters? How do you configure those?

Spark Environment Parameters are per-machine settings that affect the behavior and memory usage of the nodes in a cluster.

These parameters could be configured using the local config file spark-env.sh located at <apache-installation-directory>/conf/spark-env.sh.

Reference: Configure Spark Ecosystem

Q25 - How do you establish a connection from Apache Mesos to Apache Spark?

A connection to Mesos could be established in two ways.

  • The Spark Driver program has to be configured to connect to Mesos, and the Spark binaries must be placed in a location accessible by Mesos.
  • The other way is to install Spark in the same location on all the Mesos agents and configure the property spark.mesos.executor.home to point to the directory where Spark is installed.

Q26 - How do you minimize data transfers between nodes in a cluster?

Minimizing shuffles minimizes data transfer between nodes and thus makes processing faster. Some practices reduce the use of shuffling operations: use broadcast variables when joining a small dataset with a large RDD, so that the small dataset is shipped to each node only once instead of being shuffled; and use accumulators to update variable values on the workers in parallel.
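
A minimal broadcast sketch, assuming jsc is an existing JavaSparkContext (the lookup table and RDD contents are hypothetical):

// imports: org.apache.spark.broadcast.Broadcast; org.apache.spark.api.java.JavaRDD;
//          java.util.Arrays, HashMap, Map
Map<String, String> countryByCode = new HashMap<>();
countryByCode.put("IN", "India");
countryByCode.put("US", "United States");

// Ship the small lookup table to every node once, instead of shuffling it in a join
Broadcast<Map<String, String>> countries = jsc.broadcast(countryByCode);

JavaRDD<String> codes = jsc.parallelize(Arrays.asList("IN", "US", "IN"));
JavaRDD<String> names = codes.map(code -> countries.value().getOrDefault(code, "unknown"));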

Q27 - To run Spark Applications, should we install Spark on all the nodes of a YARN cluster?

Spark programs can be executed on top of YARN, so there is no need to install Spark on all the nodes of a YARN cluster to run Spark Applications.

Q28 - To run Spark Applications, should we install Spark on all the nodes of a Mesos cluster?

Spark programs can be executed on top of Mesos, so there is no need to install Spark on all the nodes of a Mesos cluster to run Spark Applications.

Q29 - What is the usage of GraphX module in Spark?

GraphX is Spark’s graph processing library. It can be used to build and transform graphs and to run iterative graph computations on them. Many algorithms are available in the GraphX library; PageRank is one of them.

Q30 - How does Spark handle distributed processing?

Spark provides an abstraction over distributed processing through the Spark RDD API. A general user does not need to worry about how data is processed across a distributed cluster. There are some exceptions, though: when you optimize an application for performance, you should understand which transformations and actions require data transfer between the nodes.

Advanced Level Spark Interview Questions