Spark Interview Questions 04
Apache® Spark™ is a powerful open source processing engine built around speed, ease of use, and
sophisticated analytics. It was originally developed at UC Berkeley in 2009 and has become one of the most
rapidly adopted cluster-computing frameworks among enterprises in different industries across the globe.
As big data and analytics grow in importance, skilled professionals are in great demand, and you need to be
proficient in the tools and skills associated with the field.
As a big data expert, you are expected to have experience with some of the prominent tools in the
industry, including Apache Spark.
This article will help you crack an Apache Spark interview by covering some frequently asked
questions:
Ans. An RDD (Resilient Distributed Dataset) is a fault-tolerant collection of elements that can be operated
on in parallel. The data in an RDD is partitioned, immutable, and distributed across the cluster.
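Ans. There are primarily two types of RDD – parallelized collections, created from an existing collection
in the driver program, and Hadoop datasets, created from files in HDFS or another storage system.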
Ans. A sparse vector has two parallel arrays – one for indices and the other for values. Only the non-zero
entries are stored, which saves space.
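As a rough illustration of that layout, here is a plain-Python sketch. It is not Spark's own class: the name SparseVec and its methods are made up for this example, and it assumes the indices are given in increasing order.

```python
from bisect import bisect_left


class SparseVec:
    """Toy sparse vector: parallel lists of indices and values (hypothetical, not Spark's API)."""

    def __init__(self, size, indices, values):
        if len(indices) != len(values):
            raise ValueError("indices and values must have the same length")
        if list(indices) != sorted(set(indices)):
            raise ValueError("indices must be strictly increasing")
        if indices and (indices[0] < 0 or indices[-1] >= size):
            raise ValueError("indices must lie in [0, size)")
        self.size = size
        self.indices = list(indices)  # positions of the non-zero entries
        self.values = list(values)    # values[k] belongs to position indices[k]

    def __getitem__(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        # Binary search in the index array; anything not stored is zero.
        pos = bisect_left(self.indices, i)
        if pos < len(self.indices) and self.indices[pos] == i:
            return self.values[pos]
        return 0.0

    def to_dense(self):
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense


# A vector of length 6 with non-zero entries at positions 1 and 4.
vec = SparseVec(6, [1, 4], [3.0, 7.5])
```

In PySpark the corresponding constructor is Vectors.sparse(size, indices, values) from pyspark.ml.linalg, which takes the same three pieces of information.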
Q5. Mention some of the areas where Spark outperforms Hadoop in processing.
Ans. Sensor data processing, real-time querying of data, and stream processing.
Q6. What are the languages supported by Apache Spark and which is the most
popular one?
Ans. There are four languages supported by Apache Spark – Scala, Java, Python, and R. Scala is the most
popular one.
Ans. YARN is the cluster resource manager from Hadoop that Spark can run on, rather than a feature of Spark
itself. It provides a central resource management platform to deliver scalable operations across the cluster.
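Q8. Do you need to install Spark on all nodes of a YARN cluster? Why?
Ans. No. Spark runs on top of YARN, so it does not need to be installed on every node. The Spark libraries
are shipped to the YARN containers when an application is submitted.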
Ans. RDDs in Spark depend on one or more other RDDs. The representation of the dependencies
between RDDs is known as the lineage graph. Spark uses it to recompute lost partitions.
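The idea of a lineage graph can be sketched in plain Python. This is only a toy model under stated assumptions: ToyRDD and its methods are hypothetical names, each dataset has at most one parent, and nothing here is Spark's real implementation.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD that remembers how it was derived."""

    def __init__(self, name, parents=(), compute=None):
        self.name = name
        self.parents = tuple(parents)  # the datasets this one depends on
        self._compute = compute        # function applied to the parents' data

    def map(self, fn, name):
        return ToyRDD(name, [self], lambda data: [fn(x) for x in data])

    def filter(self, pred, name):
        return ToyRDD(name, [self], lambda data: [x for x in data if pred(x)])

    def collect(self):
        # No data is stored: results are rebuilt by walking the dependencies,
        # which is also how lost data could be recomputed.
        inputs = [p.collect() for p in self.parents]
        return self._compute(*inputs)

    def lineage(self):
        """Return the names from this dataset back to its source."""
        chain = [self.name]
        for p in self.parents:
            chain.extend(p.lineage())
        return chain


source = ToyRDD("source", compute=lambda: [1, 2, 3, 4, 5])
squares = source.map(lambda x: x * x, "squares")
even_squares = squares.filter(lambda x: x % 2 == 0, "even_squares")
```

On a real RDD, the toDebugString() method prints a description of the RDD and its recursive dependencies.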
Ans. A partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce. It is a logical
chunk of a large distributed data set. Partitioning is the process of deriving logical units of data so that
they can be processed in parallel, which speeds up processing.
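A minimal sketch of hash partitioning in plain Python follows. The function names are made up for this example, and the CRC32-based hash is an assumption chosen only because it is stable across runs; Spark uses its own partitioner classes.

```python
from zlib import crc32


def hash_partition(records, num_partitions):
    """Assign (key, value) records to partitions by a stable hash of the key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # crc32 is used instead of hash() because str hashes vary per process.
        index = crc32(str(key).encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


def sum_by_key(partition):
    """Work that can run on one partition without looking at the others."""
    totals = {}
    for key, value in partition:
        totals[key] = totals.get(key, 0) + value
    return totals


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
parts = hash_partition(records, 3)

# Every record with the same key lands in the same partition, so each
# partition can be aggregated independently (and, in a cluster, in parallel).
per_partition_totals = [sum_by_key(p) for p in parts]
```

Because equal keys always hash to the same partition, the per-partition totals never overlap and can simply be combined.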
Ans. A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets (RDDs) that represents a
stream of data.
Ans. Catalyst is the optimization framework in Spark SQL. It allows Spark to automatically transform SQL
queries, and new optimizations can be added to it, to build a faster processing system.
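To give a feel for what "automatically transforming a query" means, here is a toy rewrite rule in plain Python. Catalyst itself is written in Scala and works on Spark's own plan trees; the node classes Lit, Col, and Add and the function fold_constants are hypothetical and only mimic the general idea of rule-based optimization, using constant folding as the example rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    """A literal constant in an expression."""
    value: int


@dataclass(frozen=True)
class Col:
    """A reference to a column, unknown until the query runs."""
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(expr):
    """Rewrite rule: replace Add(Lit, Lit) with a single Lit, bottom-up."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return expr  # literals and columns are left unchanged


# Expression tree for: price + (1 + 2)
plan = Add(Col("price"), Add(Lit(1), Lit(2)))
optimized = fold_constants(plan)
```

The rewritten tree computes the same result but does the addition of the two constants once, before any rows are processed.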
Q14. What are Actions in Spark?
Ans. An action brings data from an RDD back to the local machine (the driver). Executing an action triggers
all the previously created transformations and returns their result.
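The split between lazy transformations and actions can be sketched in plain Python. LazyDataset is a hypothetical class written for this example, not part of Spark; it only records transformations and runs them when an action such as collect() or count() is called.

```python
class LazyDataset:
    """Hypothetical dataset: transformations are recorded, actions run them."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new dataset, does no work yet.
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new dataset, does no work yet.
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def _run(self):
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        # Action: executes every recorded step and returns the results.
        return list(self._run())

    def count(self):
        # Action: executes every recorded step and counts the results.
        return sum(1 for _ in self._run())


calls = []


def traced_double(x):
    calls.append(x)  # record each time the function really runs
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: nothing has executed yet
result = ds.collect()             # the action triggers the whole chain
```

Each action in this sketch re-runs the full chain from the source data, which is the cost that persisting a dataset is meant to avoid.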
Ans. Parquet is a columnar file format supported by many other data processing systems.
Ans. Spark uses GraphX for graph processing to build and transform interactive graphs.
Ans. Hadoop distributed file system (HDFS), local file system, and Amazon S3.
Ans.
Stateless Transformations – Processing of the batch does not depend on the output of the previous
batch. Examples – map(), reduceByKey(), filter().
Stateful Transformations – Processing of the batch depends on the intermediary results of the previous
batch. Examples – transformations that depend on sliding windows.
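The difference can be illustrated with a plain-Python sketch over a list of micro-batches. The data and the function names stateless_counts and windowed_counts are invented for this example, and the window here is measured in batches rather than in time.

```python
from collections import Counter, deque

# Four micro-batches of words arriving one after another (made-up data).
batches = [
    ["spark", "yarn", "spark"],
    ["hdfs", "spark"],
    ["yarn", "yarn"],
    ["spark"],
]


def stateless_counts(batch):
    """Stateless: the result depends only on the current batch."""
    return Counter(batch)


def windowed_counts(batches, window_length):
    """Stateful: each result also depends on the previous batches in the window."""
    window = deque(maxlen=window_length)  # remembers the most recent batches
    results = []
    for batch in batches:
        window.append(batch)
        total = Counter()
        for b in window:
            total.update(b)
        results.append(total)
    return results


per_batch = [stateless_counts(b) for b in batches]
per_window = windowed_counts(batches, window_length=2)
```

With a window of two batches, the second windowed result counts words from the first and second batches together, while the second stateless result sees only the second batch.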
Ans. persist() allows the user to specify the storage level whereas cache() uses the default storage level.
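The relationship between the two calls can be sketched in plain Python. CachableDataset is a hypothetical class for illustration only, and storage levels are represented as plain strings; the one assumption borrowed from Spark is that the default level for RDDs is MEMORY_ONLY.

```python
DEFAULT_LEVEL = "MEMORY_ONLY"  # default storage level for RDDs in Spark


class CachableDataset:
    """Hypothetical dataset that can keep its computed result around."""

    def __init__(self, compute):
        self._compute = compute
        self.storage_level = None  # None means "not persisted"
        self._cached = None
        self.compute_count = 0     # how many times the data was recomputed

    def persist(self, storage_level=DEFAULT_LEVEL):
        # The caller chooses where and how the data should be kept.
        self.storage_level = storage_level
        return self

    def cache(self):
        # cache() is just persist() with the default storage level.
        return self.persist(DEFAULT_LEVEL)

    def collect(self):
        if self.storage_level is not None and self._cached is not None:
            return self._cached
        self.compute_count += 1
        result = self._compute()
        if self.storage_level is not None:
            self._cached = result
        return result


cached = CachableDataset(lambda: [x * x for x in range(4)]).cache()
cached.collect()
cached.collect()  # served from the stored copy, no recomputation

on_disk = CachableDataset(lambda: [1, 2, 3]).persist("MEMORY_AND_DISK")

uncached = CachableDataset(lambda: [1, 2, 3])
uncached.collect()
uncached.collect()  # recomputed, because nothing was persisted
```

In this sketch the persisted dataset is computed once and the unpersisted one twice, which is the practical reason to call either method on data that is reused.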
Ans. SchemaRDD is an RDD that consists of row objects (wrappers around basic string or integer
arrays) with schema information about the type of data in each column. It was renamed DataFrame in
Spark 1.3.
These are some of the popular questions asked in an Apache Spark interview. Always be prepared to
answer all types of questions: technical, interpersonal, leadership, or methodology. If you have recently
started your career in big data, you can get certified in Apache Spark to gain the techniques and skills
required to become an expert in the field.