Compare Hadoop and Spark.
“Single cook cooking an entree is regular computing. Hadoop is multiple cooks cooking an entree into pieces and letting
each cook her piece.
Each cook has a separate stove and a food shelf. The first cook cooks the meat, the second cook cooks the sauce. This
phase is called “Map”. At the end the main cook assembles the complete entree. This is called “Reduce”. For Hadoop, the
cooks are not allowed to keep things on the stove between operations. Each time you make a particular operation, the
cook puts results on the shelf. This slows things down.
For Spark, the cooks are allowed to keep things on the stove between operations. This speeds things up. Finally, for
Hadoop the recipes are written in a language which is illogical and hard to understand. For Spark, the recipes are nicely
written.” – Stan Kladko, Galactic Exchange.io
Spark is one of the most successful projects in the Apache Software Foundation. Spark has clearly evolved as the market
leader for Big Data processing. Many organizations run Spark on clusters with thousands of nodes. Today, Spark is being
adopted by major players like Amazon, eBay, and Yahoo!
The key features of Apache Spark are:
1. Polyglot
2. Speed
3. Multiple Format Support
4. Lazy Evaluation
5. Real Time Computation
6. Hadoop Integration
7. Machine Learning
1. Polyglot: Spark provides high-level APIs in Java, Scala, Python and R. Spark code can be written in any of these four
languages. It provides a shell in Scala and Python. The Scala shell can be accessed through ./bin/spark-shell and
the Python shell through ./bin/pyspark from the installation directory.
2. Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. Spark is able to
achieve this speed through controlled partitioning. It manages data using partitions that help parallelize distributed
data processing with minimal network traffic.
3. Multiple Formats: Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra. The Data
Sources API provides a pluggable mechanism for accessing structured data through Spark SQL. Data sources can be
more than just simple pipes that convert data and pull it into Spark.
4. Lazy Evaluation: Apache Spark delays its evaluation until it is absolutely necessary. This is one of the key factors
contributing to its speed. Spark adds transformations to a DAG of computation, and only when the driver requests some
data does this DAG actually get executed (see the sketch after this list).
5. Real Time Computation: Spark’s computation is real-time and has low latency because of its in-memory computation.
Spark is designed for massive scalability; the Spark team has documented users of the system running production
clusters with thousands of nodes, and Spark supports several computational models.
6. Hadoop Integration: Apache Spark provides smooth compatibility with Hadoop. This is a great boon for all the Big
Data engineers who started their careers with Hadoop. Spark is a potential replacement for the MapReduce functions
of Hadoop, and it can run on top of an existing Hadoop cluster using YARN for resource scheduling.
7. Machine Learning: Spark’s MLlib is the machine learning component which is handy when it comes to big data
processing. It eradicates the need to use multiple tools, one for processing and one for machine learning. Spark
provides data engineers and data scientists with a powerful, unified engine that is both fast and easy to use.
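To make point 4 (Lazy Evaluation) concrete, here is a minimal sketch you could type into spark-shell; the file path and RDD names are hypothetical. Nothing is computed while the transformations are declared — Spark only records them in the DAG, and execution starts when an action such as count() is called.

val lines  = sc.textFile("hdfs:///data/events.log")   // transformation: nothing is read yet
val errors = lines.filter(_.contains("ERROR"))        // transformation: only added to the DAG
val n      = errors.count()                           // action: the whole DAG is executed now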
4. What are the languages supported by Apache Spark and which is the most popular one?
Apache Spark supports the following four languages: Scala, Java, Python and R. Among these languages, Scala and Python
have interactive shells for Spark. The Scala shell can be accessed through ./bin/spark-shell and the Python shell
through ./bin/pyspark. Scala is the most widely used of the four, largely because Spark itself is written in Scala.
1. Due to the availability of in-memory processing, Spark performs processing around 10 to 100 times faster than
Hadoop MapReduce, which makes use of persistent storage for all of its data processing tasks.
2. Unlike Hadoop, Spark provides inbuilt libraries to perform multiple tasks from the same core, such as batch processing,
streaming, machine learning and interactive SQL queries. Hadoop, however, only supports batch processing.
3. Hadoop is highly disk-dependent whereas Spark promotes caching and in-memory data storage.
4. Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation,
whereas Hadoop has no built-in support for iterative computing (see the caching sketch after this list).
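A minimal sketch of the caching and iteration difference; the dataset path and per-pass logic are hypothetical. The RDD is cached in memory once and then reused across several passes, instead of being re-read from disk on every iteration as a MapReduce-style job would.

val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache()                                      // keep the parsed data in memory

for (i <- 1 to 10) {                            // several passes over the same cached dataset
  val total = points.map(_.sum).reduce(_ + _)   // each pass reuses the in-memory data
  println(s"iteration $i, total = $total")
}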
6. What is YARN?
Similar to Hadoop, YARN is one of the key features in Spark, providing a central resource management platform to
deliver scalable operations across the cluster. YARN is a distributed container manager, like Mesos for example, whereas
Spark is a data processing tool. Spark can run on YARN, the same way Hadoop MapReduce can run on YARN. Running
Spark on YARN requires a binary distribution of Spark that is built with YARN support.
No, because Spark runs on top of YARN and runs independently of where it is installed. Spark has some options to use
YARN when dispatching jobs to the cluster, rather than its own built-in manager or Mesos. Further, there are some
configurations for running on YARN. They include master, deploy-mode, driver-memory, executor-memory, executor-cores,
and queue.
RDDs are basically parts of data that are stored in memory distributed across many nodes. RDDs are lazily evaluated in
Spark, and this lazy evaluation is what contributes to Spark’s speed. There are two types of RDDs:
1. Parallelized Collections: existing collections parallelized so that their elements run in parallel with one another.
2. Hadoop Datasets: they perform functions on each file record in HDFS or other storage systems.
An RDD can also be created by loading an external dataset from external storage like HDFS, HBase or a shared file system.
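A minimal sketch of the two creation routes (the values and the path are hypothetical):

val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))      // 1. parallelizing a collection in the driver program
val logs    = sc.textFile("hdfs:///data/app.log")     // 2. loading an external dataset from storage such as HDFS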
Transformations: Transformations create a new RDD from an existing one, e.g. map, reduceByKey and filter.
Transformations are executed on demand, which means they are computed lazily.
Actions: Actions return the final results of RDD computations. An action triggers execution using the lineage graph to load
the data into the original RDD, carry out all intermediate transformations, and return the final results to the driver program
or write them out to the file system.
reduce() is an action that applies the passed function repeatedly until only one value is left. The take(n) action returns the
first n elements of the RDD to the local (driver) node.
moviesData.saveAsTextFile("MoviesData.txt")
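A minimal sketch tying transformations and actions together; the moviesData file, its contents and the output path are hypothetical:

val moviesData = sc.textFile("hdfs:///data/movies.txt")          // create the RDD
val comedies   = moviesData.filter(_.contains("Comedy"))         // transformation (lazy)
val titles     = comedies.map(_.split(",")(0))                   // transformation (lazy)

titles.take(5).foreach(println)                                  // action: first 5 elements to the driver
val longest = titles.reduce((a, b) => if (a.length > b.length) a else b)  // action: repeated pairwise reduce
comedies.saveAsTextFile("hdfs:///out/ComedyMovies")              // action: write results to storage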
Special operations can be performed on RDDs in Spark using key/value pairs and such RDDs are referred to as Pair RDDs.
Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that collects data based on each
key and a join() method that combines different RDDs together, based on the elements having the same key.
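A minimal sketch of pair-RDD operations; the data is hypothetical:

val sales   = sc.parallelize(Seq(("US", 10), ("IN", 5), ("US", 7)))
val regions = sc.parallelize(Seq(("US", "Americas"), ("IN", "Asia")))

val totals = sales.reduceByKey(_ + _)       // collects values per key: ("US", 17), ("IN", 5)
val joined = totals.join(regions)           // combines RDDs on matching keys
joined.collect().foreach(println)           // e.g. (US,(17,Americas))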
The property graph is a directed multi-graph which can have multiple edges in parallel. Every edge and vertex has user-
defined properties associated with it. Here, the parallel edges allow multiple relationships between the same vertices. At a
high-level, GraphX extends the Spark RDD abstraction by introducing the Resilient Distributed Property Graph: a directed
multigraph with properties attached to each vertex and edge.
To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and
mapReduceTriplets) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of
graph algorithms and builders to simplify graph analytics tasks.
GraphX comes with static and dynamic implementations of PageRank as methods on the PageRank Object. Static
PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge (i.e., stop changing
by more than a specified tolerance). GraphOps allows calling these algorithms directly as methods on Graph.
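A minimal GraphX sketch showing both PageRank variants mentioned above; the vertices, edges and tolerance value are hypothetical:

import org.apache.spark.graphx._

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
val graph    = Graph(vertices, edges)

val dynamicRanks = graph.pageRank(0.0001).vertices     // runs until ranks converge within the tolerance
val staticRanks  = graph.staticPageRank(10).vertices   // runs a fixed number of iterations
dynamicRanks.collect().foreach(println)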
MLlib is Spark’s scalable machine learning library. It aims at making machine learning easy and scalable, with common
learning algorithms and use cases like clustering, regression, collaborative filtering, dimensionality reduction, and the like.
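A minimal MLlib clustering sketch; the input file, feature format and parameters are hypothetical:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs:///data/features.txt")
  .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
  .cache()

val model = KMeans.train(data, 3, 20)          // k = 3 clusters, 20 iterations
println(model.clusterCenters.mkString("\n"))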
Spark SQL integrates relational processing with Spark’s functional programming. Further, it provides support for various
data sources and makes it possible to weave SQL queries with code transformations thus resulting in a very powerful tool.
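A minimal sketch of weaving SQL with code transformations, assuming the built-in spark session in spark-shell; the people.json file and its columns are hypothetical:

val people = spark.read.json("hdfs:///data/people.json")    // load a structured data source
people.createOrReplaceTempView("people")                    // expose it to SQL

val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.filter("age < 65").show()                            // mix SQL results with DataFrame code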
Parquet is a columnar format, supported by many data processing systems. The advantages of columnar storage are as
follows (a short Parquet read/write sketch follows the list):
1. Columnar storage limits IO operations.
2. It can fetch specific columns that you need to access.
3. Columnar storage consumes less space.
4. It gives better-summarized data and follows type-specific encoding.
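A minimal sketch of reading and writing Parquet through Spark SQL; the paths and column names are hypothetical:

val users = spark.read.json("hdfs:///data/users.json")
users.write.parquet("hdfs:///data/users.parquet")            // store in the columnar Parquet format

val parquetUsers = spark.read.parquet("hdfs:///data/users.parquet")
parquetUsers.select("name", "country").show()                // only the needed columns are read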
1. HDFS: Spark can run on top of HDFS to leverage the distributed replicated storage.
2. MapReduce: Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing
framework.
3. YARN: Spark applications can also be run on YARN (Hadoop NextGen).
4. Batch & Real Time Processing: MapReduce and Spark are used together where MapReduce is used for batch
processing and Spark for real-time processing.
When SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster. Executors are Spark
processes that run computations and store the data on the worker nodes. The final tasks are transferred by SparkContext
to the executors for execution.
A worker node is basically the slave node: the master node assigns work and the worker node actually performs the
assigned tasks. Worker nodes process the data stored on the node and report the resources to the master. Based on
resource availability, the master schedules tasks.
1. Since Spark utilizes more storage space than Hadoop MapReduce, certain problems may arise.
2. Developers need to be careful while running their applications in Spark.
3. Instead of running everything on a single node, the work must be distributed over multiple clusters.
4. Spark’s “in-memory” capability can become a bottleneck when it comes to cost-efficient processing of big data.
5. Spark consumes a huge amount of memory when compared to Hadoop.
34. List some use cases where Spark outperforms Hadoop in processing.
1. Sensor Data Processing: Apache Spark’s “In-memory” computing works best here, as data is retrieved and combined
from different sources.
2. Real Time Processing: Spark is preferred over Hadoop for real-time querying of data, e.g. stock market analysis,
banking, healthcare, telecommunications, etc.
3. Stream Processing: For processing logs and detecting frauds in live streams for alerts, Apache Spark is the best
solution.
4. Big Data Processing: Spark runs up to 100 times faster than Hadoop when it comes to processing medium and large-
sized datasets.
Vectors.sparse(7, Array(0,1,2,3,4,5,6), Array(1650d, 50000d, 800d, 3.0, 3.0, 2009, 95054))
36. Can you use Spark to access and analyze data stored in Cassandra databases?
Yes, it is possible if you use the Spark Cassandra Connector. To connect Spark to a Cassandra cluster, a Cassandra Connector
will need to be added to the Spark project. In the setup, a Spark executor will talk to a local Cassandra node and will only
query for local data. It makes queries faster by reducing the usage of the network to send data between Spark executors
(to process data) and Cassandra nodes (where data lives).
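A minimal sketch, assuming the DataStax spark-cassandra-connector package is on the classpath; the keyspace, table and filter are hypothetical:

import com.datastax.spark.connector._                             // from the spark-cassandra-connector package

val readings = sc.cassandraTable("sensor_keyspace", "readings")   // RDD backed by a Cassandra table
val recent   = readings.where("day = '2019-05-01'")               // pushes the filter down to Cassandra
println(recent.count())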
39. How can you minimize data transfers when working with Spark?
Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner. The
various ways in which data transfers can be minimized when working with Apache Spark are:
1. Using Broadcast Variable- Broadcast variable enhances the efficiency of joins between small and large RDDs.
2. Using Accumulators – Accumulators help update the values of variables in parallel while executing.
The most common way is to avoid the ByKey operations, repartition, or any other operations that trigger shuffles.
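A minimal accumulator sketch illustrating point 2 above; the input file, record layout and counter name are hypothetical:

val badRecords = sc.longAccumulator("badRecords")                 // shared counter updated in parallel by tasks

val parsed = sc.textFile("hdfs:///data/input.csv").flatMap { line =>
  val fields = line.split(",")
  if (fields.length == 3) Some(fields) else { badRecords.add(1); None }
}
parsed.count()                                                    // run an action so the accumulator gets updated
println(s"bad records: ${badRecords.value}")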
42. Why is there a need for broadcast variables when working with Apache Spark?
Broadcast variables are read-only variables, cached in memory on every machine. When working with Spark, usage
of broadcast variables eliminates the necessity to ship copies of a variable for every task, so data can be processed faster.
Broadcast variables help in storing a lookup table inside the memory which enhances the retrieval efficiency when
compared to an RDD lookup().
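A minimal sketch of the lookup-table use mentioned above; the table contents and data are hypothetical:

val countryNames = Map("US" -> "United States", "IN" -> "India")
val lookup = sc.broadcast(countryNames)                      // shipped once per executor, not once per task

val orders   = sc.parallelize(Seq(("US", 100.0), ("IN", 42.0)))
val labelled = orders.map { case (code, amount) => (lookup.value.getOrElse(code, "Unknown"), amount) }
labelled.collect().foreach(println)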
43. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different
batches and writing the intermediary results to the disk.
DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams support two kinds of
operations:
1. Transformations, which produce a new DStream from an existing one.
2. Output operations, which write data out to an external system.
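A minimal Spark Streaming sketch showing a transformation followed by an output operation; the socket source, port and batch interval are hypothetical:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(sc, Seconds(5))                          // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)                       // DStream from a socket source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)   // transformations on the DStream
counts.print()                                                            // output operation

ssc.start()
ssc.awaitTermination()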