Compare Hadoop and Spark.
“Single cook cooking an entree is regular computing. Hadoop is multiple cooks cooking an entree into pieces and letting
each cook her piece.
Each cook has a separate stove and a food shelf. The first cook cooks the meat, the second cook cooks the sauce. This
phase is called “Map”. At the end the main cook assembles the complete entree. This is called “Reduce”. For Hadoop, the
cooks are not allowed to keep things on the stove between operations. Each time you make a particular operation, the
cook puts results on the shelf. This slows things down.
For Spark, the cooks are allowed to keep things on the stove between operations. This speeds things up. Finally, for
Hadoop the recipes are written in a language which is illogical and hard to understand. For Spark, the recipes are nicely
written.” – Stan Kladko, Galactic Exchange.io
Spark is one of the most successful projects in the Apache Software Foundation. Spark has clearly evolved as the market
leader for Big Data processing. Many organizations run Spark on clusters with thousands of nodes. Today, Spark is being
adopted by major players like Amazon, eBay, and Yahoo!
The key features of Apache Spark are:
1. Polyglot
2. Speed
3. Multiple Format Support
4. Lazy Evaluation
5. Real Time Computation
6. Hadoop Integration
7. Machine Learning
1. Polyglot: Spark provides high-level APIs in Java, Scala, Python and R. Spark code can be written in any of these four
languages. It provides a shell in Scala and Python. The Scala shell can be accessed through ./bin/spark-shell and
the Python shell through ./bin/pyspark from the installation directory.
2. Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. Spark is able to
achieve this speed through controlled partitioning. It manages data using partitions that help parallelize distributed
data processing with minimal network traffic.
3. Multiple Formats: Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra. The Data
Sources API provides a pluggable mechanism for accessing structured data through Spark SQL. Data sources can be
more than just simple pipes that convert data and pull it into Spark.
4. Lazy Evaluation: Apache Spark delays its evaluation until it is absolutely necessary. This is one of the key factors
contributing to its speed. Spark adds transformations to a DAG of computation, and only when the driver requests some
data does this DAG actually get executed (see the sketch after this list).
5. Real Time Computation: Spark’s computation is real-time and has low latency because of its in-memory computation.
Spark is designed for massive scalability; the Spark team has documented users of the system running production
clusters with thousands of nodes, and Spark supports several computational models.
6. Hadoop Integration: Apache Spark provides smooth compatibility with Hadoop. This is a great boon for all the Big
Data engineers who started their careers with Hadoop. Spark is a potential replacement for the MapReduce functions
of Hadoop, and it can run on top of an existing Hadoop cluster using YARN for resource scheduling.
7. Machine Learning: Spark’s MLlib is the machine learning component which is handy when it comes to big data
processing. It eradicates the need to use multiple tools, one for processing and one for machine learning. Spark
provides data engineers and data scientists with a powerful, unified engine that is both fast and easy to use.
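To make point 4 (Lazy Evaluation) concrete, here is a minimal sketch you could type into spark-shell; the file path and RDD names are hypothetical. Nothing is computed while the transformations are declared — Spark only records them in the DAG, and execution starts when an action such as count() is called.

val lines  = sc.textFile("hdfs:///data/events.log")   // transformation: nothing is read yet
val errors = lines.filter(_.contains("ERROR"))        // transformation: only added to the DAG
val n      = errors.count()                           // action: the whole DAG is executed now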
4. What are the languages supported by Apache Spark and which is the most popular one?
Apache Spark supports the following four languages: Scala, Java, Python and R. Among these languages, Scala and Python
have interactive shells for Spark. The Scala shell can be accessed through ./bin/spark-shell and the Python shell
through ./bin/pyspark. Scala is the most widely used of the four, largely because Spark itself is written in Scala.
1. Due to the availability of in-memory processing, Spark performs processing around 10 to 100 times faster than
Hadoop MapReduce, which makes use of persistent storage for all of its data processing tasks.
2. Unlike Hadoop, Spark provides inbuilt libraries to perform multiple tasks from the same core, such as batch processing,
streaming, machine learning and interactive SQL queries. Hadoop, however, only supports batch processing.
3. Hadoop is highly disk-dependent whereas Spark promotes caching and in-memory data storage.
4. Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation,
whereas Hadoop has no built-in support for iterative computing (see the caching sketch after this list).
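A minimal sketch of the caching and iteration difference; the dataset path and per-pass logic are hypothetical. The RDD is cached in memory once and then reused across several passes, instead of being re-read from disk on every iteration as a MapReduce-style job would.

val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache()                                      // keep the parsed data in memory

for (i <- 1 to 10) {                            // several passes over the same cached dataset
  val total = points.map(_.sum).reduce(_ + _)   // each pass reuses the in-memory data
  println(s"iteration $i, total = $total")
}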
6. What is YARN?
Similar to Hadoop, YARN is one of the key features in Spark, providing a central resource management platform to
deliver scalable operations across the cluster. YARN is a distributed container manager, like Mesos for example, whereas
Spark is a data processing tool. Spark can run on YARN, the same way Hadoop MapReduce can run on YARN. Running
Spark on YARN requires a binary distribution of Spark that is built with YARN support.
No, because Spark runs on top of YARN and runs independently of where it is installed. Spark has some options to use
YARN when dispatching jobs to the cluster, rather than its own built-in manager or Mesos. Further, there are some
configurations for running on YARN. They include master, deploy-mode, driver-memory, executor-memory, executor-cores,
and queue.
RDDs are basically parts of data that are stored in memory distributed across many nodes. RDDs are lazily evaluated in
Spark, and this lazy evaluation is what contributes to Spark’s speed. There are two types of RDDs:
1. Parallelized Collections: existing collections parallelized so that their elements run in parallel with one another.
2. Hadoop Datasets: they perform functions on each file record in HDFS or other storage systems.
An RDD can also be created by loading an external dataset from external storage like HDFS, HBase or a shared file system.
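A minimal sketch of the two creation routes (the values and the path are hypothetical):

val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))      // 1. parallelizing a collection in the driver program
val logs    = sc.textFile("hdfs:///data/app.log")     // 2. loading an external dataset from storage such as HDFS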
Transformations: Transformations create a new RDD from an existing one, e.g. map, reduceByKey and filter.
Transformations are executed on demand, which means they are computed lazily.
Actions: Actions return the final results of RDD computations. An action triggers execution using the lineage graph to load
the data into the original RDD, carry out all intermediate transformations, and return the final results to the driver program
or write them out to the file system.
reduce() is an action that applies the passed function repeatedly until only one value is left. The take(n) action returns the
first n elements of the RDD to the local (driver) node.
moviesData.saveAsTextFile("MoviesData.txt")
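A minimal sketch tying transformations and actions together; the moviesData file, its contents and the output path are hypothetical:

val moviesData = sc.textFile("hdfs:///data/movies.txt")          // create the RDD
val comedies   = moviesData.filter(_.contains("Comedy"))         // transformation (lazy)
val titles     = comedies.map(_.split(",")(0))                   // transformation (lazy)

titles.take(5).foreach(println)                                  // action: first 5 elements to the driver
val longest = titles.reduce((a, b) => if (a.length > b.length) a else b)  // action: repeated pairwise reduce
comedies.saveAsTextFile("hdfs:///out/ComedyMovies")              // action: write results to storage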
Special operations can be performed on RDDs in Spark using key/value pairs and such RDDs are referred to as Pair RDDs.
Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that collects data based on each
key and a join() method that combines different RDDs together, based on the elements having the same key.
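A minimal sketch of pair-RDD operations; the data is hypothetical:

val sales   = sc.parallelize(Seq(("US", 10), ("IN", 5), ("US", 7)))
val regions = sc.parallelize(Seq(("US", "Americas"), ("IN", "Asia")))

val totals = sales.reduceByKey(_ + _)       // collects values per key: ("US", 17), ("IN", 5)
val joined = totals.join(regions)           // combines RDDs on matching keys
joined.collect().foreach(println)           // e.g. (US,(17,Americas))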
The property graph is a directed multi-graph which can have multiple edges in parallel. Every edge and vertex has user-
defined properties associated with it. Here, the parallel edges allow multiple relationships between the same vertices. At a
high-level, GraphX extends the Spark RDD abstraction by introducing the Resilient Distributed Property Graph: a directed
multigraph with properties attached to each vertex and edge.
To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and
mapReduceTriplets) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of
graph algorithms and builders to simplify graph analytics tasks.
GraphX comes with static and dynamic implementations of PageRank as methods on the PageRank Object. Static
PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge (i.e., stop changing
by more than a specified tolerance). GraphOps allows calling these algorithms directly as methods on Graph.
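A minimal GraphX sketch showing both PageRank variants mentioned above; the vertices, edges and tolerance value are hypothetical:

import org.apache.spark.graphx._

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
val graph    = Graph(vertices, edges)

val dynamicRanks = graph.pageRank(0.0001).vertices     // runs until ranks converge within the tolerance
val staticRanks  = graph.staticPageRank(10).vertices   // runs a fixed number of iterations
dynamicRanks.collect().foreach(println)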
MLlib is Spark’s scalable machine learning library. It aims at making machine learning easy and scalable, with common
learning algorithms and use cases like clustering, regression, collaborative filtering, dimensionality reduction, and the like.
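A minimal MLlib clustering sketch; the input file, feature format and parameters are hypothetical:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs:///data/features.txt")
  .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
  .cache()

val model = KMeans.train(data, 3, 20)          // k = 3 clusters, 20 iterations
println(model.clusterCenters.mkString("\n"))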
Spark SQL integrates relational processing with Spark’s functional programming. Further, it provides support for various
data sources and makes it possible to weave SQL queries with code transformations thus resulting in a very powerful tool.
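A minimal sketch of weaving SQL with code transformations, assuming the built-in spark session in spark-shell; the people.json file and its columns are hypothetical:

val people = spark.read.json("hdfs:///data/people.json")    // load a structured data source
people.createOrReplaceTempView("people")                    // expose it to SQL

val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.filter("age < 65").show()                            // mix SQL results with DataFrame code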
Parquet is a columnar format, supported by many data processing systems. The advantages of columnar storage are as
follows (a short Parquet read/write sketch follows the list):
1. Columnar storage limits IO operations.
2. It can fetch specific columns that you need to access.
3. Columnar storage consumes less space.
4. It gives better-summarized data and follows type-specific encoding.
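A minimal sketch of reading and writing Parquet through Spark SQL; the paths and column names are hypothetical:

val users = spark.read.json("hdfs:///data/users.json")
users.write.parquet("hdfs:///data/users.parquet")            // store in the columnar Parquet format

val parquetUsers = spark.read.parquet("hdfs:///data/users.parquet")
parquetUsers.select("name", "country").show()                // only the needed columns are read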
1. HDFS: Spark can run on top of HDFS to leverage the distributed replicated storage.
2. MapReduce: Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing
framework.
3. YARN: Spark applications can also be run on YARN (Hadoop NextGen).
4. Batch & Real Time Processing: MapReduce and Spark are used together where MapReduce is used for batch
processing and Spark for real-time processing.
When SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster. Executors are Spark
processes that run computations and store the data on the worker nodes. The final tasks are transferred by SparkContext
to the executors for execution.
A worker node is basically the slave node: the master node assigns work and the worker node actually performs the
assigned tasks. Worker nodes process the data stored on the node and report the resources to the master. Based on
resource availability, the master schedules tasks.
1. Since Spark utilizes more storage space than Hadoop MapReduce, certain problems may arise.
2. Developers need to be careful while running their applications in Spark.
3. Instead of running everything on a single node, the work must be distributed over multiple clusters.
4. Spark’s “in-memory” capability can become a bottleneck when it comes to cost-efficient processing of big data.
5. Spark consumes a huge amount of memory when compared to Hadoop.
34. List some use cases where Spark outperforms Hadoop in processing.
1. Sensor Data Processing: Apache Spark’s “In-memory” computing works best here, as data is retrieved and combined
from different sources.
2. Real Time Processing: Spark is preferred over Hadoop for real-time querying of data, e.g. stock market analysis,
banking, healthcare, telecommunications, etc.
3. Stream Processing: For processing logs and detecting frauds in live streams for alerts, Apache Spark is the best
solution.
4. Big Data Processing: Spark runs up to 100 times faster than Hadoop when it comes to processing medium and large-
sized datasets.
Vectors.sparse(7, Array(0,1,2,3,4,5,6), Array(1650d, 50000d, 800d, 3.0, 3.0, 2009, 95054))
36. Can you use Spark to access and analyze data stored in Cassandra databases?
Yes, it is possible if you use the Spark Cassandra Connector. To connect Spark to a Cassandra cluster, a Cassandra Connector
will need to be added to the Spark project. In the setup, a Spark executor will talk to a local Cassandra node and will only
query for local data. It makes queries faster by reducing the usage of the network to send data between Spark executors
(to process data) and Cassandra nodes (where data lives).
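A minimal sketch, assuming the DataStax spark-cassandra-connector package is on the classpath; the keyspace, table and filter are hypothetical:

import com.datastax.spark.connector._                             // from the spark-cassandra-connector package

val readings = sc.cassandraTable("sensor_keyspace", "readings")   // RDD backed by a Cassandra table
val recent   = readings.where("day = '2019-05-01'")               // pushes the filter down to Cassandra
println(recent.count())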
39. How can you minimize data transfers when working with Spark?
Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner. The
various ways in which data transfers can be minimized when working with Apache Spark are:
1. Using Broadcast Variable- Broadcast variable enhances the efficiency of joins between small and large RDDs.
2. Using Accumulators – Accumulators help update the values of variables in parallel while executing.
The most common way is to avoid the ByKey operations, repartition, or any other operations that trigger shuffles.
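A minimal accumulator sketch illustrating point 2 above; the input file, record layout and counter name are hypothetical:

val badRecords = sc.longAccumulator("badRecords")                 // shared counter updated in parallel by tasks

val parsed = sc.textFile("hdfs:///data/input.csv").flatMap { line =>
  val fields = line.split(",")
  if (fields.length == 3) Some(fields) else { badRecords.add(1); None }
}
parsed.count()                                                    // run an action so the accumulator gets updated
println(s"bad records: ${badRecords.value}")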
42. Why is there a need for broadcast variables when working with Apache Spark?
Broadcast variables are read-only variables, cached in memory on every machine. When working with Spark, usage
of broadcast variables eliminates the necessity to ship copies of a variable for every task, so data can be processed faster.
Broadcast variables help in storing a lookup table inside the memory which enhances the retrieval efficiency when
compared to an RDD lookup().
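A minimal sketch of the lookup-table use mentioned above; the table contents and data are hypothetical:

val countryNames = Map("US" -> "United States", "IN" -> "India")
val lookup = sc.broadcast(countryNames)                      // shipped once per executor, not once per task

val orders   = sc.parallelize(Seq(("US", 100.0), ("IN", 42.0)))
val labelled = orders.map { case (code, amount) => (lookup.value.getOrElse(code, "Unknown"), amount) }
labelled.collect().foreach(println)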
43. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different
batches and writing the intermediary results to the disk.
DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams support two kinds of
operations:
1. Transformations, which produce a new DStream from an existing one.
2. Output operations, which write data out to an external system.
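A minimal Spark Streaming sketch showing a transformation followed by an output operation; the socket source, port and batch interval are hypothetical:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(sc, Seconds(5))                          // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)                       // DStream from a socket source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)   // transformations on the DStream
counts.print()                                                            // output operation

ssc.start()
ssc.awaitTermination()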