Spark questions

1. Introduction to Spark:

*Open-source.
*Distributed computing system designed for big data processing and analytics.
*Developed to handle large-scale data processing.

2. Key features of Apache Spark:

*Speed - lightning-fast data processing.
*Ease of use - provides APIs in multiple programming languages (Java, Scala, Python and R).
*Can use various libraries such as SQL, machine learning and stream processing.
*Supports complex analytics beyond map and reduce operations.
*Has built-in libraries for ML and Spark SQL.
*Real-time processing.
*Fault tolerance - automatically recovers lost data in case of node failures by using Resilient Distributed Datasets (RDDs).
*Integration - integrates with big data tools and technologies such as Hadoop HDFS, Amazon S3, Kubernetes and Apache Mesos.

3. Core components of Spark:

*Spark Core - provides basic functionality such as task scheduling, memory management, fault recovery and interacting with storage systems. Also responsible for loading and processing data.
*Spark SQL
*Spark Streaming
*MLlib
*GraphX - allows users to model and perform computations on graphs, for example social networks such as Facebook and LinkedIn.
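
All of these components are reached from a single entry point. A minimal PySpark sketch (the application name and data are illustrative, not from the text):

```python
from pyspark.sql import SparkSession

# One SparkSession exposes the main components.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("components-sketch")
         .getOrCreate())

sc = spark.sparkContext            # Spark Core: low-level RDD API
df = spark.range(5)                # Spark SQL: DataFrame API
df.createOrReplaceTempView("nums")
spark.sql("SELECT id FROM nums WHERE id > 2").show()

# Spark Streaming, MLlib and GraphX build on the same engine;
# later sections show Spark SQL and MLlib in more detail.
spark.stop()
```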

4. **Why Spark?**

- **Speed:** Spark significantly speeds up data processing, running applications up to 100 times faster in memory and 10 times faster on disk by minimizing disk read/write operations and storing intermediate data in memory.
- **Hadoop Integration:** It enhances the Hadoop framework, which is based on the
MapReduce model, by addressing concerns related to query and execution wait times for
large datasets.
- **Multi-language Support:** Spark offers built-in APIs for Java, Scala, and Python, allowing
developers to write applications in various languages.
- **Advanced Analytics:** It supports SQL queries, streaming data, machine learning, and
graph algorithms, making it a versatile tool for complex data analysis.

5. Features of Apache Spark

• Speed
• Reusability
• In-memory computing
• Advanced analytics
• Real-time stream processing
• Lazy evaluation (see the sketch after this list)
• Dynamic in nature
• Fault tolerance
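
A small sketch of how lazy evaluation behaves; this is a hypothetical toy job, nothing here comes from the text:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-eval-sketch")

rdd = sc.parallelize(range(1_000_000))
evens = rdd.filter(lambda x: x % 2 == 0)   # transformation: only recorded in the lineage
doubled = evens.map(lambda x: x * 2)       # still nothing has executed

print(doubled.count())                     # action: the whole chain runs now

sc.stop()
```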

6. Spark can be deployed in three ways:

1. **Standalone** - Spark runs on top of HDFS alongside MapReduce, with dedicated space
for HDFS.
2. **Hadoop YARN** - Spark runs on YARN, integrated into the Hadoop ecosystem, without
requiring pre-installation or root access.
3. **Spark in MapReduce (SIMR)** - Spark jobs are launched within a MapReduce
framework, allowing Spark usage without administrative access.

7. **Apache Spark Applications:**

- **Machine Learning:** Spark's MLlib library enables scalable advanced analytics like
clustering, classification, and dimensionality reduction.
- **Fog Computing**
- **Event Detection:** Spark Streaming allows real-time monitoring of unusual behaviors,
aiding in risk detection for financial, security, and health organizations.
- **Interactive Analysis**
- **Conviva:** A leading video streaming company, Conviva uses Spark to optimize video streams and manage live traffic.

8. **Apache Spark Core:**

Spark Core is the fundamental execution engine of the Spark platform, offering in-memory
computing and managing datasets (RDDs). It handles memory management, job scheduling,
fault recovery, and interaction with storage systems.
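
A minimal RDD sketch against Spark Core, assuming a local run; the names and data are illustrative:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

nums = sc.parallelize(range(10), numSlices=2)   # an RDD split across 2 partitions
squares = nums.map(lambda x: x * x).cache()     # keep the results in memory for reuse
print(squares.reduce(lambda a, b: a + b))       # action triggers execution: prints 285

sc.stop()
```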

9. **Key Features of Apache Spark Components:**

- **Spark Core:**
  - Manages basic I/O functionalities, cluster monitoring, and fault recovery.
  - Essential for programming with Spark.

- **Spark SQL:**
  - Built on Spark Core, it introduces SchemaRDDs for handling structured and semi-structured data.
  - Features an extensible cost-based optimizer to enhance query performance.
  - Provides a unified way to access various data sources and supports DataFrames (a small sketch follows this list).

- **Spark Streaming:**
  - Lightweight component for real-time data processing with fast scheduling.
  - Supports batch and stream processing with the same code.
  - Ensures reliable, exactly-once message guarantees and integrates with Spark MLlib for machine learning.
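
The DataFrame/SQL sketch referenced above, assuming local mode; the table data and column names are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("spark-sql-sketch").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"])

df.createOrReplaceTempView("people")                        # expose the DataFrame to SQL
spark.sql("SELECT name FROM people WHERE age > 30").show()  # query it like a table

spark.stop()
```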

10. **Applications of Spark Streaming:**

- **Industries:** Useful in online advertisements, finance, and supply chain management.
- **Technologies:** Applied in IoT sensors, diagnostics, and cybersecurity.

**Apache Spark MLlib:**

- **Overview:** A machine learning library with various algorithms, supporting Java, Scala,
and Python APIs.
- **Performance:** Nine times faster than Apache Mahout's disk-based implementation.
- **Algorithms:** Includes clustering, classification, decomposition, regression, and
collaborative filtering.
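
A hedged MLlib sketch: k-means clustering on a tiny in-memory dataset. The cluster count, seed and data points are arbitrary illustrative choices:

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.master("local[*]").appName("mllib-kmeans-sketch").getOrCreate()

# Two obvious groups of points, each row holding a feature vector.
data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),),
     (Vectors.dense([0.1, 0.1]),),
     (Vectors.dense([9.0, 9.0]),),
     (Vectors.dense([9.1, 9.1]),)],
    ["features"])

model = KMeans(k=2, seed=1).fit(data)   # fit a 2-cluster model
print(model.clusterCenters())           # the two learned centroids

spark.stop()
```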

**Apache Spark GraphX:**

- **Overview:** A distributed graph-processing framework built on Spark.
- **Use Case:** Provides an API for graph computation, useful for analyzing network-related data such as Facebook and LinkedIn user connections.

11. **Local Mode vs. Cluster Mode:**

- **Local Mode:** Spark runs everything on a single machine without needing a resource manager, making it easy to run Spark locally without additional setup.
- **Cluster Mode:** The Spark driver runs on one of the worker nodes and is managed by YARN. It is used for running production jobs, with the client disconnecting after the application starts.
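
A small sketch of selecting the mode from code, assuming PySpark. In practice the master and deploy mode are usually passed to `spark-submit` rather than hard-coded, and `"yarn"` assumes a configured YARN cluster:

```python
from pyspark.sql import SparkSession

# Local mode: driver and executors run inside a single JVM on this machine.
spark_local = (SparkSession.builder
               .master("local[*]")
               .appName("local-mode-sketch")
               .getOrCreate())
print(spark_local.range(3).count())
spark_local.stop()

# Cluster execution on YARN (requires Hadoop/YARN client configuration to be present):
# SparkSession.builder.master("yarn").appName("cluster-mode-sketch").getOrCreate()
```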

12. **Loading/Reading Data into an RDD:**

- **`textFile()`**: Reads a file as a collection of lines, supporting single or multiple files from
various filesystems (local, HDFS, S3). Allows specifying the number of partitions.
- **Syntax:** `SparkContext.textFile("path_of_the_file")`

- **`wholeTextFiles()`**: Reads files as pairs of filename and content, returning a pair-RDD with tuples of filenames and their contents.
- **Syntax:** `SparkContext.wholeTextFiles("path_of_the_file")`

**Storing/Saving an RDD:**

- **`saveAsTextFile()`**: Saves the RDD to the filesystem as one or more part files. Note that it is called on the RDD itself, not on the SparkContext.
- **Syntax:** `RDD.saveAsTextFile("FileSystem_path")`
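
A minimal end-to-end sketch of the calls above, run locally; `input.txt`, `logs/` and `out/` are placeholder paths:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-io-sketch")

lines = sc.textFile("input.txt", minPartitions=4)   # one record per line
pairs = sc.wholeTextFiles("logs/")                  # (filename, full file content) pairs

print(lines.count(), pairs.keys().collect())

lines.map(str.upper).saveAsTextFile("out/")         # writes out/part-00000, part-00001, ...

sc.stop()
```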
