1. Introduction to Spark:
*Open-source.
*Distributed computing system designed for big data processing and analytics.
*Developed to handle large-scale data processing.
2. Key Features of Apache Spark:
*Speed: lightning-fast data processing.
*Ease of use: offers APIs in multiple programming languages (Java, Scala, Python, and R).
*Built-in libraries for SQL (Spark SQL), machine learning (MLlib), and stream processing.
*Supports complex analytics beyond map and reduce operations.
*Real-time processing.
*Fault tolerance: automatically recovers lost data after node failures by using Resilient Distributed Datasets (RDDs); see the lineage sketch after this list.
*Integration: works with big data tools and technologies such as Hadoop HDFS, Amazon S3, Kubernetes, and Apache Mesos.
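A minimal PySpark sketch of the lineage idea behind that fault tolerance (names and data here are illustrative, not from the notes): each RDD remembers the transformations that produced it, so a lost partition can be recomputed from its parents.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; the app name is arbitrary.
spark = SparkSession.builder.master("local[*]").appName("fault-tolerance-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(10))           # base RDD
evens = numbers.filter(lambda n: n % 2 == 0)  # derived RDD
squares = evens.map(lambda n: n * n)          # another derived RDD

# toDebugString() prints the lineage graph; Spark uses this chain of
# transformations to recompute lost partitions after a node failure.
print(squares.toDebugString().decode())
print(squares.collect())  # [0, 4, 16, 36, 64]
```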
3. Core Components of Spark:
*Spark Core: provides basic functionality such as task scheduling, memory management, fault recovery, and interaction with storage systems; also responsible for loading and processing data.
*Spark SQL
*Spark Streaming
*MLlib
*GraphX: lets users model and run computations on graphs, e.g. the user-connection graphs of Facebook and LinkedIn.
4. **Why Spark?**
- **Speed:** Spark significantly speeds up data processing, running applications up to 100 times faster in memory and 10 times faster on disk than Hadoop MapReduce by minimizing disk read/write operations and keeping intermediate data in memory (see the caching sketch after this list).
- **Hadoop Integration:** It complements the Hadoop framework, which is based on the MapReduce model, by cutting the long query and execution wait times MapReduce incurs on large datasets.
- **Multi-language Support:** Spark offers built-in APIs for Java, Scala, and Python, allowing developers to write applications in the language they prefer.
- **Advanced Analytics:** It supports SQL queries, streaming data, machine learning, and graph algorithms, making it a versatile tool for complex data analysis.
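A minimal sketch of the in-memory point above (the input file name is a placeholder): `cache()` keeps a dataset in memory after the first action, so subsequent actions skip the disk entirely.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()
sc = spark.sparkContext

# Filter a (hypothetical) log file and keep the result in memory,
# since we are about to run several actions on it.
logs = sc.textFile("access.log")
errors = logs.filter(lambda line: "ERROR" in line).cache()

# The first action materializes and caches the filtered data;
# the second reads from the cache instead of re-scanning the file.
print(errors.count())
print(errors.take(5))
```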
5. Features of Apache Spark
• Speed
• Reusability
• In-memory computing
• Advanced analytics
• Real-time stream processing
• Lazy evaluation (illustrated in the sketch after this list)
• Dynamic in nature
• Fault tolerance
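A minimal sketch of lazy evaluation (data and names are illustrative): transformations such as `map` and `filter` only record the computation; nothing executes until an action is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(1_000_000))

# Transformations: these build a plan but run no work yet.
doubled = data.map(lambda n: n * 2)
small = doubled.filter(lambda n: n < 10)

# Action: only collect() triggers execution of the whole pipeline.
print(small.collect())  # [0, 2, 4, 6, 8]
```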
6. Spark can be deployed in three ways (the first two are sketched in code below):
1. **Standalone** - Spark runs on top of HDFS alongside MapReduce, with space allocated for HDFS explicitly.
2. **Hadoop YARN** - Spark runs on YARN, integrated into the Hadoop ecosystem, without requiring pre-installation or root access.
3. **Spark in MapReduce (SIMR)** - Spark jobs are launched from within a MapReduce framework, allowing Spark usage without administrative access.
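A minimal sketch of how the first two options look from application code (the master host is a placeholder; SIMR uses its own launcher rather than a master URL):

```python
from pyspark.sql import SparkSession

# Standalone: point the application at the standalone cluster's master URL.
spark = (SparkSession.builder
         .appName("deploy-demo")
         .master("spark://master-host:7077")  # placeholder host/port
         .getOrCreate())

# Hadoop YARN: use the "yarn" master instead; the cluster is located via
# the HADOOP_CONF_DIR / YARN_CONF_DIR environment variables.
# spark = SparkSession.builder.appName("deploy-demo").master("yarn").getOrCreate()

print(spark.version)
```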
7. **Apache Spark Applications:**
- **Machine Learning:** Spark's MLlib library enables scalable advanced analytics such as clustering, classification, and dimensionality reduction.
- **Fog Computing:** Spark can process decentralized IoT data close to where it is generated.
- **Event Detection:** Spark Streaming allows real-time monitoring of unusual behavior, aiding risk detection for financial, security, and health organizations.
- **Interactive Analysis:** In-memory execution makes ad-hoc, interactive queries over large datasets practical.
- **Conviva:** A leading video company, Conviva uses Spark to optimize video delivery and manage live traffic.
8. **Apache Spark Core:**
Spark Core is the fundamental execution engine of the Spark platform, providing in-memory computing and managing Resilient Distributed Datasets (RDDs). It handles memory management, job scheduling, fault recovery, and interaction with storage systems.
9. **Key Features of Apache Spark Components:**
- **Spark Core:**
- Manages basic I/O functionality, cluster monitoring, and fault recovery.
- Essential for programming with Spark.
- **Spark SQL:**
- Built on Spark Core, it introduced SchemaRDDs (later renamed DataFrames) for handling structured and semi-structured data.
- Features an extensible cost-based optimizer to improve query performance.
- Provides a unified way to access various data sources and supports DataFrames (see the sketch after this list).
- **Spark Streaming:**
- Lightweight component for real-time data processing with fast scheduling.
- Supports batch and stream processing with the same code.
- Offers reliable, exactly-once message guarantees and integrates with Spark MLlib for machine learning.
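A minimal Spark SQL sketch (the table and its columns are made up): a DataFrame gives the structured-data API described above, and the same data can be queried with plain SQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

# Build a small DataFrame from in-memory rows.
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cara", 29)],
    ["name", "age"],
)

# The DataFrame API and SQL are two views of the same engine.
people.filter(people.age > 30).show()

people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```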
10. **Applications of Spark Streaming:**
- **Industries:** Useful in online advertising, finance, and supply chain management.
- **Technologies:** Applied to IoT sensors, diagnostics, and cybersecurity. (A word-count sketch follows this list.)
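A minimal sketch using the classic DStream API (host and port are placeholders; newer Spark versions favor Structured Streaming for this): it counts words arriving on a socket in one-second micro-batches.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# At least two local threads: one for the receiver, one for processing.
sc = SparkContext("local[2]", "streaming-demo")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

# Placeholder source: text lines arriving on localhost:9999.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's word counts to the console

ssc.start()
ssc.awaitTermination()
```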
**Apache Spark MLlib:**
- **Overview:** A machine learning library with a wide range of algorithms, offering Java, Scala, and Python APIs.
- **Performance:** Up to nine times faster than Apache Mahout's disk-based implementation.
- **Algorithms:** Includes clustering, classification, decomposition, regression, and collaborative filtering (see the clustering sketch below).
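A minimal MLlib sketch (the data points are made up) showing one of the listed algorithms, k-means clustering, through the DataFrame-based `pyspark.ml` API:

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[*]").appName("mllib-demo").getOrCreate()

# Tiny made-up dataset: two obvious clusters, near (0, 0) and (9, 9).
data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"],
)

model = KMeans(k=2, seed=42).fit(data)
print(model.clusterCenters())
model.transform(data).show()  # adds a "prediction" column with the cluster id
```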
**Apache Spark GraphX:**
- **Overview:** A distributed graph-processing framework built on Spark.
- **Use Case:** Provides an API for graph computation, useful for analyzing network-related
data like Facebook and LinkedIn user connections.
11. **Local Mode vs. Cluster Mode:**
- **Local Mode:** Spark runs everything on a single machine without needing a resource manager, making it easy to run Spark locally with no additional setup.
- **Cluster Mode:** The Spark driver runs on one of the worker nodes and is managed by the resource manager (e.g. YARN). It is used for running production jobs, with the client free to disconnect after the application starts. (See the sketch after this list.)
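A minimal sketch of the practical difference (the app name is arbitrary): in local mode the master URL is set in code, while in cluster mode it is normally supplied by `spark-submit`, so production code usually leaves the master unset.

```python
from pyspark.sql import SparkSession

# Local mode: driver, executors, and scheduler all run on this machine;
# "local[*]" uses every available core. No resource manager is needed.
spark = SparkSession.builder.master("local[*]").appName("mode-demo").getOrCreate()

# Cluster mode: the same code omits .master(...) and is launched with
#   spark-submit --master yarn --deploy-mode cluster my_job.py
# YARN then runs the driver on a worker node, and the client may disconnect.
print(spark.range(5).count())
```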
12. **Loading/Reading Data into an RDD:**
- **`textFile()`**: Reads a file as a collection of lines, supporting single or multiple files from various filesystems (local, HDFS, S3). An optional second argument sets the minimum number of partitions.
- **Syntax:** `SparkContext.textFile("path_of_the_file")`
- **`wholeTextFiles()`**: Reads files as (filename, content) pairs, returning a pair-RDD of tuples of filenames and their contents.
- **Syntax:** `SparkContext.wholeTextFiles("path_of_the_file")`
**Storing/Saving an RDD:**
- **`saveAsTextFile()`**: Saves the RDD to the filesystem as one or more part files. Note that it is called on the RDD itself, not on the SparkContext.
- **Syntax:** `RDD.saveAsTextFile("FileSystem_path")`
(A combined read/write sketch follows.)
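A combined sketch of the three calls above (all paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("io-demo").getOrCreate()
sc = spark.sparkContext

# textFile(): one element per line; the optional second argument is the
# minimum number of partitions.
lines = sc.textFile("input/data.txt", 4)
print(lines.count())

# wholeTextFiles(): one (filename, full_content) pair per file.
pairs = sc.wholeTextFiles("input/")
print(pairs.keys().collect())

# saveAsTextFile() is called on the RDD, not the SparkContext, and writes
# one part file per partition under the output directory.
lines.map(str.upper).saveAsTextFile("output/upper")
```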