1. Introduction to Spark:
*Open-source.
*Distributed computing system designed for big data processing and analytics.
*Developed to handle large-scale data processing.
2. Key Features of Apache Spark:
*Speed: lightning-fast data processing.
*Ease of use: offers APIs in multiple programming languages (Java, Scala, Python, and R).
*Built-in libraries for SQL (Spark SQL), machine learning (MLlib), and stream processing.
*Supports complex analytics beyond map and reduce operations.
*Real-time processing.
*Fault tolerance: automatically recovers lost data after node failures by using Resilient Distributed Datasets (RDDs); see the lineage sketch after this list.
*Integration: works with big data tools and technologies such as Hadoop HDFS, Amazon S3, Kubernetes, and Apache Mesos.
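A minimal PySpark sketch of the lineage idea behind that fault tolerance (names and data here are illustrative, not from the notes): each RDD remembers the transformations that produced it, so a lost partition can be recomputed from its parents.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; the app name is arbitrary.
spark = SparkSession.builder.master("local[*]").appName("fault-tolerance-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(10))           # base RDD
evens = numbers.filter(lambda n: n % 2 == 0)  # derived RDD
squares = evens.map(lambda n: n * n)          # another derived RDD

# toDebugString() prints the lineage graph; Spark uses this chain of
# transformations to recompute lost partitions after a node failure.
print(squares.toDebugString().decode())
print(squares.collect())  # [0, 4, 16, 36, 64]
```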
3. Core Components of Spark:
*Spark Core: provides basic functionality such as task scheduling, memory management, fault recovery, and interaction with storage systems; also responsible for loading and processing data.
*Spark SQL
*Spark Streaming
*MLlib
*GraphX: lets users model and run computations on graphs, e.g. the user-connection graphs of Facebook and LinkedIn.
4. **Why Spark?**
- **Speed:** Spark significantly speeds up data processing, running applications up to 100 times faster in memory and 10 times faster on disk than Hadoop MapReduce by minimizing disk read/write operations and keeping intermediate data in memory (see the caching sketch after this list).
- **Hadoop Integration:** It complements the Hadoop framework, which is based on the MapReduce model, by cutting the long query and execution wait times MapReduce incurs on large datasets.
- **Multi-language Support:** Spark offers built-in APIs for Java, Scala, and Python, allowing developers to write applications in the language they prefer.
- **Advanced Analytics:** It supports SQL queries, streaming data, machine learning, and graph algorithms, making it a versatile tool for complex data analysis.
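A minimal sketch of the in-memory point above (the input file name is a placeholder): `cache()` keeps a dataset in memory after the first action, so subsequent actions skip the disk entirely.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()
sc = spark.sparkContext

# Filter a (hypothetical) log file and keep the result in memory,
# since we are about to run several actions on it.
logs = sc.textFile("access.log")
errors = logs.filter(lambda line: "ERROR" in line).cache()

# The first action materializes and caches the filtered data;
# the second reads from the cache instead of re-scanning the file.
print(errors.count())
print(errors.take(5))
```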
5. Features of Apache Spark
• Speed
• Reusability
• In-memory computing
• Advanced analytics
• Real-time stream processing
• Lazy evaluation (illustrated in the sketch after this list)
• Dynamic in nature
• Fault tolerance
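A minimal sketch of lazy evaluation (data and names are illustrative): transformations such as `map` and `filter` only record the computation; nothing executes until an action is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(1_000_000))

# Transformations: these build a plan but run no work yet.
doubled = data.map(lambda n: n * 2)
small = doubled.filter(lambda n: n < 10)

# Action: only collect() triggers execution of the whole pipeline.
print(small.collect())  # [0, 2, 4, 6, 8]
```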
6. Spark can be deployed in three ways (the first two are sketched in code below):
1. **Standalone** - Spark runs on top of HDFS alongside MapReduce, with space allocated for HDFS explicitly.
2. **Hadoop YARN** - Spark runs on YARN, integrated into the Hadoop ecosystem, without requiring pre-installation or root access.
3. **Spark in MapReduce (SIMR)** - Spark jobs are launched from within a MapReduce framework, allowing Spark usage without administrative access.
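A minimal sketch of how the first two options look from application code (the master host is a placeholder; SIMR uses its own launcher rather than a master URL):

```python
from pyspark.sql import SparkSession

# Standalone: point the application at the standalone cluster's master URL.
spark = (SparkSession.builder
         .appName("deploy-demo")
         .master("spark://master-host:7077")  # placeholder host/port
         .getOrCreate())

# Hadoop YARN: use the "yarn" master instead; the cluster is located via
# the HADOOP_CONF_DIR / YARN_CONF_DIR environment variables.
# spark = SparkSession.builder.appName("deploy-demo").master("yarn").getOrCreate()

print(spark.version)
```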
7. **Apache Spark Applications:**
- **Machine Learning:** Spark's MLlib library enables scalable advanced analytics such as clustering, classification, and dimensionality reduction.
- **Fog Computing:** Spark can process decentralized IoT data close to where it is generated.
- **Event Detection:** Spark Streaming allows real-time monitoring of unusual behavior, aiding risk detection for financial, security, and health organizations.
- **Interactive Analysis:** In-memory execution makes ad-hoc, interactive queries over large datasets practical.
- **Conviva:** A leading video company, Conviva uses Spark to optimize video delivery and manage live traffic.
8. **Apache Spark Core:**
Spark Core is the fundamental execution engine of the Spark platform, providing in-memory computing and managing Resilient Distributed Datasets (RDDs). It handles memory management, job scheduling, fault recovery, and interaction with storage systems.
9. **Key Features of Apache Spark Components:**
- **Spark Core:**
- Manages basic I/O functionality, cluster monitoring, and fault recovery.
- Essential for programming with Spark.
- **Spark SQL:**
- Built on Spark Core, it introduced SchemaRDDs (later renamed DataFrames) for handling structured and semi-structured data.
- Features an extensible cost-based optimizer to improve query performance.
- Provides a unified way to access various data sources and supports DataFrames (see the sketch after this list).
- **Spark Streaming:**
- Lightweight component for real-time data processing with fast scheduling.
- Supports batch and stream processing with the same code.
- Offers reliable, exactly-once message guarantees and integrates with Spark MLlib for machine learning.
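A minimal Spark SQL sketch (the table and its columns are made up): a DataFrame gives the structured-data API described above, and the same data can be queried with plain SQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

# Build a small DataFrame from in-memory rows.
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cara", 29)],
    ["name", "age"],
)

# The DataFrame API and SQL are two views of the same engine.
people.filter(people.age > 30).show()

people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```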
10. **Applications of Spark Streaming:**
- **Industries:** Useful in online advertising, finance, and supply chain management.
- **Technologies:** Applied to IoT sensors, diagnostics, and cybersecurity. (A word-count sketch follows this list.)
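A minimal sketch using the classic DStream API (host and port are placeholders; newer Spark versions favor Structured Streaming for this): it counts words arriving on a socket in one-second micro-batches.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# At least two local threads: one for the receiver, one for processing.
sc = SparkContext("local[2]", "streaming-demo")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

# Placeholder source: text lines arriving on localhost:9999.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's word counts to the console

ssc.start()
ssc.awaitTermination()
```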
**Apache Spark MLlib:**
- **Overview:** A machine learning library with a wide range of algorithms, offering Java, Scala, and Python APIs.
- **Performance:** Up to nine times faster than Apache Mahout's disk-based implementation.
- **Algorithms:** Includes clustering, classification, decomposition, regression, and collaborative filtering (see the clustering sketch below).
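A minimal MLlib sketch (the data points are made up) showing one of the listed algorithms, k-means clustering, through the DataFrame-based `pyspark.ml` API:

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[*]").appName("mllib-demo").getOrCreate()

# Tiny made-up dataset: two obvious clusters, near (0, 0) and (9, 9).
data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"],
)

model = KMeans(k=2, seed=42).fit(data)
print(model.clusterCenters())
model.transform(data).show()  # adds a "prediction" column with the cluster id
```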
**Apache Spark GraphX:**
- **Overview:** A distributed graph-processing framework built on Spark.
- **Use Case:** Provides an API for graph computation, useful for analyzing network-related
data like Facebook and LinkedIn user connections.
11. **Local Mode vs. Cluster Mode:**
- **Local Mode:** Spark runs everything on a single machine without needing a resource manager, making it easy to run Spark locally with no additional setup.
- **Cluster Mode:** The Spark driver runs on one of the worker nodes and is managed by the resource manager (e.g. YARN). It is used for running production jobs, with the client free to disconnect after the application starts. (See the sketch after this list.)
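A minimal sketch of the practical difference (the app name is arbitrary): in local mode the master URL is set in code, while in cluster mode it is normally supplied by `spark-submit`, so production code usually leaves the master unset.

```python
from pyspark.sql import SparkSession

# Local mode: driver, executors, and scheduler all run on this machine;
# "local[*]" uses every available core. No resource manager is needed.
spark = SparkSession.builder.master("local[*]").appName("mode-demo").getOrCreate()

# Cluster mode: the same code omits .master(...) and is launched with
#   spark-submit --master yarn --deploy-mode cluster my_job.py
# YARN then runs the driver on a worker node, and the client may disconnect.
print(spark.range(5).count())
```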
12. **Loading/Reading Data into an RDD:**
- **`textFile()`**: Reads a file as a collection of lines, supporting single or multiple files from various filesystems (local, HDFS, S3). An optional second argument sets the minimum number of partitions.
- **Syntax:** `SparkContext.textFile("path_of_the_file")`
- **`wholeTextFiles()`**: Reads files as (filename, content) pairs, returning a pair-RDD of tuples of filenames and their contents.
- **Syntax:** `SparkContext.wholeTextFiles("path_of_the_file")`
**Storing/Saving an RDD:**
- **`saveAsTextFile()`**: Saves the RDD to the filesystem as one or more part files. Note that it is called on the RDD itself, not on the SparkContext.
- **Syntax:** `RDD.saveAsTextFile("FileSystem_path")`
(A combined read/write sketch follows.)
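A combined sketch of the three calls above (all paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("io-demo").getOrCreate()
sc = spark.sparkContext

# textFile(): one element per line; the optional second argument is the
# minimum number of partitions.
lines = sc.textFile("input/data.txt", 4)
print(lines.count())

# wholeTextFiles(): one (filename, full_content) pair per file.
pairs = sc.wholeTextFiles("input/")
print(pairs.keys().collect())

# saveAsTextFile() is called on the RDD, not the SparkContext, and writes
# one part file per partition under the output directory.
lines.map(str.upper).saveAsTextFile("output/upper")
```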