Scala and Spark
Overview
Scala and Spark
● In this lecture we will give an overview of the
Scala programming language
● Then we will discuss the general Big Data
Ecosystem
● Afterwards we will show how Apache Spark
fits into all of this.
Scala
● Scala is a general purpose programming
language
● It was designed by Martin Odersky in the
early 2000s at EPFL (École Polytechnique
Fédérale de Lausanne)
● It was designed to address many of the
criticisms of Java.
Scala
● Scala source code is intended to be compiled
to Java bytecode to run on a Java Virtual
Machine (JVM)
● Java libraries may be used directly in Scala
● Unlike Java, Scala has many features of
functional programming
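● As a quick illustrative sketch (not part of the original slides), here are a few lines of Scala showing its functional style and direct use of a Java library:

```scala
// Minimal sketch of Scala's functional features and Java interop.
// Everything here uses only the standard library.
object ScalaTaste extends App {
  // Immutable values and higher-order functions
  val nums = List(1, 2, 3, 4)
  val doubled = nums.map(n => n * 2)      // List(2, 4, 6, 8)

  // Functions are first-class values
  val isEven: Int => Boolean = _ % 2 == 0
  println(nums.filter(isEven))            // List(2, 4)

  // A Java library class used directly from Scala
  val now = new java.util.Date()
  println(s"Compiled to JVM bytecode, running at $now")
}
```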
Scala
● A large reason that demand for Scala has
risen dramatically in recent years is
Apache Spark.
● Let’s discuss what Spark is in the context of
Big Data.
● We’ll begin with a general explanation of what
Big Data is and related technologies.
Big Data Overview
● Explanation of Hadoop, MapReduce, and Spark
● Local versus Distributed Systems
● Overview of Hadoop Ecosystem
● Overview of Spark
Big Data
● Data that can fit on a local computer is on the scale of
0-32 GB, depending on RAM.
● But what can we do if we have a larger set of data?
○ Try using a SQL database to move storage onto a
hard drive instead of RAM
○ Or use a distributed system that distributes the
data across multiple machines/computers.
Local versus Distributed
[Diagram: a local machine using the cores of a single CPU versus a distributed system pooling the cores of many machines connected over a network]
Big Data
● A local process will use the computation resources of a
single machine
● A distributed process has access to the computational
resources across a number of machines connected
through a network
● After a certain point, it is easier to scale out to many lower-
CPU machines than to try to scale up to a single machine
with a high-end CPU
Big Data
● Distributed machines also have the advantage of scaling
easily: you can just add more machines
● They also provide fault tolerance: if one machine fails, the
rest of the network can carry on.
● Let’s discuss the typical format of a distributed
architecture that uses Hadoop
Hadoop
● Hadoop is a way to distribute very large files across
multiple machines.
● It uses the Hadoop Distributed File System (HDFS)
● HDFS allows a user to work with large data sets
● HDFS also duplicates blocks of data for fault tolerance
● It also then uses MapReduce
● MapReduce allows computations on that data
Distributed Storage - HDFS
[Diagram: a Name Node (with its own CPU and RAM) coordinating several Data Nodes, each with its own CPU and RAM]
Distributed Storage - HDFS
● HDFS will use blocks of data, with a size of 128 MB by default
● Each of these blocks is replicated 3 times
● The blocks are distributed in a way to support fault tolerance
[Diagram: Name Node and Data Nodes, each with CPU and RAM]
Distributed Storage - HDFS
● Smaller blocks provide more parallelization during processing
● Multiple copies of a block prevent loss of data due to a failure of a node
[Diagram: Name Node and Data Nodes, each with CPU and RAM]
MapReduce
● MapReduce is a way of splitting a computation task across a distributed set of files (such as HDFS)
● It consists of a Job Tracker and multiple Task Trackers
[Diagram: a Job Tracker (with its own CPU and RAM) coordinating several Task Trackers, each with its own CPU and RAM]
MapReduce
● The Job Tracker sends code to run on the Task Trackers
● The Task Trackers allocate CPU and memory for the tasks and monitor the tasks on the worker nodes
[Diagram: Job Tracker and Task Trackers, each with CPU and RAM]
Big Data
● What we covered can be thought of in two distinct parts:
○ Using HDFS to distribute large data sets
○ Using MapReduce to distribute a computational task to
a distributed data set
● Next we will learn about the latest technology in this space
known as Spark.
● Spark improves on these ideas of distributing storage and computation
Spark
● This lecture will be an abstract overview; we will discuss:
○ Spark
○ Spark vs MapReduce
○ Spark RDDs
○ Spark DataFrames
Spark
● Spark is one of the latest technologies being used to
quickly and easily handle Big Data
● It is an open source project on Apache
● It was first released in February 2013 and has exploded in
popularity due to its ease of use and speed
● It was created at the AMPLab at UC Berkeley
Spark
● You can think of Spark as a flexible alternative to
MapReduce
● Spark can use data stored in a variety of formats
○ Cassandra
○ AWS S3
○ HDFS
○ And more
Spark vs MapReduce
● MapReduce requires files to be stored in HDFS, Spark
does not!
● Spark also can perform operations up to 100x faster than
MapReduce
● So how does it achieve this speed?
Spark vs MapReduce
● MapReduce writes most data to disk after each map and
reduce operation
● Spark keeps most of the data in memory after each
transformation
● Spark can spill over to disk if the memory is filled
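● A hedged sketch of what this looks like in code (the SparkContext `sc` and the file path are assumed, not from the slides): caching an RDD keeps it in memory across repeated actions.

```scala
// Sketch: keep an RDD in memory so repeated actions avoid recomputation.
// Assumes an existing SparkContext `sc`; the HDFS path is hypothetical.
val logs   = sc.textFile("hdfs:///data/logs.txt")
val errors = logs.filter(line => line.contains("ERROR"))

errors.cache()   // mark the RDD to be kept in memory once computed

val totalErrors = errors.count()                              // first action: reads from disk, then caches
val loginErrors = errors.filter(_.contains("login")).count()  // reuses the cached data; spills to disk only if memory fills
```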
Spark RDDs
● At the core of Spark is the idea of a Resilient Distributed
Dataset (RDD)
● Resilient Distributed Dataset (RDD) has 4 main features:
○ Distributed Collection of Data
○ Fault-tolerant
○ Parallel operation - partitioned
○ Ability to use many data sources
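● As a rough sketch of these features in code (assuming a SparkContext `sc`; the paths are hypothetical), RDDs can be built from local collections or from distributed storage, and are split into partitions for parallel work:

```scala
// Sketch: creating RDDs from different data sources.
// Assumes a SparkContext `sc`; file paths are hypothetical.
val fromMemory = sc.parallelize(Seq(1, 2, 3, 4, 5))         // distribute a local collection
val fromHdfs   = sc.textFile("hdfs:///data/input.txt")      // one element per line of an HDFS file
val fromS3     = sc.textFile("s3a://my-bucket/input.txt")   // same API, different storage backend

println(fromMemory.getNumPartitions)  // the data is partitioned for parallel, fault-tolerant operation
```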
Spark RDDs
● RDDs are immutable, lazily evaluated, and cacheable
● There are two types of RDD operations:
○ Transformations
○ Actions
● Transformations are basically a recipe to follow.
● Actions actually perform what the recipe says to do and
return something back.
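● A small illustrative sketch (assuming a SparkContext `sc`): the transformations below only record the recipe, and nothing is computed until the action runs.

```scala
// Sketch: transformations are lazy; an action triggers the computation.
// Assumes a SparkContext `sc`.
val numbers = sc.parallelize(1 to 10)

// Transformations: recorded, but not executed yet
val squares    = numbers.map(n => n * n)
val bigSquares = squares.filter(_ > 25)

// Action: runs the whole recipe and returns a result to the driver
val result = bigSquares.collect()   // Array(36, 49, 64, 81, 100)
```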
Spark RDDs
● When discussing Spark syntax you will see RDD versus
DataFrame syntax show up.
● With the release of Spark 2.0, Spark is moving towards a
DataFrame-based syntax
● Keep in mind that the way the files are being distributed
can still be thought of as RDDs; it is just the syntax you
type that is changing
Spark RDDs
● We’ve covered a lot!
● Don’t worry if you didn’t memorize all these details, a lot of
this will be covered again as we learn about how to
actually code out and utilize these ideas!
Spark RDDs
● Basic Actions
○ First
○ Collect
○ Count
○ Take
Spark RDDs
● Collect - Return all the elements of the RDD as an array at
the driver program.
● Count - Return the number of elements in the RDD
● First - Return the first element in the RDD
● Take - Return an array with the first n elements of the
RDD
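● For reference, a minimal sketch of these actions (assuming a SparkContext `sc` and a small example RDD):

```scala
// Sketch of the four basic actions. Assumes a SparkContext `sc`.
val words = sc.parallelize(Seq("spark", "scala", "hadoop", "hdfs"))

words.first()     // "spark"                                   -- the first element
words.count()     // 4                                         -- number of elements
words.take(2)     // Array("spark", "scala")                   -- the first n elements
words.collect()   // Array("spark", "scala", "hadoop", "hdfs") -- everything, back at the driver
```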
Spark RDDs
● Basic Transformations
○ Filter
○ Map
○ FlatMap
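● And a matching sketch of the basic transformations (again assuming a SparkContext `sc`); note that flatMap can produce zero or more outputs per input element:

```scala
// Sketch of the three basic transformations. Assumes a SparkContext `sc`.
val lines = sc.parallelize(Seq("hello spark", "hello scala"))

val sparkLines = lines.filter(line => line.contains("spark"))  // keep only matching elements
val upperLines = lines.map(line => line.toUpperCase)           // exactly one output per input
val words      = lines.flatMap(line => line.split(" "))        // zero or more outputs per input

words.collect()   // Array("hello", "spark", "hello", "scala")
```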
Spark DataFrames
● Spark DataFrames are also now the standard
way of using Spark’s Machine Learning
Capabilities.
● Spark DataFrame documentation is still pretty
new and can be sparse.
● Let’s get a brief tour of the documentation!
http://spark.apache.org/
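● As a hedged taste of the DataFrame syntax (the file path and column names are made up for illustration), reading a CSV and running a simple query with Spark 2.x looks roughly like this:

```scala
// Sketch: reading and querying a Spark DataFrame (Spark 2.x style).
// The file path and column names are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DataFrameTour")
  .getOrCreate()

val people = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("people.csv")

people.printSchema()                        // inspect the inferred column types
people.filter(people("age") > 30).show()    // SQL-like query over distributed data
```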
[Diagram: Data Science (DS) Venn diagram combining Computer Science, Math & Statistics, and Domain Knowledge, with Machine Learning, Software, and Research at the overlaps]