Spark Programming

Programming
Data Infrastructure Team
엄태욱
2015-10-29
• SlideShare: http://goo.gl/yWFglI
• Github: https://goo.gl/zGnNtt
• Spark Package Download: http://goo.gl/vvpcrd (275.6MB)

http://spark.apache.org/
http://spark.apache.org/
• Developed in 2009 at UC Berkeley AMPLab
• Open sourced in 2010

Spark’s Goal
Support batch, streaming, and interactive computations
in a unified framework
http://strataconf.com/stratany2013/public/schedule/detail/30959

Spark Streaming
(Stream processing)
Spark SQL
(SQL/HQL)
MLlib
(Machine learning)
GraphX
(Graph computation)
Spark (Core Execution Engine)
Mesos Standalone YARN
Unified Platform
File System (Local, HDFS, S3, Tachyon) Data Store (Hive, HBase, Cassandra, …)
BlinkDB
(Approximate SQL)
Berkeley Data Analytics Stack
https://amplab.cs.berkeley.edu/software/

To validate our hypothesis
that specialized frameworks provide value over general ones,
we have also built a new framework
on top of Mesos called Spark,
optimized for iterative jobs
where a dataset is reused in many parallel operations,
and shown that Spark can outperform Hadoop by 10x
in iterative machine learning workloads.

http://spark.apache.org/docs/latest/cluster-overview.html
• Cluster Manager
• external service for acquiring resources on the cluster
• Standalone, Mesos, YARN
• Worker node
• Any node that can run application code in the cluster
• Application
• driver program + executors
• SparkContext
• application session
• connection to a cluster

http://spark.apache.org/docs/latest/cluster-overview.html
• Driver Program
• Process running the main()
• create SparkContext
• Executor
• process launched on a worker node
• Each application has its own executors
• Long-running and runs many small tasks
• keeps data in memory/disk storage
• Task
• Unit of work that will be sent to one executor

http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/Spark-Survey-2015-Report.pdf

Databricks Scala Guide
https://github.com/databricks/scala-style-guide

$ git clone https://github.com/apache/spark.git
$ git tag -l
$ git checkout tags/v1.5.1
Package Build
for CDH 5.4.7
$ JAVA_HOME=$(/usr/libexec/java_home -v 1.8) ./make-distribution.sh
--name hadoop2.6.0-chd5.4.7 --tgz --with-tachyon
-Pyarn -Phive -Phive-thriftserver -Phadoop-2.6
-Dhadoop.version=2.6.0-cdh5.4.7 -DskipTest
spark-1.5.1-bin-hadoop2.6.0-chd5.4.7.tgz
for Local Test
$ JAVA_HOME=$(/usr/libexec/java_home -v 1.8) ./make-distribution.sh
--name hadoop2.4 --tgz -Phadoop-2.4 -Pyarn -Phive -DskipTest
spark-1.5.1-bin-hadoop2.4.tgz

Pre-build Package
http://mirror.apache-kr.org/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.4.tgz
(275.6MB)
http://spark.apache.org/downloads.html

Shell in Local Mode
REPL(Read-Eval-Print Loop) = Interactive Shell
import org.apache.spark.{SparkContext, SparkConf} 
val sc = new SparkContext("local[*]", "Spark shell", new SparkConf())
scala>
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@36c783ca
자동으로 다음과 같이 SparkContext(sc) 생성한 후 shell prompt
• 장점
• 분산처리에서 어려운 local test/debugging, unit test
• lazy 처리: 데이터 처리 없이 syntax 먼저 검증 가능
• 단점
• 메모리 한계
• jar loading 차이(YARN/Mesos Shell/Submit Mode 차이)

$ cd ../
$ tar zxvf spark/spark-1.5.1-bin-spark-hadoop2.4.tgz
$ cd spark-1.5.1-bin-spark-hadoop2.4/
$ bin/spark-shell --master local[*]
Run Spark Shell

$ cat init.script
import java.lang.Runtime 
println(s"cores = ${Runtime.getRuntime.availableProcessors}")
$ SPARK_PRINT_LAUNCH_COMMAND=1 bin/spark-shell --master local[*] -i init.script
Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/
bin/java -cp /Users/taewook/spark-1.5.1-bin-spark-hadoop2.4/conf/:/Users/taewook/
spark-1.5.1-bin-spark-hadoop2.4/lib/spark-assembly-1.5.1-hadoop2.4.0.jar:/Users/
taewook/spark-1.5.1-bin-spark-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/Users/
taewook/spark-1.5.1-bin-spark-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/Users/
taewook/spark-1.5.1-bin-spark-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar -
Dscala.usejavacp=true -Xms1g -Xmx1g org.apache.spark.deploy.SparkSubmit --master
local[*] --class org.apache.spark.repl.Main --name Spark shell spark-shell -i
init.script
========================================
...
Type :help for more information.
...
Spark context available as sc.
...
Loading init.script...
import java.lang.Runtime
cores = 4
scala>
scala> :load init.script
Loading init.script...
import java.lang.Runtime
cores = 4
Run Spark Shell
:paste enter paste mode: all input up to
ctrl-D compiled together
:cp <path> add a jar or directory to the classpath
:history [num] show the history
(optional num is commands to show)
~/.spark_history
java 실행 명령으로 classpath, JVM 옵션 등 확인
함수/클래스 정의 등 미리 실행하면 편리한 초기화 명령 수행

Web UI
http://localhost:4040/
sc.getConf.getAll.foreach(println)
local shell은
executor 1개(=driver) 
core 수는 local[N]만큼

An RDD is a read-only collection of objects
partitioned across a set of machines
that can be rebuilt if a partitionis lost.

• Read-only = Immutable
• Parallelism ➔ 분산 처리
• 오랫동안 Caching 가능 ➔ 성능
• Transformation for change
• 데이터 복사 반복 ➔ 성능➡, 공간 낭비
• Laziness로 극복
Core abstraction in the core of Spark
Resilient Distributed Dataset

• Partitioned = Distributed
• more partiton = more parallelism

• can be rebuilt = Resilient
• recover from lost data partitions
• by data lineage
• can be cached
• lineage 짧게 줄여서 더 빠르게 복구

• Scala Collection API와 비슷 + 분산 데이터 연산
• map(), filter(), reduce(), count(), foreach(), …
• http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
• Transformations
• return a new RDD
• deterministic - 실패해서 재실행해도 결과 항상 같음
• lazy evaluation
• Actions
• return a final value(some other data type)
• 첫 RDD부터 실제 실행(Caching 하면 cache된 RDD부터 실행)
RDD Operations
http://training.databricks.com/workshop/sparkcamp.pdf

http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
http://www.supergloo.com/fieldnotes/apache-spark-examples-of-transformations/
http://www.supergloo.com/fieldnotes/apache-spark-examples-of-actions/
RDD API Examples

• Laziness = Lazily constructed
• 지연 계산 - 필요할 때까지 연산 미루기
• evaluation과 execution의 분리
• 실행 전에 최소한의 오류 검사
• 중간 RDD 결과값 저장 불필요
• Intermediate RDDs not materialized
• Immutability & Laziness
• Immutability ➔ Laziness 가능
• side-effect 없어 transformation들 combine 가능
• combine node steps into “stages” (최적화)
• ➔ 성능 , 분산 처리 가능

Creating RDDs
• parallelizing a collection
• 한 대의 driver 장비의 메모리에 모두 올림
• for only prototyping, testing
• loading an external data set
• 외부 소스로부터 읽기
• sc.textFile(): file://, hdfs://, s3n://
• sc.hadoopFile(), sc.newAPIHadoopFile()
• sqlContext.sql(), JdbcRDD(), …
scala> val numbers = sc.parallelize(1 to 10) 
numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0]
at parallelize at <console>:21
scala> val textFile = sc.textFile("README.md") 
textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2]
at textFile at <console>:21

scala> val numbers = sc.parallelize(1 to 10) 
numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0]
at parallelize at <console>:21
scala> numbers.partitions.length
res0: Int = 4
 
scala> numbers.glom().collect()
res1: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5),
Array(6, 7), Array(8, 9, 10))
scala> val numbersWith2Partitions = sc.parallelize(1 to 10, 2) 
numbersWith2Partitions: org.apache.spark.rdd.RDD[Int] =
ParallelCollectionRDD[2] at parallelize at <console>:21
scala> numbersWith2Partitions.partitions.length 
res2: Int = 2
scala> numbersWith2Partitions.glom().collect()
res3: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5), Array(6, 7,
8, 9, 10))
Partitions
numbers.mapPartitionsWithIndex()

https://github.com/deanwampler/spark-workshop
Partitions
• Logical division of data
• chunk of HDFS
• location aware
• basic unit of parallelism
• RDD is just collection of partitions

Word Count
val textFile = sc.textFile("README.md", 4)
 
val words = textFile.flatMap(line => line.split("[s]+"))
val realWords = words.filter(_.nonEmpty)
 
val wordTuple = realWords.map(word => (word, 1))
 
val groupBy = wordTuple.groupByKey(2)
 
val wordCount = groupBy.mapValues(value => value.reduce(_ + _))
 
wordCount.collect().sortBy(-_._2)

Word Count
textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
 
val words = textFile.flatMap(line => line.split("[s]+"))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:23
val realWords = words.filter(_.nonEmpty)
realWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at filter at <console>:25
 
val wordTuple = realWords.map(word => (word, 1))
wordTuple: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:27
 
groupBy: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[5] at groupByKey at
<console>:29
 
val wordCount = groupBy.mapValues(value => value.reduce(_ + _))
wordCount: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at mapValues at
<console>:31
 
wordCount.collect().sortBy(-_._2)
res0: Array[(String, Int)] = Array((the,21), (Spark,14), (to,14), (for,12), (a,10), (and,10),
(##,8), (run,7), (can,6), (is,6), (on,6), (also,5), (in,5), (of,5), (with,4), (if,4), ...
wordCount.saveAsTextFile("wordcount.txt")

• Main program are executed on the Spark Driver
• Transformations are executed on the Spark Workers
• Actions may transfer from the Workers to the Driver
collect(), countByKey(), countByValue(), collectAsMap()
➔ • bounded output: count(), take(N)
• unbounded output: saveAsTextFile()
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
executor(on worker) 동작 영역과 driver 동작 영역 구분
driver에서는 action과 accumulator 외에 executor의 데이터 받을 수 없음

Data Lineage of RDD
scala> wordCount.toDebugString
res1: String =
(2) MapPartitionsRDD[6] at mapValues at <console>:31 []
| ShuffledRDD[5] at groupByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
scala> wordCount.dependencies.head.rdd
res2: org.apache.spark.rdd.RDD[_] = ShuffledRDD[5] at groupByKey
at <console>:29
scala> textFile.dependencies.head.rdd
res3: org.apache.spark.rdd.RDD[_] = README.md HadoopRDD[0] at
textFile at <console>:21
scala> textFile.dependencies.head.rdd.dependencies
res4: Seq[org.apache.spark.Dependency[_]] = List()
모든 RDD는 부모 RDD 추적 ➔ DAG Scheduling과 복구의 기본

Data Lineage of RDD
res1: String =
flatMap
filter
map
groupByKey
mapValues
Step
Stage
textFile
textFile
Nil
Stage 1
Stage 0
Parent
shuffle
boundary

https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Wide Dependency 최소화
RDD Dependencies
Local
Reference
Network
Communication
(Shuffle)
Narrow Dependencies
N steps ➔ 1 stage = 1 task
cogroup()

join()

groupByKey()

reduceByKey()

combineByKey()

…

Quiz
groupByKeyhadoopFile
textFile map filter
map
join join

Schedule & Execute tasks
$ bin/spark-shell --master local[3]
...
...

groupBy: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[5] at groupByKey at
<console>:29
 
scala> groupBy.collect()
res5: Array[(String, Iterable[Int])] = Array((package,CompactBuffer(1)), (this,CompactBuffer(1)
), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-
version),CompactBuffer(1)), (Because,CompactBuffer(1)), (Python,CompactBuffer(1, 1)),
(cluster.,CompactBuffer(1)), (its,CompactBuffer(1)), ([run,CompactBuffer(1)),
(general,CompactBuffer(1, 1)), (YARN,,CompactBuffer(1)), (have,CompactBuffer(1)), (pre-
built,CompactBuffer(1)), (locally.,CompactBuffer(1)), (locally,CompactBuffer(1, 1)),
(changed,CompactBuffer(1)), (sc.parallelize(1,CompactBuffer(1)), (only,CompactBuffer(1)),
(several,CompactBuffer(1)), (learning,,CompactBuffer(1)), (basic,CompactBuffer(1)),
(first,CompactBuffer(1)), (This,CompactBuffer(1, 1)), (documentation,CompactBuffer(1, 1, 1)),
(Confi...
• HashMap within each partition
• no map-side aggregation (=combiner of MapReduce)
• single key-value pair must fit in memory (Out Of Memory/Disk)
groupByKey()
rdd.groupByKey().mapValues(value => value.reduce(func))
= rdd.reduceByKey(func)
대체 가능하다면 groupByKey 대신
reducedByKey, aggregateByKey, foldByKey, combineByKey 사용 권장

scala> val wordCountReduceByKey = wordTuple.reduceByKey(_ + _, 2)
wordCountReduceByKey: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[7] at reduceByKey at <console>:29
scala> wordCountReduceByKey.toDebugString
res6: String =
(2) ShuffledRDD[7] at reduceByKey at <console>:29 []
scala> wordCountReduceByKey.collect().sortBy(-_._2)
res7: Array[(String, Int)] = Array((the,21), (Spark,14), (to,14), (for,12), (a,10), (and,10), (##,8), (run,7), (can,6)
, (is,6), (on,6), (also,5), (in,5), (of,5), (with,4), (if,4), ...
reduceByKey()
scala> sc.setLogLevel("INFO")
...
... INFO DAGScheduler: Got job 3 (collect at <console>:32) with 2 output partitions
... INFO DAGScheduler: Final stage: ResultStage 7(collect at <console>:32)
... INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 6)
... INFO DAGScheduler: Missing parents: List()
... INFO DAGScheduler: Submitting ResultStage 7 (ShuffledRDD[7] at reduceByKey at <console>:29), which has no missing
parents
... INFO MemoryStore: ensureFreeSpace(2328) called with curMem=109774, maxMem=555755765
... INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 2.3 KB, free 529.9 MB)
... INFO MemoryStore: ensureFreeSpace(1378) called with curMem=112102, maxMem=555755765
... INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1378.0 B, free 529.9 MB)
... INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:60479 (size: 1378.0 B, free: 530.0 MB)
... INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:861
... INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 7 (ShuffledRDD[7] at reduceByKey at <console>:29)
...
res9: Array[(String, Int)] = Array((the,21), (Spark,14), (to,14), (for,12), (a,10), (and,10), (##,8), (run,7), (can,6)
, (is,6), (on,6), (also,5), (in,5), (of,5), (with,4), (if,4), ...

scala> sc.setLogLevel("INFO")
scala> val textFile = sc.textFile("README.md", 4)
scala> val words = textFile.flatMap(line => line.split("[s]+"))
scala> val realWords = words.filter(_.nonEmpty)
scala> val wordTuple = realWords.map(word => (word, 1))
scala> wordTuple.cache()
 
scala> val groupBy = wordTuple.groupByKey(2)
scala> val wordCount = groupBy.mapValues(value => value.reduce(
_ + _))
res2: String =
scala> wordCount.collect().sortBy(-_._2)
...
... INFO BlockManagerInfo: Added rdd_4_0 in memory on localhost:60641 (size: 9.9 KB, free: 530.0 MB)
...
Cache

scala> val wordCountReduceByKey = wordTuple.reduceByKey(_ + _, 2)
scala> wordCountReduceByKey.toDebugString 
res4: String =
(2) ShuffledRDD[7] at reduceByKey at <console>:29 []
| CachedPartitions: 4; MemorySize: 38.1 KB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
...
... INFO BlockManager: Found block rdd_4_0 locally
...
Cache
• Spark은 반복 연산에 특화되어 빠르지만 cache 안하면 무의미
• cache 없으면 첫 RDD부터 모두 계산
• LRU(Least-Recently-Used) 정책
• 바로 재사용하지 않으면 불필요하게 메모리에 캐싱할 필요 없음
• 수동 cache 해제: RDD.unpersist()
• 기본 StorageLevel(MEMORY_ONLY)은 deserialized to memory
• 메모리 CPU

val words = textFile.flatMap(line => line.split("[s]+")) 
val realWords = words.filter(_.nonEmpty) 
realWords.cache() 
val wordTuple = realWords.map(word => (word, 1)) 
wordTuple.cache() 
 
val groupBy = wordTuple.groupByKey(2) 
val wordCount = groupBy.mapValues(value => value.reduce(_ + _)) 
wordCount.collect().sortBy(-_._2) 
 
val wordCountReduceByKey = wordTuple.reduceByKey(_ + _, 2) 
wordCountReduceByKey.collect().sortBy(-_._2) 
 
realWords.countByValue().toArray.sortBy(-_._2)
scala.collection.Map[String,Long]
Word Count

• RDD API의 단점
• 대부분의 데이터는 구조화되어 있음 
(JSON, CSV, Avro, Parquet, ORC, Hive ...)
• 함수형 Transformation들은 직관적이지 않음
• ._1, ._2 방식으로 데이터 접근하기 불편하고 오류 가능성 높음
val data = sc.textFile(“people.tsv").map(line => line.split("t")) 
data.map(row => (row(0), (row(1).toLong, 1))) 
.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) 
.map { case (dept, values: (Long, Int)) => (dept, values._1 / values._2) } 
.collect() 
sqlContext.table("people") 
.groupBy('name) 
.avg('age) 
.collect()
DataFrame

DataFrame
• Distributed collection of rows organized into named columns.
• inspired by DataFrame in R and Pandas in Python
• RDD with schema (org.apache.spark.sql.SchemaRDD before v1.3)
• Python, Scala, Java, and R (via SparkR)
• Making Spark accessible to everyone (RDB, R에 익숙한 사람까지)
• data scientists, engineers, statisticians, ...
http://www.slideshare.net/databricks/2015-0616-spark-summit

DataFrame Operations
scala> val df = sqlContext.read.json("examples/src/main/resources/people.json") 
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
scala> df.explain(true)
== Parsed Logical Plan ==
Relation[age#3L,name#4] JSONRelation[file:/.../examples/src/main/resources/people.json]
== Analyzed Logical Plan ==
age: bigint, name: string
== Optimized Logical Plan ==
== Physical Plan ==
Scan JSONRelation[file:/.../examples/src/main/resources/people.json][age#3L,name#4]
Code Generation: true

DataFrame Operations
scala> df.select(df("name"), df("age") + 1).show()
+-------+---------+
| name|(age + 1)|
+-------+---------+
|Michael| null|
| Andy| 31|
| Justin| 20|
+-------+---------+
 
scala> df.filter(df("age") > 21).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+ 
scala> df.groupBy("age").count().show()
+----+-----+
| age|count|
+----+-----+
|null| 1|
| 19| 1|
| 30| 1|
+----+-----+
df.select('name, 'age + 1).show()
df.filter('age > 21).show()
df.groupBy('age).count().show()

DataFrame API
• API 사용법
• http://spark.apache.org/docs/latest/sql-programming-guide.html
• http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science
• https://github.com/yu-iskw/spark-dataframe-introduction/blob/master/doc/dataframe-introduction.md
• 책도 부족, 최신 내용은 Source Code 참고
• Example Code
• /spark/examples/src/main/scala/org/apache/spark/examples/sql/RDDRelation.scala
• TestSuite
• /spark/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

http://www.slideshare.net/databricks/2015-0616-spark-summit
최적화 포인트를
DataFrame
한 곳으로

Same performance for all languages
http://www.slideshare.net/databricks/spark-dataframes-simple-and-fast-analytics-on-structured-data-at-spark-summit-2015

• Simple tasks easy with DataFrame API
• Complex tasks possible with RDD API
scala> import sqlContext.implicits._ 
scala> case class Person(name: String, age: Int)
 
scala> val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p =>
Person(p(0), p(1).trim.toInt)).toDF()
people: org.apache.spark.sql.DataFrame = [name: string, age: int]
 
scala> people.registerTempTable("people")
teenagers: org.apache.spark.sql.DataFrame = [name: string, age: int]
 
scala> val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19") 
scala> teenagers.show()
+------+---+
| name|age|
+------+---+
|Justin| 19|
+------+---+
 
scala> val teenagersRdd = teenagers.rdd 
teenagersRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[8] at rdd at <console>:24
scala> teenagersRdd.toDebugString
res2: String =
(2) MapPartitionsRDD[8] at rdd at <console>:24 []
| MapPartitionsRDD[7] at rdd at <console>:24 []
| MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:26 []
| MapPartitionsRDD[3] at map at <console>:26 []
| MapPartitionsRDD[2] at map at <console>:26 []
| examples/src/main/resources/people.txt HadoopRDD[0] at textFile at <console>:26 []
 
scala> teenagersRdd.collect()
res3: Array[org.apache.spark.sql.Row] = Array([Justin,19])

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
https://twitter.com/pwendell/status/649993257414340608

Questions?
questions.foreach( answer(_) )

• Apache Spark User List
• http://apache-spark-user-list.1001560.n3.nabble.com/
• Devops Advanced Class
• http://training.databricks.com/devops.pdf
• Intro to Apache Spark
• http://training.databricks.com/workshop/sparkcamp.pdf
• Apache Spark Tutorial
• http://cdn.liber118.com/workshop/fcss_spark.pdf
• Anatomy of RDD : Deep dive into Spark RDD abstraction
• http://www.slideshare.net/datamantra/anatomy-of-rdd
• A Deeper Understanding of Spark’s Internals
• https://spark-summit.org/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf
• Scala and the JVM for Big Data: Lessons from Spark
• https://deanwampler.github.io/polyglotprogramming/papers/ScalaJVMBigData-SparkLessons.pdf
• Lightning Fast Big Data Analytics with Apache Spark
• http://www.virdata.com/wp-content/uploads/Spark-Devoxx2014.pdf
References

Quiz
groupByKeyhadoopFile
textFile map filter
map
join join
4 Stages!

PySpark
http://spark.apache.org/docs/latest/api/python/
+
Jupyter Notebook
https://jupyter.org/

• Partition 수
• 너무 큰 파일 읽을 때는 coalesce(N) 사용해서 executor 수 줄임
• repartition 없이 하나의 executor가 여러 partition 처리
• 오래걸리는 CPU 연산은 repartition(N) 으로 executor 수 늘림
• Executor 수
• Job 내 최대 partition 수의 2배 이상
• Executor Memory
• 장비 메모리의 최대 75% 사용 권장
• 최소 Heap 크기는 8GB
• 최대 Heap 크기는 40GB 넘지 않도록 (GC 확인 필요)
• 메모리 사용량은 StorageLevel과 Serialization 형식에 영향 받음
• G1GC 설정
• https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
• 각종 Tuning
• http://spark.apache.org/docs/latest/tuning.html
Tips

• min, max, sum, mean 여러번 따로 구하지 말고 stats() 한 번에 구하기
• count, sum, min, max, mean, stdev, variance, sampleStdev, smapleVariance
• Shuffle 문제 확인 방법
• Web UI에서 오래 걸리거나 큰 input/output이 있는 stage/partition 확인
• KryoSerializer
• /conf/spark-defaults.conf
• "spark.serializer", “org.apache.spark.serializer.KryoSerializer"
• StorageLevel.MEMORY_ONLY_SER 과 함께 사용 권장
• Task not serializable: java.io.NotSerializableException
• executor 내에서 수행되는 함수 내에서 객체 생성 및 사용
• https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
Tips

Spark Programming

More Related Content

What's hot

Similar to Spark Programming

Recently uploaded

Spark Programming