Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark

Spark After Dark
Generating High-Quality Recommendations using
Real-time Advanced Analytics and Machine Learning with Spark
Chris Fregly
chris@fregly.com

Who am I?
Streaming Platform Engineer
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Spark Contributor
Spark Author
Consultant, Trainer
advancedspark.com

Why After Dark?
Playboy After Dark
Late 1960s TV Show
Progressive Show For Its Time
And it rhymes!!

What is Spark?
Spark Core
Spark Streaming: real-time
Spark SQL: structured data
MLlib: machine learning
GraphX: graph analytics
…
BlinkDB: approx queries

What is Databricks?
Founded by the creators of Spark
Spark as a Service
Amazon AWS based
Powerful Visualizations
Collaborative Notebooks
Scala/Java, Python, SQL, R
Flexible Cluster Management
Job Scheduling and Monitoring

Goals of After Dark?
①Generate high-quality recommendations
②Demonstrate Spark high-level libraries:
Spark Streaming -> Kafka, Approximations
Spark SQL -> DataFrames, Cassandra
GraphX -> PageRank, Shortest Path
MLlib -> Matrix Factorization, Word2Vec
Images courtesy of tinder.com. Not affiliated with Tinder in any way!

Focus of This Talk
①Parallelism
②Performance
③Real-time Streaming
④Approximations
⑤Similarity Measures
Spark and…

Brady Bunch circa 1974
Season 5, Episode 18: “Two Petes in a Pod”

Parallel Algorithm : O(log n)

Non-parallel Algorithm : O(n)
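A minimal sketch of the contrast between these two slides, assuming the task is summing n numbers. A left-to-right fold needs n - 1 dependent steps, while a pairwise tree reduction needs about log2(n) rounds if every addition within a round runs in parallel. The function names are illustrative and the parallel rounds are only simulated here.

```python
def sequential_sum(values):
    """Fold left to right; returns (total, number of dependent steps)."""
    total, steps = values[0], 0
    for v in values[1:]:
        total += v
        steps += 1
    return total, steps


def tree_sum(values):
    """Pairwise reduction; returns (total, number of rounds).

    Additions within one round are independent of each other, so with
    enough workers each round costs a single time step.
    """
    level, rounds = list(values), 0
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:              # odd element carries over to the next round
            paired.append(level[-1])
        level = paired
        rounds += 1
    return level[0], rounds


data = list(range(1, 1025))             # n = 1024
seq_total, seq_steps = sequential_sum(data)
par_total, par_rounds = tree_sum(data)
```

With n = 1024 the fold takes 1023 steps and the tree takes 10 rounds, which is the O(n) versus O(log n) gap.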

Daytona Gray Sort Contest
On-disk only
250,000 partitions
No in-memory caching
[Results table for the 2013 and 2014 entries not recoverable]

Improved Shuffle and Network Layer
①“Sort-based shuffle”
②Minimize OS resources
③Switched to async Netty
④Keep CPUs hot
⑤Reuse byte buffers to minimize GC
⑥Use epoll for I/O to stay in kernel space

Project Tungsten: CPU and Memory
①More JVM bytecode generation, JIT optimize
②CPU-cache-aware data structs and algos
③Custom memory management
Serializers, HashMap

DataFrames and Catalyst
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
Please use DataFrames!!
--> JVM bytecode generation

Columnar Storage Format
*Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
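A minimal sketch of the min-max skipping idea, assuming a single sorted column split into fixed-size chunks. The chunk layout and function names are hypothetical and are not Parquet's actual on-disk format.

```python
def build_chunks(values, chunk_size):
    """Split a column into chunks and record min/max stats per chunk."""
    chunks = []
    for i in range(0, len(values), chunk_size):
        vals = values[i:i + chunk_size]
        chunks.append({"min": min(vals), "max": max(vals), "values": vals})
    return chunks


def scan_greater_than(chunks, threshold):
    """Return matching values and how many chunks actually had to be read."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if chunk["max"] <= threshold:   # stats prove nothing in here can match
            continue
        chunks_read += 1
        matches.extend(v for v in chunk["values"] if v > threshold)
    return matches, chunks_read


column_chunks = build_chunks(list(range(100)), chunk_size=10)
matches, chunks_read = scan_greater_than(column_chunks, 84)
```

Because the data is sorted, only 2 of the 10 chunks overlap the predicate. On unsorted data most chunks would span the threshold and little could be skipped.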

Parquet File Format
①Based on Google Dremel Paper
②Implemented by Twitter and Cloudera
③Columnar storage format
④Optimized for fast columnar aggregations
⑤Tight compression
⑥Supports pushdowns
⑦Nested, self-describing, evolving schema

Types of Compression
①Run Length Encoding
Repeated data
②Dictionary Encoding
Fixed set of values
③Delta, Prefix Encoding
Sorted dataset
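The three encodings above can be sketched in a few lines each. These are toy versions to show why each one suits the data shape named on the slide. They are not the encoders Parquet ships, and the sample values are made up.

```python
def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]


def dictionary_encode(values):
    """Replace each value with a small integer code into a dictionary."""
    dictionary, codes = {}, []
    for v in values:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return list(dictionary), codes


def delta_encode(sorted_values):
    """Store the first value, then only the gaps between neighbors."""
    if not sorted_values:
        return []
    gaps = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return [sorted_values[0]] + gaps


rle = run_length_encode(["a", "a", "a", "b", "b", "a"])
dictionary, codes = dictionary_encode(["US", "UK", "US", "DE", "UK"])
deltas = delta_encode([1000, 1003, 1004, 1010])
```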

Types of Pushdowns
①Column, Partition Pruning
②Row, Predicate Filtering
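A toy illustration of what the two kinds of pushdown save, assuming a table partitioned by date. The table contents, column names, and the `scan` helper are all hypothetical. In Spark the data source does this work when a DataFrame query selects columns and filters rows.

```python
# Hypothetical table: partition key (date) -> rows stored in that partition.
PARTITIONS = {
    "2015-06-01": [
        {"user": "a", "likes": 3, "age": 30},
        {"user": "b", "likes": 9, "age": 41},
    ],
    "2015-06-02": [
        {"user": "c", "likes": 7, "age": 25},
        {"user": "d", "likes": 1, "age": 38},
    ],
}


def scan(partitions, partition_keys, columns, predicate):
    """Read only the requested partitions, rows, and columns."""
    rows_read, out = 0, []
    for key in partition_keys:                  # partition pruning
        for row in partitions.get(key, []):
            rows_read += 1
            if predicate(row):                  # row (predicate) filtering
                out.append({c: row[c] for c in columns})   # column pruning
    return out, rows_read


result, rows_read = scan(
    PARTITIONS,
    partition_keys=["2015-06-02"],
    columns=["user", "likes"],
    predicate=lambda r: r["likes"] > 5,
)
```

Only 2 of the 4 stored rows are touched, and the `age` column never leaves the source.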

Direct Kafka Streaming (KafkaRDD)
① No single Receiver, no Write Ahead Log (WAL)
② Workers pull from Kafka in parallel
③ Each KafkaRDD partition stores relevant offsets
④ Upon Worker Node failure, rebuild from offsets
⑤ Optimizes happy path by avoiding the WAL
At-least-once delivery guarantee
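A small simulation of the offset-based recovery described above, with an in-memory dict standing in for a Kafka topic. `OffsetRange`, `plan_batch`, and `compute` are illustrative names for this sketch, not the Spark Streaming API.

```python
from collections import namedtuple

# Hypothetical stand-in for a Kafka topic: partition id -> ordered message log.
LOG = {0: ["a", "b", "c", "d"], 1: ["e", "f", "g"]}

OffsetRange = namedtuple("OffsetRange", "partition from_offset until_offset")


def plan_batch(log, committed):
    """One offset range per Kafka partition, from the last processed offset to the log end."""
    return [
        OffsetRange(p, committed.get(p, 0), len(msgs))
        for p, msgs in sorted(log.items())
    ]


def compute(log, offset_range):
    """Any worker can (re)read a partition's data purely from its offsets."""
    msgs = log[offset_range.partition]
    return msgs[offset_range.from_offset:offset_range.until_offset]


batch = plan_batch(LOG, committed={0: 1})
first_attempt = [compute(LOG, r) for r in batch]

# Simulate losing the worker that held partition 1, then rebuilding from offsets.
rebuilt = compute(LOG, batch[1])
```

Because the offsets fully describe the batch, no receiver-side copy of the data is needed to recover. The replay is also why a message can be processed more than once.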

Count Min Sketch
① Approximate counters
② Far less memory than a HashMap of exact counts
③ Low, fixed memory
④ Known error bounds
⑤ Large num of counters
⑥ Available in Twitter’s Algebird
⑦ Streaming example in Spark codebase
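A minimal pure-Python Count-Min Sketch to make the properties above concrete. This is a sketch for illustration, not Algebird's implementation. The `width` and `depth` defaults and the MD5-based hashing are arbitrary choices made here.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters; estimates never undercount."""

    def __init__(self, width=1000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash function per row, derived by salting with the row number.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for _ in range(50):
    sketch.add("alice")
sketch.add("bob", 3)
```

Memory stays at `width * depth` counters no matter how many distinct items arrive. The price is that an estimate can exceed the true count when items collide.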

HyperLogLog
① Measures set cardinality
Approx count distinct
② Low memory
1.5KB @ 2% error for 10^9 elements!
③ From Twitter’s Algebird
④ Streaming example in Spark codebase
⑤ RDD: countApproxDistinctByKey()
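A compact HyperLogLog sketch in plain Python, assuming 2^10 registers and a 64-bit hash taken from SHA-1. Those parameters are chosen here for readability and give roughly 3% standard error, not the exact 1.5KB and 2% configuration quoted above. This is not Algebird's or Spark's implementation.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=10):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")           # 64-bit hash
        index = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)           # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)           # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:               # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(10000):
    hll.add(i)
estimate_once = hll.count()
for i in range(10000):                                  # duplicates change nothing
    hll.add(i)
estimate_twice = hll.count()
```

Each register only remembers a maximum, so re-adding an element can never change the estimate. That is what makes the structure a distinct counter.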

Types of Recommendations
①Non-personalized (2 out of 10)
Cold Start
No preference or behavior data for user, yet
②Personalized (8 out of 10)
User-Item Similarity
Items that others with similar prefs have liked
Item-Item Similarity

Audience Participation Needed!
①Navigate to sparkafterdark.com
②Click 3 actors and 3 actresses

Non-personalized Recommendations

Summary Statistics and Aggregations
①Top Users by Like Count
“I might like users with the highest sum aggregation of likes overall.”
SparkSQL + DataFrame: Aggregations
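The aggregation behind this recommendation is a group-by, count, and sort. A plain-Python sketch, with made-up like events standing in for the demo data:

```python
from collections import Counter

# Hypothetical (liker, liked) events.
like_events = [
    ("ali", "holden"),
    ("matei", "holden"),
    ("andy", "holden"),
    ("ali", "lisa"),
    ("patrick", "lisa"),
    ("andy", "kimberly"),
]


def top_users_by_like_count(events, n):
    """Count likes received per user and return the n most liked."""
    received = Counter(liked for _liker, liked in events)
    return received.most_common(n)


top = top_users_by_like_count(like_events, 2)
```

Every user gets the same list, which is why this works as a cold-start recommendation.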

Like Graph Analysis
②Top Influencers by Like Graph
“I might like users who have the highest probability of me liking them randomly while walking the like graph.”
GraphX: PageRank
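A minimal power-iteration PageRank over a like graph, to show the random-walk idea in the quote above. The edges, damping factor, and iteration count are assumptions for this sketch. GraphX's own PageRank differs in details such as normalization.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """edges: list of (liker, liked). Returns {user: rank}, ranks sum to 1."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Random jump: with probability (1 - damping) land on any node.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            targets = out_links[src] or nodes   # dangling nodes spread rank evenly
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical like graph: a, b and d all like c; c likes a back.
like_graph = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
ranks = pagerank(like_graph)
top_influencer = max(ranks, key=ranks.get)
```

User `a` ranks well despite a single incoming like, because that like comes from the most influential user.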

Demo!
Spark SQL + DataFrames + GraphX

Types of Similarity
①Euclidean: linear measure
Magnitude bias
②Cosine: angle measure
Adjust for magnitude bias
③Jaccard: Set intersection divided by union
Popularity bias
④Log Likelihood
Adjust for pop. bias
Ali Matei Reynold Patrick Andy
Kimberly 1 1 1 1
Leslie 1 1
Meredith 1 1 1
Lisa 1 1 1
Holden 1 1 1 1 1
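The first three measures can be written directly from their definitions. The sample vectors and like sets below are made up, since the cell positions of the table above did not survive extraction. Log likelihood is left out of this sketch.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Angle between vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Set intersection divided by set union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


light_rater, heavy_rater = [1, 1], [3, 3]       # same taste, different magnitude
distance = euclidean_distance(light_rater, heavy_rater)
cosine = cosine_similarity(light_rater, heavy_rater)
jaccard = jaccard_similarity({"Ali", "Matei"}, {"Matei", "Andy"})
```

The two raters point in the same direction, so cosine similarity is 1 even though their Euclidean distance is not 0. That is the magnitude bias the slide refers to.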

All-pairs Similarity Measure
①Compare everything to everything
②a.k.a. “pair-wise similarity” or “similarity join”
③Naïve shuffle: O(m*n^2); m=rows, n=cols
④Minimize shuffle: reduce data size & approx
Reduce m (rows)
Sampling and bucketing
Reduce n (cols)
Remove most frequent value (0?)

Sampling Algo: DIMSUM
①“Dimension Independent Matrix Square using MapReduce”
②Remove rows with low similarity probability
③MLlib: RowMatrix.columnSimilarities(…)
④Twitter: 40% efficiency gain over Cosine
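A single-machine sketch of the DIMSUM sampling scheme as it is usually described: a column pair is sampled with probability min(1, gamma / (|c_j| * |c_k|)) and the sum is rescaled so the expected value is the cosine similarity. This is an illustration under that reading, not MLlib's code. `gamma` here plays the role of the oversampling parameter.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def dimsum_column_similarities(rows, gamma, rng=random):
    """rows: sparse rows as {column: nonzero value}.

    Returns {(j, k): estimated cosine similarity} for columns j < k.
    """
    squared = defaultdict(float)
    for row in rows:
        for j, v in row.items():
            squared[j] += v * v
    norms = {j: math.sqrt(s) for j, s in squared.items()}

    sums = defaultdict(float)
    for row in rows:                                    # "map" side
        for j, k in combinations(sorted(row), 2):
            keep_prob = min(1.0, gamma / (norms[j] * norms[k]))
            if rng.random() < keep_prob:                # skip most heavy-column pairs
                sums[(j, k)] += row[j] * row[k]

    return {                                            # "reduce" side: rescale
        (j, k): s / min(gamma, norms[j] * norms[k])
        for (j, k), s in sums.items()
    }


# Hypothetical 3 x 3 like matrix stored as sparse rows.
like_rows = [{0: 1, 1: 1}, {0: 1, 1: 1, 2: 1}, {1: 1, 2: 1}]
exact = dimsum_column_similarities(like_rows, gamma=1e9)    # huge gamma: no sampling
approx = dimsum_column_similarities(like_rows, gamma=1.5, rng=random.Random(7))
```

A very large `gamma` keeps every pair and reproduces exact cosine similarity. Smaller values trade accuracy for less shuffled data.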

Bucket Algo: Locality Sensitive Hashing
① Split into b buckets using similarity hash algo
Requires pre-processing of data
② Compare bucket contents in parallel
③ Converts O(m*n^2) -> O(m*n/b*b^2);
m=rows, n=cols, b=buckets
④ Example: 500k x 500k matrix
O(1.25E17) -> O(1.25E13); b=50
⑤ github.com/mrsqueeze/spark-hash
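One common way to build the similarity hash above is MinHash with banding. The sketch below assumes that technique. It is not taken from the spark-hash project, and the profile data, hash count, and band count are made up.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(items, num_hashes):
    """For each salted hash function, keep the minimum hash over the set."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16) for x in items)
        for seed in range(num_hashes)
    )


def candidate_pairs(sets_by_id, num_hashes=12, bands=4):
    """Bucket ids by signature band; only ids sharing a bucket get compared."""
    rows_per_band = num_hashes // bands
    buckets = defaultdict(set)
    for key, items in sets_by_id.items():
        signature = minhash_signature(items, num_hashes)
        for band in range(bands):
            chunk = signature[band * rows_per_band:(band + 1) * rows_per_band]
            buckets[(band, chunk)].add(key)

    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical profiles: u1 and u2 like the same people, u3 shares nothing.
profiles = {
    "u1": {"a", "b", "c", "d"},
    "u2": {"a", "b", "c", "d"},
    "u3": {"x", "y", "z"},
}
pairs = candidate_pairs(profiles)
```

Only pairs that share a bucket are compared afterwards, and each bucket can be processed in parallel.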

MLlib: SparseVector vs. DenseVector
① Remove columns using sparse vectors
② Converts O(m*n^2) -> O(m*nnz^2);
nnz=num nonzeros, nnz << n
Tip: Choose most frequent value … may not be 0
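A small sketch of why sparse storage cuts the work to the nonzeros, using a dict of index to value as a stand-in for MLlib's SparseVector. The helper names are illustrative.

```python
def to_sparse(dense):
    """Keep only the nonzero entries as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}


def sparse_dot(a, b):
    """Dot product that touches only nonzero entries of the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)


left = to_sparse([0, 0, 3, 0, 1])
right = to_sparse([1, 0, 2, 0, 0])
product = sparse_dot(left, right)
```

The loop runs over nnz entries rather than all n columns. Treating a different most-frequent value as the implicit default would need an offset correction that this sketch omits.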

Personalized Recommendations

Personalized Recommendation Terms
①User
User seeking likeable recommendations
②Item
User who has been liked
*Also a user seeking likeable recommendations!
③Types of Feedback
Explicit: rating, like
Implicit: search, click, hover, view, scroll

Collaborative Filtering Personalized Recs
③Like behavior of similar users
“I like the same people that you like.
What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
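A neighborhood-based sketch of the quoted idea: find the users whose likes overlap most with mine, then surface what they liked that I have not seen. This swaps in simple Jaccard-weighted neighbors instead of matrix factorization, and the like sets are made up.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def recommend(likes_by_user, user, k=1):
    """Recommend unseen items liked by the k most similar users."""
    mine = likes_by_user[user]
    neighbors = sorted(
        (u for u in likes_by_user if u != user),
        key=lambda u: jaccard(mine, likes_by_user[u]),
        reverse=True,
    )[:k]

    scores = {}
    for neighbor in neighbors:
        weight = jaccard(mine, likes_by_user[neighbor])
        for item in likes_by_user[neighbor] - mine:     # only what I have not seen
            scores[item] = scores.get(item, 0.0) + weight
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical like sets: "you" shares three likes with "me".
likes = {
    "me": {"p1", "p2", "p3"},
    "you": {"p1", "p2", "p3", "p4"},
    "other": {"p9"},
}
recs = recommend(likes, "me")
```

Matrix factorization reaches a similar result by learning low-dimensional user and item vectors, which scales better than comparing like sets directly.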

Text-based Personalized Recs
④Similar profiles to each other
“Our profiles have similar, unique k-skip n-grams.
We might like each other.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
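A plain-Python sketch of TF-IDF document similarity between profiles. It covers only the TF/IDF part and leaves Word2Vec and n-grams aside. The profile texts are made up, whitespace tokenization is a simplification, and the smoothed IDF formula log((n + 1) / (df + 1)) is an assumption of this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * math.log((n + 1) / (doc_freq[term] + 1))
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


profiles = [
    "loves hiking and spark",
    "loves hiking and scala",
    "loves opera and wine",
]
vectors = tfidf_vectors(profiles)
similar = cosine(vectors[0], vectors[1])
dissimilar = cosine(vectors[0], vectors[2])
```

Words that every profile uses, such as "loves" and "and", get zero weight. Only the distinctive shared term "hiking" pulls the first two profiles together.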

More Text-based Personalized Recs
⑤Similar profiles from my past likes
“Your profile shares a similar feature vector space to others that I’ve liked. I might like you.”
MLlib: Word2Vec, TF/IDF, Doc Similarity

More Text-based Personalized Recs
⑥Relevant, High-Value Emails
“Your initial email has similar named entities to my profile.
I might like you just for making the effort.”
MLlib: Word2Vec, TF/IDF, Entity Recognition
[Figure labels: Her Email, My Profile]

Personalized Recommendations: The Future

Facial Recognition
⑦Eigenfaces
“Your face looks similar to others that I’ve liked.
I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Conversation Starter Bot
⑧NLP and DecisionTrees
“If your responses to my trite opening lines are positive,
I might actually read your profile.”
MLlib: TF/IDF, DecisionTree,
Sentiment Analysis
Positive response ->
<- Negative response
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Compromise Recommendations (Couples)
⑨Pathway of Similarity
“I want Mad Max. You want Message In a Bottle.
Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path
similar plots -> <- similar actors
… …
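The "pathway of similarity" can be sketched as a shortest path through an item-item similarity graph. The graph below is hypothetical. `movie_a` and `movie_b` are placeholders for whatever titles sit between the two picks, and breadth-first search stands in for GraphX's shortest-path routine.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical similarity graph: an edge means "similar plot or similar actors".
similar_to = {
    "Mad Max": ["movie_a"],
    "movie_a": ["Mad Max", "movie_b"],
    "movie_b": ["movie_a", "Message In a Bottle"],
    "Message In a Bottle": ["movie_b"],
}
path = shortest_path(similar_to, "Mad Max", "Message In a Bottle")
compromises = path[1:-1]        # everything strictly between the two picks
```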

⑩ The Final Recommendation

⑩ Get Off The Computer and Meet People!
linkedin.com/in/cfregly
github.com/cfregly
chris@fregly.com
@cfregly
Thank you!
Image courtesy of http://www.duchess-france.org/
Free trial at databricks.com
Try Databricks!!
