CASSANDRA DATA
MAINTENANCE WITH SPARK
Operate on your Data
WHAT IS SPARK?
A large-scale data processing framework
STEP 1:
Make Fake Data
(unless you have a million records to spare)
import com.datastax.spark.connector._

// Build one fake row as a tuple: (id, epoch-millis timestamp, token string, session string)
def create_fake_record( num: Int ) = {
  (num,
   1453389992000L + num,
   s"My Token $num",
   s"My Session Data$num")
}

sc.parallelize(1 to 1000000)
  .map( create_fake_record )
  // (keyspace, table, partitions per host): co-locate data with the replicas that own it
  .repartitionByCassandraReplica("maintdemo", "user_visits", 10)
  // write into the same keyspace/table the repartition targeted
  .saveToCassandra("maintdemo", "user_visits")
THREE BASIC PATTERNS
• Read - Transform - Write (1:1) - .map()
• Read - Transform - Write (1:m) - .flatMap()
• Read - Filter - Delete (m:1) - it’s complicated (see the following slides)
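A minimal sketch of the first two patterns, assuming hypothetical maintdemo.events / events_clean / event_parts tables with id and payload columns; the delete pattern is covered on the next slides:

import com.datastax.spark.connector._

// 1:1 - read, transform each row, write the same number of rows back
sc.cassandraTable("maintdemo", "events")
  .map(row => (row.getInt("id"), row.getString("payload").trim))
  .saveToCassandra("maintdemo", "events_clean", SomeColumns("id", "payload"))

// 1:m - read, expand each row into several, write to a destination table
sc.cassandraTable("maintdemo", "events")
  .flatMap(row => row.getString("payload").split(",").map(part => (row.getInt("id"), part)))
  .saveToCassandra("maintdemo", "event_parts", SomeColumns("id", "part"))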
DELETES ARE TRICKY
• Keep tombstones in mind
• Select the records you want to delete, then loop over them and issue deletes through the driver (see the sketch below)
• OR select the records you want to keep, rewrite them, then delete the partitions they lived in with a write timestamp IN THE PAST, so the rewritten rows survive
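A minimal sketch of the first approach (select, then delete through the driver), assuming the maintdemo.user_visits table from the earlier example and an assumed 30-day retention cutoff:

import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector

val connector = CassandraConnector(sc.getConf)
val cutoff = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000  // assumed retention window

// Select only the rows to delete (the filter is pushed down to Cassandra),
// then issue point deletes through the driver, one session per Spark partition
sc.cassandraTable("maintdemo", "user_visits")
  .where("last_access < ?", cutoff)
  .foreachPartition { rows =>
    connector.withSessionDo { session =>
      val stmt = session.prepare(
        "DELETE FROM maintdemo.user_visits WHERE userid = ? AND last_access = ?")
      rows.foreach { row =>
        session.execute(stmt.bind(
          row.getInt("userid"): java.lang.Integer,
          row.getLong("last_access"): java.lang.Long))
      }
    }
  }

Each delete still writes a tombstone, so spread passes like this out over time rather than removing millions of rows at once.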
DELETING
PREDICATE PUSHDOWN
• Use Cassandra-level filtering at every opportunity (the connector's .where() pushes predicates down to Cassandra; see below)
• With DSE Search, benefit from predicate pushdown to solr_query
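A hedged sketch reusing maintdemo.user_visits; the second query assumes a DSE Search index on the table (so the solr_query column exists), and the column names in the query string are assumptions:

import com.datastax.spark.connector._

// Filter on the Cassandra side instead of pulling everything into Spark
val recent = sc.cassandraTable("maintdemo", "user_visits")
  .select("userid", "last_access")
  .where("last_access > ?", 1453389992000L)

// With DSE Search, the same .where() can carry a solr_query predicate
val tokenHits = sc.cassandraTable("maintdemo", "user_visits")
  .where("solr_query = ?", "session_data:*token*")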
GOTCHAS
• Null fields (read nullable columns defensively; see below)
• Writing jobs which aren’t, or can’t be, distributed
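For the null-field gotcha, one defensive pattern (the column name here is assumed) is to read nullable columns as Option rather than letting a null blow up a typed getter:

import com.datastax.spark.connector._

// Read a possibly-null column as Option[String] instead of calling getString on it
sc.cassandraTable("maintdemo", "user_visits")
  .map(row => row.get[Option[String]]("session_data").getOrElse("<no session>"))
  .take(5)
  .foreach(println)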
TIPS & TRICKS
• .spanBy( partition key ) - work on one Cassandra partition at a time
• .repartitionByCassandraReplica() - co-locate Spark work with the replicas that own the data
• Tune spark.cassandra.output.throughput_mb_per_sec to throttle writes (see the sketch below)
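A sketch putting the write throttle and .spanBy() together; the connection host and the 5 MB/s figure are assumptions for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Cap per-core write throughput so a maintenance job doesn't starve live traffic
val conf = new SparkConf()
  .setAppName("maintenance-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.output.throughput_mb_per_sec", "5")
val sc = new SparkContext(conf)

// Process one Cassandra partition at a time
sc.cassandraTable("maintdemo", "user_visits")
  .spanBy(row => row.getInt("userid"))
  .map { case (userid, rows) => (userid, rows.size) }
  .take(10)
  .foreach(println)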
USE CASE: CACHE MAINTENANCE
USE CASE: TRIM USER HISTORY
• Cassandra Data Model: PRIMARY KEY( userid, last_access )
• Keep the last X records per user
• .spanBy( partition key ), then flatMap, filtering each partition's Seq down to the rows to remove (see below)
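A hedged sketch of the trim, assuming the maintdemo.user_visits table above, X = 10, and a newer connector (2.0+) for deleteFromCassandra; older versions can loop through the driver as in the earlier delete sketch:

import com.datastax.spark.connector._

val keepLast = 10   // X: number of most-recent records to keep per user

// Group each Cassandra partition's rows, drop the newest X, keep the rest as delete keys
val toDelete = sc.cassandraTable("maintdemo", "user_visits")
  .spanBy(row => row.getInt("userid"))
  .flatMap { case (_, rows) =>
    rows.toSeq.sortBy(_.getLong("last_access")).dropRight(keepLast)
  }
  .map(row => (row.getInt("userid"), row.getLong("last_access")))

// Connector 2.0+ can delete full rows by primary key directly from an RDD of key tuples
toDelete.deleteFromCassandra("maintdemo", "user_visits")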
USE CASE: PUBLISH DATA
• Cassandra Data Model: a publish_date field
• Filter by date, map to a new RDD matching the destination table, saveToCassandra() (see below)
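A minimal sketch following the filter-map-save recipe; the staging/published table names and their columns are assumptions:

import com.datastax.spark.connector._

val now = System.currentTimeMillis()

// Everything whose publish_date has passed gets copied to the destination table
sc.cassandraTable("maintdemo", "articles_staging")
  .where("publish_date <= ?", now)
  .map(row => (row.getInt("article_id"),
               row.getLong("publish_date"),
               row.getString("body")))
  .saveToCassandra("maintdemo", "articles_published",
    SomeColumns("article_id", "publish_date", "body"))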
USE CASE: MULTITENANT BACKUP AND RECOVERY
• Cassandra Data Model: PRIMARY KEY((tenant_id, other_partition_key), other_cluster, …)
• Backup: filter for tenant_id and .foreach() write to an external location (see the sketch below)
• Recovery: read the backup and upsert
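A hedged sketch; the tenant id, table, column names, and HDFS path are all assumptions, and saveAsTextFile stands in for whatever external location the backup actually targets:

import com.datastax.spark.connector._

val tenant = "acme"                                       // assumed tenant id
val backupPath = s"hdfs:///backups/$tenant/tenant_data"   // assumed external location

// Backup: select only this tenant's partitions and write each row out
sc.cassandraTable("maintdemo", "tenant_data")
  .where("tenant_id = ?", tenant)
  .map(row => row.columnValues.mkString("\t"))            // crude serialization for illustration
  .saveAsTextFile(backupPath)

// Recovery: read the backup, rebuild tuples in primary-key order, and upsert
sc.textFile(backupPath)
  .map(_.split("\t"))
  .map(f => (f(0), f(1), f(2), f(3)))                     // tenant_id, other_partition_key, other_cluster, payload
  .saveToCassandra("maintdemo", "tenant_data",
    SomeColumns("tenant_id", "other_partition_key", "other_cluster", "payload"))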
