CASSANDRA DATA
MAINTENANCE WITH SPARK
Operate on your Data
WHAT IS SPARK?
A large-scale data processing framework
STEP 1:
Make Fake Data
(unless you have a million records to spare)
import com.datastax.spark.connector._

// Build one fake row as a tuple: (id, epoch-millis timestamp, token string, session string)
def create_fake_record( num: Int ) = {
  (num,
   1453389992000L + num,
   s"My Token $num",
   s"My Session Data$num")
}

sc.parallelize(1 to 1000000)
  .map( create_fake_record )
  // (keyspace, table, partitions per host): co-locate data with the replicas that own it
  .repartitionByCassandraReplica("maintdemo", "user_visits", 10)
  // write into the same keyspace/table the repartition targeted
  .saveToCassandra("maintdemo", "user_visits")
THREE BASIC PATTERNS
• Read - Transform - Write (1:1) - .map()
• Read - Transform - Write (1:m) - .flatMap()
• Read - Filter - Delete (m:1) - it’s complicated (see the following slides)
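A minimal sketch of the first two patterns, assuming hypothetical maintdemo.events / events_clean / event_parts tables with id and payload columns; the delete pattern is covered on the next slides:

import com.datastax.spark.connector._

// 1:1 - read, transform each row, write the same number of rows back
sc.cassandraTable("maintdemo", "events")
  .map(row => (row.getInt("id"), row.getString("payload").trim))
  .saveToCassandra("maintdemo", "events_clean", SomeColumns("id", "payload"))

// 1:m - read, expand each row into several, write to a destination table
sc.cassandraTable("maintdemo", "events")
  .flatMap(row => row.getString("payload").split(",").map(part => (row.getInt("id"), part)))
  .saveToCassandra("maintdemo", "event_parts", SomeColumns("id", "part"))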
DELETES ARE TRICKY
• Keep tombstones in mind
• Select the records you want to delete, then loop over them and issue deletes through the driver (see the sketch below)
• OR select the records you want to keep, rewrite them, then delete the partitions they lived in with a write timestamp IN THE PAST, so the rewritten rows survive
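A minimal sketch of the first approach (select, then delete through the driver), assuming the maintdemo.user_visits table from the earlier example and an assumed 30-day retention cutoff:

import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector

val connector = CassandraConnector(sc.getConf)
val cutoff = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000  // assumed retention window

// Select only the rows to delete (the filter is pushed down to Cassandra),
// then issue point deletes through the driver, one session per Spark partition
sc.cassandraTable("maintdemo", "user_visits")
  .where("last_access < ?", cutoff)
  .foreachPartition { rows =>
    connector.withSessionDo { session =>
      val stmt = session.prepare(
        "DELETE FROM maintdemo.user_visits WHERE userid = ? AND last_access = ?")
      rows.foreach { row =>
        session.execute(stmt.bind(
          row.getInt("userid"): java.lang.Integer,
          row.getLong("last_access"): java.lang.Long))
      }
    }
  }

Each delete still writes a tombstone, so spread passes like this out over time rather than removing millions of rows at once.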
DELETING
PREDICATE PUSHDOWN
• Use Cassandra-level filtering at every opportunity (the connector's .where() pushes predicates down to Cassandra; see below)
• With DSE Search, benefit from predicate pushdown to solr_query
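A hedged sketch reusing maintdemo.user_visits; the second query assumes a DSE Search index on the table (so the solr_query column exists), and the column names in the query string are assumptions:

import com.datastax.spark.connector._

// Filter on the Cassandra side instead of pulling everything into Spark
val recent = sc.cassandraTable("maintdemo", "user_visits")
  .select("userid", "last_access")
  .where("last_access > ?", 1453389992000L)

// With DSE Search, the same .where() can carry a solr_query predicate
val tokenHits = sc.cassandraTable("maintdemo", "user_visits")
  .where("solr_query = ?", "session_data:*token*")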
GOTCHAS
• Null fields (read nullable columns defensively; see below)
• Writing jobs which aren’t, or can’t be, distributed
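For the null-field gotcha, one defensive pattern (the column name here is assumed) is to read nullable columns as Option rather than letting a null blow up a typed getter:

import com.datastax.spark.connector._

// Read a possibly-null column as Option[String] instead of calling getString on it
sc.cassandraTable("maintdemo", "user_visits")
  .map(row => row.get[Option[String]]("session_data").getOrElse("<no session>"))
  .take(5)
  .foreach(println)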
TIPS & TRICKS
• .spanBy( partition key ) - work on one Cassandra partition at a time
• .repartitionByCassandraReplica() - co-locate Spark work with the replicas that own the data
• Tune spark.cassandra.output.throughput_mb_per_sec to throttle writes (see the sketch below)
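A sketch putting the write throttle and .spanBy() together; the connection host and the 5 MB/s figure are assumptions for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Cap per-core write throughput so a maintenance job doesn't starve live traffic
val conf = new SparkConf()
  .setAppName("maintenance-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.output.throughput_mb_per_sec", "5")
val sc = new SparkContext(conf)

// Process one Cassandra partition at a time
sc.cassandraTable("maintdemo", "user_visits")
  .spanBy(row => row.getInt("userid"))
  .map { case (userid, rows) => (userid, rows.size) }
  .take(10)
  .foreach(println)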
USE CASE: CACHE MAINTENANCE
USE CASE: TRIM USER HISTORY
• Cassandra Data Model: PRIMARY KEY( userid, last_access )
• Keep the last X records per user
• .spanBy( partition key ), then flatMap, filtering each partition's Seq down to the rows to remove (see below)
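A hedged sketch of the trim, assuming the maintdemo.user_visits table above, X = 10, and a newer connector (2.0+) for deleteFromCassandra; older versions can loop through the driver as in the earlier delete sketch:

import com.datastax.spark.connector._

val keepLast = 10   // X: number of most-recent records to keep per user

// Group each Cassandra partition's rows, drop the newest X, keep the rest as delete keys
val toDelete = sc.cassandraTable("maintdemo", "user_visits")
  .spanBy(row => row.getInt("userid"))
  .flatMap { case (_, rows) =>
    rows.toSeq.sortBy(_.getLong("last_access")).dropRight(keepLast)
  }
  .map(row => (row.getInt("userid"), row.getLong("last_access")))

// Connector 2.0+ can delete full rows by primary key directly from an RDD of key tuples
toDelete.deleteFromCassandra("maintdemo", "user_visits")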
USE CASE: PUBLISH DATA
• Cassandra Data Model: a publish_date field
• Filter by date, map to a new RDD matching the destination table, saveToCassandra() (see below)
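A minimal sketch following the filter-map-save recipe; the staging/published table names and their columns are assumptions:

import com.datastax.spark.connector._

val now = System.currentTimeMillis()

// Everything whose publish_date has passed gets copied to the destination table
sc.cassandraTable("maintdemo", "articles_staging")
  .where("publish_date <= ?", now)
  .map(row => (row.getInt("article_id"),
               row.getLong("publish_date"),
               row.getString("body")))
  .saveToCassandra("maintdemo", "articles_published",
    SomeColumns("article_id", "publish_date", "body"))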
USE CASE: MULTITENANT BACKUP AND RECOVERY
• Cassandra Data Model: PRIMARY KEY((tenant_id, other_partition_key), other_cluster, …)
• Backup: filter for tenant_id and .foreach() write to an external location (see the sketch below)
• Recovery: read the backup and upsert
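A hedged sketch; the tenant id, table, column names, and HDFS path are all assumptions, and saveAsTextFile stands in for whatever external location the backup actually targets:

import com.datastax.spark.connector._

val tenant = "acme"                                       // assumed tenant id
val backupPath = s"hdfs:///backups/$tenant/tenant_data"   // assumed external location

// Backup: select only this tenant's partitions and write each row out
sc.cassandraTable("maintdemo", "tenant_data")
  .where("tenant_id = ?", tenant)
  .map(row => row.columnValues.mkString("\t"))            // crude serialization for illustration
  .saveAsTextFile(backupPath)

// Recovery: read the backup, rebuild tuples in primary-key order, and upsert
sc.textFile(backupPath)
  .map(_.split("\t"))
  .map(f => (f(0), f(1), f(2), f(3)))                     // tenant_id, other_partition_key, other_cluster, payload
  .saveToCassandra("maintdemo", "tenant_data",
    SomeColumns("tenant_id", "other_partition_key", "other_cluster", "payload"))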
