HBaseCon East 2016
HBase and Spark, state of the art
• Java Message Service => JMS
• Solutions Architect at Cloudera
• A bit of everything…
• Development
• Team/Project manager
• Architect
• O'Reilly author of Architecting HBase Applications
• International
• Worked from Paris to Los Angeles
• More than 100 flights per year
• HBase (and others) contributor
About Jean-Marc Spaggiari
Overview
• Where we came from
• Examples of code
• Improvements that are coming up
• Spark Streaming Use case
Source of Demand
• Demand started in the field
• Use Cases
• API access: Gets, Puts, Scans
• MapReduce: mass scans
• MapReduce: bulk load
• MapReduce: smart gets and puts
• Spark has all but killed MapReduce
• Spark Streaming has grown in popularity
• Populating Aggregates
• Entity-centric time series data store, e.g. OpenTSDB
• Lookups for joins or mutations
How it Started
• Started on GitHub
• Andrew Purtell started the effort to put into HBase
• Big call out to Sean B, Jon H, Ted Y, Ted M and Matteo B
• Components
• Normal Spark
• Spark Streaming
• Bulk Load
• SparkSQL
Under the covers
[Architecture diagram] The Driver holds the configs; each Worker Node runs an Executor whose static space holds the configs and a single shared HConnection, reused across all of that executor's Tasks.
Key Addition: HBaseContext
Create an HBaseContext:
// A Hadoop/HBase Configuration object
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
// sc is the Spark Context; hbase context corresponds to an HBase Connection
val hbaseContext = new HBaseContext(sc, conf)
• Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget)
• BulkDelete
• Most available in both Java and Scala
Operations on the HBaseContext
Foreach
Process each partition in parallel, with a shared HBase Connection available in the closure:
rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
// do something
val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
it.foreach(r => {
... // HBase API put/incr/append/cas calls
})
bufferedMutator.flush()
bufferedMutator.close()
})
Foreach
The same pattern from Java, processing each partition with a shared Connection:
hbaseContext.foreachPartition(keyValuesPuts,
new VoidFunction<Tuple2<Iterator<Put>, Connection>>() {
@Override
public void call(Tuple2<Iterator<Put>, Connection> t) throws Exception {
BufferedMutator mutator = t._2().getBufferedMutator(TABLE_NAME);
while (t._1().hasNext()) {
... // HBase API put/incr/append/cas calls
}
mutator.flush();
mutator.close();
} });
Map
Map a dataset in parallel, partition by partition, to produce a new RDD:
val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
val table = conn.getTable(TableName.valueOf("t1"))
var res = mutable.MutableList[String]()
it.map( r => {
... // HBase API Scan Results
})
})
BulkLoad
Bulk load a data set into HBase (works for all cases, but best suited to wide tables; Scala only):
rdd.hbaseBulkLoad(hbaseContext, tableName,
t => {
val rowKey = t._1
val fam:Array[Byte] = t._2._1
val qual = t._2._2
val value = t._2._3
val keyFamilyQualifier = new KeyFamilyQualifier(rowKey, fam, qual)
Seq((keyFamilyQualifier, value)).iterator
}, stagingFolder)
val load = new LoadIncrementalHFiles(config)
load.run(Array(stagingFolder, tableNameString))
BulkLoadThinRows
Bulk load a data set into HBase (for skinny tables, <10k cols)
hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte],
Array[Byte])])] (rdd, TableName.valueOf(tableName), t => {
val rowKey = Bytes.toBytes(t._1)
val familyQualifiersValues = new FamiliesQualifiersValues
t._2.foreach(f => {
val family:Array[Byte] = f._1
val qualifier = f._2
val value:Array[Byte] = f._3
familyQualifiersValues += (family, qualifier, value)
})
(new ByteArrayWrapper(rowKey), familyQualifiersValues)
}, stagingFolder.getPath)
BulkPut
Parallelized HBase multiput:
hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte],
Array[Byte])])](rdd, tableName, (putRecord) => {
val put = new Put(putRecord._1)
putRecord._2.foreach((putValue) =>
put.add(putValue._1, putValue._2, putValue._3))
put
})
BulkPut
Parallelized HBase multiput:
hbaseContext.bulkPut(textFile, TABLE_NAME, new Function<String, Put>() {
@Override
public Put call(String v1) throws Exception {
String[] tokens = v1.split("\\|"); // escaped: split() takes a regex
Put put = new Put(Bytes.toBytes(tokens[0]));
put.addColumn(Bytes.toBytes("segment"),
Bytes.toBytes(tokens[1]),
Bytes.toBytes(tokens[2]));
return put;
}
});
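Because `String.split` takes a regular expression, the pipe delimiter above must be escaped. A minimal standalone check (`PipeSplit` is a hypothetical helper for illustration, not part of the hbase-spark API):

```scala
object PipeSplit {
  // Hypothetical helper mirroring the slide's tokenizer, with the regex escaped.
  // A bare "|" is an empty regex alternation, which splits between every character.
  def tokenize(line: String): Array[String] = line.split("\\|")

  def main(args: Array[String]): Unit = {
    println(tokenize("row1|segment1|value1").mkString(","))  // row1,segment1,value1
    println("a|b".split("|").mkString(","))  // unescaped: splits between every character
  }
}
```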
BulkDelete
Parallelized HBase multi-deletes:
hbaseContext.bulkDelete[Array[Byte]](rdd, tableName,
putRecord => new Delete(putRecord),
4) // batch size
rdd.hbaseBulkDelete(hbaseContext, tableName,
putRecord => new Delete(putRecord),
4) // batch size
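BulkGet appears in the operations list but has no slide of its own. As a sketch (argument order per the hbase-spark `HBaseContext.bulkGet` of this era; treat the details as indicative), keys from an RDD are grouped into multigets and each `Result` is converted into an element of the new RDD:

```scala
val getRdd = hbaseContext.bulkGet[Array[Byte], String](
  TableName.valueOf("t1"),
  2,                          // batch size per multiget
  rdd,                        // RDD of row keys
  record => new Get(record),  // build a Get per record
  (result: Result) => Bytes.toString(result.getRow))  // convert each Result
```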
Table RDD
How to materialize a table as a Spark RDD.
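The code for this slide was an image in the original deck. As a sketch of the idea (method name per the hbase-spark module of this era; treat the details as indicative), a table name plus a `Scan` yields an RDD of (row key, `Result`) pairs, roughly one partition per region:

```scala
val scan = new Scan()
scan.setCaching(100)  // rows fetched per RPC
// RDD[(ImmutableBytesWritable, Result)]
val tableRdd = hbaseContext.hbaseRDD(TableName.valueOf("t1"), scan)
tableRdd.map(r => Bytes.toString(r._2.getRow))
```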
What Improvements Have We Made?
▪ Combine Spark and HBase
• Spark Catalyst Engine for Query Plan and Optimization
• HBase for Fast Access KV Store
• Implement Standard External Data Source with Built-in Filter
▪ High Performance
• Data Locality: Move Computation to Data
• Partition Pruning: Tasks Run Only on the RegionServers Holding the Requested Data
• Column Pruning / Predicate Pushdown: Reduce Network Overhead
▪ Full Fledged DataFrame Support
• Spark-SQL
• Integrated Language Query
▪ Run on Top of Existing HBase Table
• Native Support for Java Primitive Types
▪ Still some work and improvements to be done
• HBASE-16638 Reduce the number of Connections
• HBASE-14217 Add Java access to Spark bulk load functionality
DataFrame + HBase
WIP... 2.0?
Usage - Define the Catalog
def catalog = s"""{
|"table":{"namespace":"default", "name":"table1"},
|"rowkey":"key",
|"columns":{
|"col0":{"cf":"rowkey", "col":"key", "type":"string"},
|"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
|"col2":{"cf":"cf1", "col":"col2", "type":"string"}
|}
|}""".stripMargin
HBase table to DataFrame catalog mapping:
Usage – Write to HBase
sc.parallelize(data)
.toDF
.write
.options(Map(HBaseTableCatalog.tableCatalog -> catalog,
HBaseTableCatalog.newTable -> "5"))
.format("org.apache.spark.sql.execution.datasources.hbase")
.save()
Export an RDD into a new HBase table with the DataFrame API
Usage – Construct DataFrame
def withCatalog(cat: String): DataFrame = {
sqlContext
.read
.options(Map(HBaseTableCatalog.tableCatalog -> cat))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()
}
Read an existing HBase table into a DataFrame
Usage - Language Integrated Query/SQL
val df = withCatalog(catalog)
val s = df.filter((($"col0" <= "row050" && $"col0" > "row040") ||
$"col0" === "row005" ||
$"col0" === "row020" ||
$"col0" === "r20" ||
$"col0" <= "row005") &&
($"col2" === 1 ||
$"col2" === 42))
.select("col0", "col1", "col2")
s.show
Filter and project the DataFrame with language-integrated query
Usage - Language Integrated Query/SQL
// Load the dataframe
val df = withCatalog(catalog)
//SQL example
df.registerTempTable("table")
sqlContext.sql("select count(col1) from table").show
Register a temporary table and query it with SQL
Spark Streaming Example
[Pipeline diagram] Kafka producer → Spark Streaming → HBase and Solr
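As a hedged sketch of the pipeline above (the stream source and the field layout are made up for illustration), a DStream can be written to HBase with the same context via `streamBulkPut`:

```scala
val ssc = new StreamingContext(sc, Seconds(1))
val lines: DStream[String] = ... // e.g. a Kafka direct stream
hbaseContext.streamBulkPut[String](lines,
  TableName.valueOf("t1"),
  line => {
    val cells = line.split("\\|")
    val put = new Put(Bytes.toBytes(cells(0)))
    put.addColumn(Bytes.toBytes("segment"),
      Bytes.toBytes(cells(1)), Bytes.toBytes(cells(2)))
    put
  })
ssc.start()
```

Each micro-batch is then written through the executor-local HConnection, the same way the batch bulkPut works.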
