Apache HBase
• Apache HBase is a non-relational (NoSQL) database.
• HBase was created for hosting very large tables with
billions of rows and millions of columns.
• Provides random, real-time data access.
• Allows table inserts, updates and deletes.
• Runs on top of the Hadoop Distributed File System (HDFS).
• HBase data is automatically replicated by HDFS for
higher availability.
HBase Architecture
• An HBase table is automatically distributed across a set of cluster
nodes to increase scalability and performance. HBase can scale
out to thousands of nodes. Each cluster node holds portions of
a table called regions. Each region contains a contiguous range of
table rows.
• Each region is managed by a RegionServer service. RegionServers
typically run on the same machines that run the HDFS
DataNode service.
• RegionServers are managed by the HMaster master service.
HMaster functions include:
Coordinating database metadata changes.
Monitoring the RegionServer nodes.
Orchestrating load balancing across RegionServer nodes.
Orchestrating recovery from failed RegionServer nodes.
• A ZooKeeper cluster handles configuration management. HBase
client programs communicate with ZooKeeper first to find the
RegionServer node that manages the data to be accessed.
• Clients access HBase through a Java API, a REST interface, a Thrift
gateway, or the HBase shell command-line interface.
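The routing of a row to its region can be sketched as a lookup over sorted region start keys. This is a toy model in plain Python, not the HBase client API: the start keys, the host names, and the `locate_region` helper are all hypothetical, and it assumes each region covers a contiguous, sorted range of row keys.

```python
from bisect import bisect_right

# Hypothetical layout: each region starts at a row key and runs up to the
# next region's start key. The empty string marks the start of the table.
REGION_START_KEYS = ["", "g", "n", "t"]
# Hypothetical RegionServer hosts; one server may manage several regions.
REGION_SERVERS = ["rs1.example.com", "rs2.example.com",
                  "rs1.example.com", "rs3.example.com"]

def locate_region(row_key):
    """Return (region index, RegionServer host) for the region covering row_key."""
    # The covering region is the last one whose start key is <= row_key.
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return index, REGION_SERVERS[index]

print(locate_region("apple"))   # (0, 'rs1.example.com')
print(locate_region("zebra"))   # (3, 'rs3.example.com')
```

In a real cluster the client obtains this mapping from the cluster itself rather than from a hard-coded list; the sketch only shows why a sorted key range makes the lookup cheap.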
HBase Architecture: Interaction Between Daemons
Key-Value Mappings
• HBase stores data as a map of keys to their values.
Key --> Value
If we know the key, we can retrieve the value.
• Keys are multi-part: (column family name, rowID, column
qualifier, timestamp) --> value
• Column family name – determines storage properties.
• All data in the same column family is stored together on
disk.
• rowID – used to access data and divide table data into
regions.
• Regions are maintained on separate RegionServer nodes.
• Column qualifier – the column name, which is just a label in
the multi-part key
• In any given row, one or more columns might or might not
exist.
• Timestamp – used to version the data and support data
updates.
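A minimal sketch of the multi-part key idea, using a plain Python dictionary. This is an illustration only, not the HBase API: the `put` and `get_latest` helpers, the sample row ID, and the integer timestamps are assumptions made for the example.

```python
# Toy in-memory model of a multi-part key --> value mapping.
cells = {}

def put(row_id, family, qualifier, timestamp, value):
    """Store one cell version under its full multi-part key."""
    cells[(row_id, family, qualifier, timestamp)] = value

def get_latest(row_id, family, qualifier):
    """Return the value with the highest timestamp, or None if the cell is absent."""
    versions = [
        (ts, value)
        for (r, f, q, ts), value in cells.items()
        if (r, f, q) == (row_id, family, qualifier)
    ]
    if not versions:
        return None
    return max(versions, key=lambda version: version[0])[1]

# An "update" is just a new version of the same cell with a later timestamp.
put("user1", "info", "email", 100, "old@example.com")
put("user1", "info", "email", 200, "new@example.com")

print(get_latest("user1", "info", "email"))  # new@example.com
print(get_latest("user1", "info", "phone"))  # None (this column does not exist in the row)
```

Both versions stay in the map; reading simply picks the newest one, which is how the timestamp part of the key supports updates in this model.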
Rows and Columns
• Rows and columns are implemented differently from those in most
relational databases.
• A multi-part key identifies a cell with a value.
• Because a table is just a set of key --> value mappings, a row is
nothing more than a logical collection of values.
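The idea that a row is only a logical grouping can be sketched by assembling rows from individual cells. The sample data and the `rows_from_cells` helper are hypothetical, and the `family:qualifier` naming of columns is used here only for readability.

```python
from collections import defaultdict

# Hypothetical cells: (row_id, family, qualifier, timestamp) --> value
cells = {
    ("user1", "info", "name", 100): "Ada",
    ("user1", "info", "email", 100): "ada@example.com",
    ("user1", "info", "email", 200): "ada@example.org",
    ("user2", "info", "name", 150): "Bob",
}

def rows_from_cells(cells):
    """Group cells by row ID, keeping only the newest version of each column."""
    newest = {}
    for (row_id, family, qualifier, ts), value in cells.items():
        column = (row_id, f"{family}:{qualifier}")
        if column not in newest or ts > newest[column][0]:
            newest[column] = (ts, value)
    rows = defaultdict(dict)
    for (row_id, column_name), (_, value) in newest.items():
        rows[row_id][column_name] = value
    return dict(rows)

rows = rows_from_cells(cells)
print(rows["user1"])  # {'info:name': 'Ada', 'info:email': 'ada@example.org'}
print(rows["user2"])  # {'info:name': 'Bob'}
```

Note that `user2` has no email column at all: nothing is stored for a missing cell, so rows in the same table can have different sets of columns.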
HBase is a Column-Oriented Database
• A column-oriented database stores the items of a column together
on disk.
• Column-oriented databases are well suited for:
Fast column operations, for example:
Calculating the sum or another aggregate of an entire column of
data.
Finding the 50 largest items in a column of 2 billion records.
Sparse datasets, which are common in big data use cases.
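The two column operations mentioned above can be sketched on a small in-memory table. The sample records and the `column_sum` and `column_top_n` helpers are hypothetical; the point is only that a column layout lets an aggregate read one list of values instead of every full record.

```python
import heapq

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "east", "amount": 40},
    {"order_id": 2, "region": "west", "amount": 75},
    {"order_id": 3, "region": "east", "amount": 10},
    {"order_id": 4, "region": "north", "amount": 55},
]

# Column-oriented layout: all values of one column sit together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def column_sum(name):
    """Aggregate one column without touching the others."""
    return sum(columns[name])

def column_top_n(name, n):
    """Return the n largest values of a column, largest first."""
    return heapq.nlargest(n, columns[name])

print(column_sum("amount"))       # 180
print(column_top_n("amount", 2))  # [75, 55]
```

`heapq.nlargest` keeps only `n` candidates in memory while it scans, which is the usual approach when the column is far larger than the result, as in the "50 largest of 2 billion" case.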
Hbase Operations Overview
• HBase operations include put, get, delete, and scan.
• There is no structured query language (SQL).
• Writes initially go to the in-memory MemStore.
• Each write is also logged immediately to a write-ahead log (WAL) on
disk for durability.
• Writes are regularly flushed from the MemStore to a StoreFile on
disk.
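The write path and the four operations can be sketched with a toy store. This is a simplified model under stated assumptions, not HBase's implementation or API: the `ToyStore` class, its `flush_threshold` parameter, and the in-memory lists standing in for the on-disk log and StoreFiles are all hypothetical, and deletes are modeled as marker entries.

```python
_TOMBSTONE = object()  # marker recorded by delete (toy stand-in)

class ToyStore:
    def __init__(self, flush_threshold=3):
        self.log = []          # stands in for the on-disk log
        self.memstore = {}     # in-memory buffer of recent writes
        self.storefiles = []   # flushed snapshots, oldest first
        self.flush_threshold = flush_threshold

    def _write(self, key, value):
        self.log.append((key, value))      # logged for durability
        self.memstore[key] = value         # then buffered in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def put(self, key, value):
        self._write(key, value)

    def delete(self, key):
        self._write(key, _TOMBSTONE)

    def flush(self):
        """Move the buffered writes into a new sorted snapshot."""
        if self.memstore:
            snapshot = dict(sorted(self.memstore.items(), key=lambda kv: kv[0]))
            self.storefiles.append(snapshot)
            self.memstore = {}

    def get(self, key):
        # Newest data wins: check the memstore, then snapshots newest first.
        for layer in [self.memstore, *reversed(self.storefiles)]:
            if key in layer:
                value = layer[key]
                return None if value is _TOMBSTONE else value
        return None

    def scan(self, start="", stop=None):
        """Return live (key, value) pairs with start <= key < stop, in key order."""
        merged = {}
        for layer in [*self.storefiles, self.memstore]:
            merged.update(layer)           # later layers override older ones
        return [
            (k, v)
            for k, v in sorted(merged.items(), key=lambda kv: kv[0])
            if v is not _TOMBSTONE and k >= start and (stop is None or k < stop)
        ]

store = ToyStore(flush_threshold=2)
store.put("row-a", 1)
store.put("row-b", 2)      # second write triggers a flush
store.put("row-a", 10)     # newer value stays in the memstore
store.delete("row-b")      # triggers a second flush
print(store.get("row-a"))  # 10
print(store.get("row-b"))  # None
print(store.scan())        # [('row-a', 10)]
```

The sketch leaves out everything a real store needs around this path, such as replaying the log after a crash and merging snapshots over time.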
HBase vs RDBMS