Columnar Database - HBase

Dr. Richa Sharma


Commonwealth University
Introduction
 Stores data tables by columns rather than by rows!

 This allows efficient data retrieval, especially when a query performs aggregate operations, and is therefore quite helpful for data analytics and data warehousing!

 Columnar storage enables better data compression due to the similarity of data within a column – this further speeds up aggregation queries and data analytics!

 HBase, Cassandra and Amazon Redshift are examples of columnar databases.

 Columnar databases can use traditional SQL to load data and execute queries.
Example of data in columnar DB

 Let’s assume a snapshot of a table as:

Attr1 Attr2 Attr3
1111  Val1  10000
2222  Val2  20000
3333  Val3  15000

 Columnar storage of this table will store the data as:

(1111, 2222, 3333; Val1, Val2, Val3; 10000, 20000, 15000)

 Row-oriented storage of this table will store the data as:

(1111, Val1, 10000; 2222, Val2, 20000; 3333, Val3, 15000)
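The difference between the two layouts can be sketched in Python (an illustrative sketch using the table above, not how any real engine lays out bytes on disk):

```python
# Row-oriented: records stored together, one tuple per row.
row_store = [
    (1111, 'Val1', 10000),
    (2222, 'Val2', 20000),
    (3333, 'Val3', 15000),
]

# Column-oriented: each attribute's values stored contiguously.
column_store = {
    'Attr1': [1111, 2222, 3333],
    'Attr2': ['Val1', 'Val2', 'Val3'],
    'Attr3': [10000, 20000, 15000],
}

# An aggregate over Attr3 touches one contiguous list in the column
# store, but must pick a field out of every tuple in the row store.
print(sum(column_store['Attr3']))    # 45000
print(sum(r[2] for r in row_store))  # 45000
```

Both layouts hold the same data; the columnar one lets an aggregate read only the attribute it needs.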
Columnar vs Row DB
 Columnar databases store data vertically in a table, while row-oriented databases store data horizontally, organizing each record in a row of a table.

 The data in a columnar database has a highly compressible nature that speeds up aggregate operations such as AVG, MIN and MAX on big data. Such operations are relatively slower on relational databases.

 Column-based DBMS use a self-indexing mechanism, which uses less disk space than an RDBMS containing the same data!

 The relational model focuses on structured data and adheres to the principles of normalization, ensuring data integrity and consistency through well-defined relationships. An RDBMS prioritizes transactional processing and quick access to entire records. Columnar databases leverage vertical storage to enhance query performance, making them particularly suitable for data warehousing and analytics tasks.
Benefits of Columnar DB
 Data within a single column is homogeneous - this makes it highly amenable to compression. Columnar databases capitalize on this by applying advanced compression techniques, significantly reducing storage requirements and associated costs. This compression also results in less I/O overhead!
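To illustrate why a homogeneous column compresses so well, here is a toy run-length encoding of a repetitive column (a sketch only; real engines use more sophisticated schemes, and the column values are invented for the example):

```python
from itertools import groupby

# A sorted, low-cardinality column collapses to a few (value, count) pairs.
column = ['US'] * 4 + ['UK'] * 3 + ['IN'] * 2

def rle_encode(values):
    # Each run of identical values becomes one (value, run_length) pair.
    return [(v, len(list(g))) for v, g in groupby(values)]

def rle_decode(pairs):
    # Expand each pair back into its run of repeated values.
    return [v for v, n in pairs for _ in range(n)]

encoded = rle_encode(column)
print(encoded)  # [('US', 4), ('UK', 3), ('IN', 2)]
assert rle_decode(encoded) == column
```

Nine stored values shrink to three pairs; a row store interleaves unrelated fields, so runs like this rarely occur.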

 In a columnar database, only the columns relevant to a query need to be accessed and processed. This contrasts with row-based databases, where entire rows must be read, even if only a few columns are needed. This selective data retrieval translates to faster query performance, especially for analytical queries that typically aggregate or scan large volumes of data.

 Columnar databases are well-suited for vectorized operations, where the same operation is applied to multiple data points simultaneously, making them useful for big data processing.
Benefits of Columnar DB
 Due to their structure, columnar databases are inherently efficient at aggregating and summarizing data, operations that are fundamental to analytics and reporting. This makes them an ideal choice for business intelligence and analytical applications!

 Columnar databases store sparse data efficiently. In scenarios where there are many missing or null values, columnar databases do not store any data for those missing values. This leads to significant storage savings compared to row-based systems.

 Columnar databases are generally easier to scale horizontally, which means adding more servers to handle increased load. This scalability is particularly beneficial in cloud computing environments where resources can be dynamically adjusted based on demand.

Source: https://atlan.com/what-is/columnar-database
Limitations of Columnar DB
 Columnar databases are optimized for read-heavy analytical queries, and are not suitable for transactional workloads.

 Columnar databases have higher overhead for writing data, as each data insertion or update may require accessing and modifying several distinct column files, leading to increased I/O overhead.

 Not well suited to row-oriented access patterns, such as SQL-style point queries or table joins!

 High learning curve for developers, and cost considerations as well!
HBase
Introduction

 HBase is a distributed, column-oriented database that is very effective for handling large, sparse datasets.

 HBase is based on Bigtable, a high-performance, proprietary database developed by Google and described in a white paper in 2006!

 HBase is written in Java.

 HBase integrates seamlessly with Apache Hadoop and runs on top of the Hadoop Distributed File System (HDFS). HBase serves as a direct input and output to the Apache MapReduce framework for Hadoop, and works with Apache Phoenix (SQL layer) to enable SQL-like queries over HBase tables.
Features of HBase
 HBase not only ensures scalability and consistency, it also has other features that make it a popular choice for dealing with big data:

◦ HBase has some built-in features that other databases lack, such as versioning, compression, garbage collection (for expired data), and in-memory tables.

◦ Having these features available out of the box means that application developers need to write less code for such requirements.

◦ HBase guarantees atomicity at the row level, which means that one can have strong consistency at a crucial level of HBase’s data model.

◦ The fact that HBase guarantees strong consistency makes it easier to transition from relational databases to HBase.
HBase Architecture

 HBase lives in the Hadoop ecosystem, where it benefits from its proximity to other related tools. Being a distributed system, HBase is, by design, fault tolerant.

 In HBase, a row is a collection of column families. A column family is a collection of columns. A column is a collection of key-value pairs.

 A table in HBase is basically a big map - more accurately, a map of maps!

 In an HBase table, keys are arbitrary strings that each map to a row of data. A row is itself a map in which keys are called columns and values are stored as uninterpreted arrays of bytes.
HBase CRUD operations
 In HBase, columns are grouped into column families, so a column’s full name consists of two parts: the column family name and the column qualifier!

 An HBase table might look like this if it were a Python dictionary:

hbase_table = {                # Table
    'row1': {                  # Row key
        'cf1:col1': 'value1',  # Column family, column, & value
        'cf1:col2': 'value2',
        'cf2:col1': 'value3'
    },
    'row2': {
        # More row data
    }
}
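Under this map-of-maps model, reading a cell is just two nested lookups: first by row key, then by 'family:qualifier'. A runnable version of the dictionary above (the row and column names are illustrative):

```python
hbase_table = {
    'row1': {
        'cf1:col1': 'value1',
        'cf1:col2': 'value2',
        'cf2:col1': 'value3'
    },
    'row2': {}
}

# A cell is addressed by row key, then by 'family:qualifier'.
value = hbase_table['row1']['cf1:col1']
print(value)  # value1

# Rows are sparse: absent columns simply have no key, and no storage.
print('cf2:col1' in hbase_table['row2'])  # False
```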
HBase CRUD operations
 Create command in HBase supports creation of tables. Example: the following command creates a table named ‘wiki’ with a single column family named ‘text’:

create 'wiki', 'text'

When the table is created, it is empty; it has no rows and no columns!

 Put command allows adding data to an HBase table. Example:

put 'wiki', 'Home', 'text:', 'Welcome to the wiki!'

This command inserts a new row into the wiki table with the key 'Home', adding 'Welcome to the wiki!' to the column called 'text:'!
HBase CRUD operations
 Get command in HBase helps retrieve data from the table. The get command requires two parameters: the table name and the row key. We can optionally specify a list of columns to return. Example:

get 'wiki', 'Home', 'text:'

This command returns: “value=Welcome to the wiki!”. It fetches the value of the text: column from the wiki table!

 Scan operations simply return all rows in the entire table. Scans are powerful and great for development purposes. Example:

scan 'wiki'
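The four shell commands above can be mimicked with a toy in-memory model in Python (illustrative only; real HBase distributes and persists this map across a cluster):

```python
# Toy in-memory model of the HBase shell commands: create, put, get, scan.
tables = {}

def create(table, *families):
    # create 'wiki', 'text'  -- a new table is empty: no rows, no columns.
    tables[table] = {'families': set(families), 'rows': {}}

def put(table, row_key, column, value):
    # put 'wiki', 'Home', 'text:', 'Welcome to the wiki!'
    tables[table]['rows'].setdefault(row_key, {})[column] = value

def get(table, row_key, column=None):
    # get 'wiki', 'Home', 'text:'  -- column is optional.
    row = tables[table]['rows'][row_key]
    return row[column] if column else row

def scan(table):
    # scan 'wiki'  -- returns every row in the table.
    return tables[table]['rows']

create('wiki', 'text')
put('wiki', 'Home', 'text:', 'Welcome to the wiki!')
print(get('wiki', 'Home', 'text:'))  # Welcome to the wiki!
print(scan('wiki'))
```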
HBase – importing Data
 When we set up a new database, one major problem that we encounter is how to migrate data into it!

 Handcrafting ‘put’ operations with static strings can do this task – but that’s a cumbersome solution.

 A better solution is to have scripts ready to migrate data from the original data source to HBase!

 Much of the Big Data for which HBase is the best solution can be exported as XML files with informative XML tags, for which we can write scripts to extract data and put it into the HBase table map!
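Such an import script can be sketched with Python's standard XML parser. The `<page>`/`<title>`/`<text>` tags below are a hypothetical export format, and the dictionary stands in for the put calls a real script would issue against the wiki table:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML export: each <page> becomes one HBase row.
xml_data = """
<wiki>
  <page><title>Home</title><text>Welcome to the wiki!</text></page>
  <page><title>About</title><text>All about this wiki.</text></page>
</wiki>
"""

hbase_rows = {}  # stand-in for put calls against the 'wiki' table
for page in ET.fromstring(xml_data).findall('page'):
    row_key = page.findtext('title')               # row key
    hbase_rows[row_key] = {'text:': page.findtext('text')}  # column value

print(hbase_rows['Home']['text:'])  # Welcome to the wiki!
```

For large exports, ET.iterparse would stream the file instead of loading it whole.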
HBase – importing Data
 Oftentimes, the data that needs to be imported to HBase is big blobs of text content, which take longer to read and write!

 HBase provides compression utilities to speed up data reads.

 HBase supports two compression algorithms: Gzip (GZ) and Lempel-Ziv-Oberhumer (LZO)! LZO has licensing problems – that makes Gzip a favourable choice over LZO.

 HBase features Bloom filters as a faster way of determining whether data exists before incurring an expensive disk read!!

 A Bloom filter is a useful data structure to determine whether a particular column exists for a given row key, or just whether a given row key exists at all (BLOOMFILTER=>'ROW').
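The idea behind a Bloom filter can be shown in a few lines of Python. This is a minimal sketch, not HBase's implementation (which lives in its storefiles and is configured per column family); the sizes and hash scheme are arbitrary choices for the example:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size  # the filter is just a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent" (skip the disk read);
        # True means "possibly present" (go read the disk).
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add('row-42')
print(bf.might_contain('row-42'))   # True
print(bf.might_contain('row-999'))  # almost certainly False
```

The key property is that a Bloom filter never gives false negatives, so a negative answer safely avoids the disk read; the price is a small chance of false positives.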
Attributes of database to explore!

 Nature of problem and usage of database – problems where “Big Data” processing is a requirement! Example: Airbnb, Yahoo, eBay, Meetup etc.

 Unique characteristic of database – similar to a relational DB but treats tables, rows and columns differently. Tables are maps of maps (column families).

 Communication interface of database – HBase provides a command-line interface for interacting with the database!

 Scalability – Highly scalable for Big Data with good performance!

 Security – One can enable Kerberos-based authentication and TLS encryption at the HBase cluster level.


Attributes of database to explore!

 Durability – HBase can gracefully recover from individual server failures because it uses write-ahead logging (WAL): each update is appended to a log on disk before it is applied to the in-memory store, so that a recovering node can replay the log to rebuild any state that had not yet been flushed to disk.

 Database Replication – HBase does support cluster-to-cluster replication. A typical multi-cluster setup could have clusters separated geographically by some distance. In this case, for a given column family, one cluster is the system of record, while the other clusters merely provide access to the replicated data.
