
20ITEL609 - NoSQL Database Techniques

UNIT III COLUMN ORIENTED DATABASE (9 Hours)


Column-oriented NoSQL databases using Apache HBase - Column-oriented NoSQL databases using Apache Cassandra - Architecture of HBase - What Is a Column-Family Data Store - Features - Consistency - Transactions - Availability - Query Features - Scaling - Suitable Use Cases - Event Logging - Content Management Systems - Blogging Platforms - Counters - Expiring Usage.


Topics covered in this unit:
3.1 Column-oriented NoSQL databases using Apache HBase
3.2 Column-oriented NoSQL databases using Apache Cassandra
3.3 Architecture of HBase
3.4 What Is a Column-Family Data Store - Features - Consistency
3.5 Transactions - Availability - Query Features - Scaling - Suitable Use Cases
3.6 Event Logging - Content Management Systems
3.7 Blogging Platforms - Counters - Expiring Usage
3.8 Visible surface detection methods

3.1 Column-oriented NoSQL databases using Apache HBase

Apache HBase is a widely used, open-source, column-oriented NoSQL database that is


designed to store large amounts of data across many commodity servers. It is modeled after
Google Bigtable and is built on top of Hadoop, leveraging the Hadoop Distributed File
System (HDFS) for storage. HBase is a part of the Apache Hadoop ecosystem and is often
used for handling large-scale, sparse datasets.

Key Characteristics of HBase:

1.​ Column-Oriented Storage:​

○​ HBase stores data in column families, where each column family is stored separately
on disk, allowing more efficient read and write operations for queries that touch
only part of the data. This column-oriented storage model is particularly suited
for analytical workloads where access to specific columns in wide tables is
frequent.
2.​ Scalability:​

○​ HBase is designed to scale horizontally. It can handle massive amounts of


data and high-throughput requests by distributing the data across many


nodes. The data is partitioned into regions, and these regions can be
distributed across multiple servers (region servers) as the data grows.
3.​ Real-Time Access:​

○​ HBase provides real-time read/write access to data. This makes it suitable for
applications requiring low-latency access to large amounts of data.
4.​ Schema-less:​

○​ While HBase tables do have a schema, they are more flexible than traditional
relational databases. Each row can have a different set of columns, and the
columns are organized into families. This is particularly useful when the data
model changes frequently or when data comes in varied forms.
5.​ Strong Consistency:​

○​ HBase offers strong consistency guarantees for data, meaning that once a
write operation is successful, all subsequent reads will reflect that write. This
is important for applications requiring immediate consistency.

HBase Data Model:

●​ Table: Data is stored in tables, each identified by a unique name.


●​ Row: Each table consists of rows, which are identified by a unique row key.
●​ Column Family: Data is grouped into column families, which are the basic storage
units in HBase. Each column family contains multiple columns.
●​ Column: Columns are stored as key-value pairs within column families, and each
column can store multiple versions of data over time (using timestamps).

Benefits of Using HBase:

1.​ High Throughput:​

○​ HBase can handle high throughput for read and write operations, making it
suitable for real-time applications, such as web analytics, sensor data, and
user activity logs.
2.​ Flexible Schema Design:​

○​ HBase allows you to define flexible schema designs. Each row can have
different columns, and new columns can be added without downtime.
3.​ Scalable:​


○​ It scales linearly as data grows. When more data is added, you can add more
region servers, and HBase will distribute the load across the servers
automatically.
4.​ Integration with Hadoop Ecosystem:​

○​ Since HBase is built on top of Hadoop, it integrates well with other tools in
the Hadoop ecosystem, like Apache Hive, Apache Spark, and Apache Flume,
for big data processing, analytics, and stream processing.
5.​ Distributed and Fault Tolerant:​

○​ HBase is fault-tolerant by design. It replicates data across multiple nodes,


ensuring high availability and data durability. If one node fails, HBase can
continue to operate without data loss.

Use Cases for HBase:

●​ Time-Series Data: HBase is suitable for storing and querying time-series data, such
as logs, sensor data, or event data that needs to be stored over long periods and
accessed efficiently.
●​ Real-Time Analytics: HBase can be used for real-time analytics of large datasets,
especially when queries need to access specific columns rather than entire rows.
●​ Recommendation Systems: Large-scale recommendation engines, like those used
in e-commerce or social media, can benefit from HBase's scalability and real-time
data access.
●​ Big Data Applications: Applications requiring the storage of large volumes of data,
such as those used in telecommunications, finance, and social media, are prime
candidates for HBase.

HBase Architecture:

1.​ Region Server:​

○​ The region server is responsible for managing regions and serving requests.
Each region is a subset of the table's data.
2.​ HMaster:​

○​ The HMaster is responsible for managing the overall cluster, including region
assignments, balancing load, and handling failures.
3.​ ZooKeeper:​


○​ HBase relies on Apache ZooKeeper for coordination and management of the


HBase cluster. It helps HBase maintain metadata about regions, manage
failovers, and ensure that the region servers are in sync.

Example of an HBase Table:

Consider a table for storing user information with the following columns:

●​ Table: Users
○​ Row Key: user_id
○​ Column Family: profile
■​ Columns: name, email, phone_number
○​ Column Family: preferences
■​ Columns: theme, language

Each user would have a row with a unique user_id, and each row would have data in the
profile and preferences column families.
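
As a hedged illustration, the same Users table could be created and populated from the HBase shell. The row key u1001 and the cell values below are made-up example data; the column families and qualifiers follow the model described above.

create 'Users', 'profile', 'preferences'

put 'Users', 'u1001', 'profile:name', 'Asha Verma'
put 'Users', 'u1001', 'profile:email', 'asha@example.com'
put 'Users', 'u1001', 'preferences:theme', 'dark'
put 'Users', 'u1001', 'preferences:language', 'en'

get 'Users', 'u1001'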

Conclusion:

Apache HBase is a powerful, column-oriented NoSQL database designed for handling


large-scale, distributed, real-time data. It is an excellent choice for applications that need
high throughput, horizontal scalability, and flexibility in schema design. Its integration with
the Hadoop ecosystem makes it an essential tool for big data applications.

3.2 Column-oriented NoSQL databases using Apache Cassandra

Apache Cassandra is another highly popular, open-source, distributed, column-oriented


NoSQL database designed for handling large amounts of data across many commodity
servers without any single point of failure. Unlike relational databases, Cassandra is
designed to scale horizontally and manage high-velocity, high-volume transactional data. It
is optimized for distributed environments and designed to be highly available and
fault-tolerant.

Key Characteristics of Apache Cassandra:

1.​ Column-Family Data Model:​

○​ Like HBase, Cassandra uses a column-family model to organize data. A


column family is similar to a table in relational databases, but the data is


stored by column rather than by row, which allows for more efficient queries
on specific columns.
2.​ Distributed and Decentralized:​

○​ Cassandra is designed to be distributed and decentralized. Each node in a


Cassandra cluster is identical, meaning that there is no master node or single
point of failure. This ensures high availability and fault tolerance.
3.​ Scalability:​

○​ Cassandra provides linear scalability, meaning you can add more nodes to a
Cassandra cluster to handle more data and requests without affecting
performance. It is optimized for very large datasets and can scale out
horizontally across thousands of nodes.
4.​ High Availability and Fault Tolerance:​

○​ One of Cassandra's core features is its ability to offer high availability. It does
this by replicating data across multiple nodes and data centers. If a node fails,
Cassandra can still serve requests using other replicas of the data. This is
crucial for applications that cannot afford downtime.
5.​ Eventual Consistency:​

○​ Cassandra follows an eventual consistency model (as opposed to strong


consistency like HBase). It offers tunable consistency, meaning you can
choose between strong consistency, eventual consistency, or somewhere in
between based on the needs of your application. This flexibility allows it to
offer high availability while providing options for different consistency levels.
6.​ Write Optimized:​

○​ Cassandra is optimized for write-heavy workloads, where it can handle a


massive amount of write operations quickly. It uses a log-structured merge
tree (LSM tree) structure to write data efficiently and supports
high-throughput insertions with low latency.

Key Concepts in Cassandra:

1.​ Keyspace:​

○​ A keyspace is the outermost container for data in Cassandra, similar to a


database in relational systems. It defines replication strategies and other
properties for how data is distributed across the cluster.


2.​ Column Family:​

○​ A column family is the primary data structure in Cassandra. It is similar to a


table in relational databases but has a more flexible schema. Each row in a
column family is identified by a unique primary key, and rows can have
different columns. Columns are grouped into families.
3.​ Row:​

○​ Each row in a column family is uniquely identified by a primary key, which is


composed of a partition key and optional clustering columns.
4.​ Column:​

○​ A column consists of a name, value, and timestamp. Unlike relational


databases, where columns are predefined, Cassandra allows dynamic
addition of new columns to rows.
5.​ Replication:​

○​ Data is replicated across multiple nodes in the cluster to ensure durability


and high availability. The replication factor determines how many copies of
the data are maintained.
6.​ Partition Key and Clustering Key:​

○​ Partition Key: The partition key is used to distribute data across different
nodes in the cluster. All rows with the same partition key will be stored
together on the same node.
○​ Clustering Key: The clustering key determines the order in which rows are
stored within the same partition, which allows for efficient range queries.
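
For illustration, a minimal CQL sketch of a table that separates the partition key from a clustering key; the table name user_activity and its columns are assumed for this example only.

CREATE TABLE user_data.user_activity (
    user_id       UUID,        -- partition key: decides which node(s) store the partition
    activity_time TIMESTAMP,   -- clustering key: orders rows inside the partition
    action        TEXT,
    PRIMARY KEY ((user_id), activity_time)
) WITH CLUSTERING ORDER BY (activity_time DESC);

All activities of one user land in the same partition, and queries such as "the latest ten actions of this user" become efficient ordered reads within that partition.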

Example of a Cassandra Data Model:

Consider an example for storing user information in a Cassandra database.

Keyspace:
CREATE KEYSPACE user_data WITH REPLICATION = {'class' : 'SimpleStrategy',
'replication_factor' : 3};

Column Family (Table):


CREATE TABLE user_data.users (
user_id UUID PRIMARY KEY,


first_name TEXT,
last_name TEXT,
email TEXT,
phone_number TEXT,
preferences MAP<TEXT, TEXT>
);

In this table:

●​ Primary Key: user_id is the primary key.


●​ The table also stores attributes like first_name, last_name, and preferences as
columns.
●​ The preferences column is a map of key-value pairs that stores user preferences in a
dynamic manner.
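
To make the model concrete, here is a hedged cqlsh sketch of writing and reading a user while tuning the consistency level; the UUID literal is just an example value.

-- cqlsh lets the consistency level be set per session (tunable consistency)
CONSISTENCY QUORUM

-- The UUID below is an illustrative value
INSERT INTO user_data.users (user_id, first_name, last_name, email, preferences)
VALUES (6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47, 'Asha', 'Verma', 'asha@example.com',
        {'theme': 'dark', 'language': 'en'});

SELECT first_name, email, preferences
FROM user_data.users
WHERE user_id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47;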

Benefits of Using Cassandra:

1.​ Scalability:​

○​ Cassandra is designed to handle massive data volumes. It is capable of scaling


horizontally by adding more nodes to the cluster. This makes it suitable for
applications dealing with large datasets, such as social media platforms, IoT
systems, or big data processing pipelines.
2.​ High Availability:​

○​ By replicating data across multiple nodes (and optionally across multiple


data centers), Cassandra ensures high availability. If a node or data center
goes down, Cassandra can still serve the requests from replicas, providing
resilience against hardware failures.
3.​ Tunable Consistency:​

○​ Cassandra offers tunable consistency, allowing users to adjust the


consistency level of operations based on their needs. For example, you can
prioritize availability over consistency (eventual consistency) or vice versa,
depending on the use case.
4.​ Write Optimized:​

○​ Cassandra is optimized for high throughput on write-heavy workloads,


making it well-suited for applications like log processing, time-series data, or
real-time analytics.


5.​ Flexible Schema Design:​

○​ While Cassandra has a schema, it allows flexibility for data modeling. Each
row can have different columns, and columns can be added dynamically. This
is beneficial when the data structure evolves over time.
6.​ Distributed Architecture:​

○​ Cassandra’s distributed architecture ensures that it can run on multiple


nodes, with no single point of failure, making it ideal for cloud-based
deployments and globally distributed systems.

Use Cases for Cassandra:

●​ Real-Time Big Data Applications: Due to its ability to handle massive data volumes
and provide low-latency read/write operations, Cassandra is ideal for real-time big
data use cases like social media analytics, recommendation engines, and sensor data.
●​ Time-Series Data: Cassandra’s ability to scale horizontally and handle high-write
loads makes it suitable for time-series data applications such as IoT devices or stock
market data.
●​ Event Logging and Monitoring: Applications that need to store and analyze large
volumes of event or log data benefit from Cassandra’s write optimization and high
availability.
●​ Content Management Systems: Systems that store large, distributed datasets, like
those used for managing digital content (videos, articles, images), can benefit from
Cassandra's ability to handle large, unstructured data.

Cassandra vs HBase:

●​ Data Model: While both HBase and Cassandra use a column-family model,
Cassandra is more flexible in terms of schema design and can handle a wider variety
of workloads. HBase typically requires more upfront schema definition, while
Cassandra allows for more dynamic changes.
●​ Consistency Model: HBase offers strong consistency by default, while Cassandra
uses eventual consistency but allows the user to tune consistency levels.
●​ Cluster Management: HBase relies on a master-slave architecture, while
Cassandra’s peer-to-peer model is fully decentralized.
●​ Performance: Cassandra is often more suited for high-throughput write-heavy
applications, while HBase is more focused on large-scale read-heavy workloads.

Conclusion:


Apache Cassandra is an excellent choice for distributed, column-oriented NoSQL database


management. Its horizontal scalability, high availability, and flexibility make it ideal for
applications dealing with large-scale data, such as real-time analytics, sensor data, and
content management systems. However, its eventual consistency model might not be
suitable for all use cases, particularly those that require strong consistency guarantees. If
you need a highly available, distributed database that can handle massive amounts of
write-heavy data, Cassandra is a strong option to consider.

3.3 Architecture of HBase

Apache HBase is a distributed, column-oriented NoSQL database that is part of the Hadoop
ecosystem. It is designed to handle large-scale, sparse datasets across a cluster of
commodity servers. HBase is inspired by Google Bigtable and provides real-time random
access to large datasets, scaling horizontally to meet the needs of big data applications.

Key Components of HBase Architecture:

1.​ HBase Client:​

○​ The HBase client is used by applications to interact with the HBase cluster.
Clients can perform CRUD (Create, Read, Update, Delete) operations on
HBase tables. Clients can directly interact with the region servers or use the
HBase API to connect and query data from the cluster.
2.​ HBase Master:​

○​ The HBase Master is the central management node in an HBase cluster,


responsible for coordinating and managing the overall system. Its primary
duties include:
■​ Region Management: The master handles the assignment of regions
to region servers. It also balances the regions between servers as the
load increases.
■​ Cluster Monitoring: It tracks the health of the HBase cluster and
ensures that region servers are functioning properly.
■​ Meta Data Management: The master manages and serves the
metadata about the cluster, including the hbase:meta table, which
contains information about the location of regions and region servers.
3.​ Region Server:​


○​ A Region Server is a key component in HBase. It serves as the actual server


responsible for handling read and write requests for a specific set of regions.
○​ Each region server manages one or more regions. A region is a subset of a
table, and each region stores rows for a specific range of row keys.
○​ A region contains one MemStore per column family (an in-memory write buffer) and
one or more HFiles on disk, and is served by a RegionServer.
○​ Region Servers can handle client requests such as reading or writing data by
interacting with the underlying storage layer (HDFS).
4.​ Region:​

○​ A Region is a horizontal partition of a table. A table is divided into multiple


regions, and each region holds a portion of the table's data based on the row
key range.
○​ Each region is responsible for a subset of rows within a table, and they are
split when the data size grows too large. The region’s data is stored in HFiles
(HDFS blocks).
○​ Region splits are an important feature for scaling the system. When a region
grows beyond a certain size (usually set in the configuration), it will
automatically split into two smaller regions, which are then assigned to
different region servers.
5.​ HFile:​

○​ Data in HBase is stored in HFiles, which are essentially files stored in HDFS.
○​ HFiles are immutable files that store the data of a region. Once data is
written to an HFile, it is not modified. New data is written to a MemStore,
and periodically, the contents of the MemStore are flushed to a new HFile on
HDFS.
○​ HFiles are efficient for large, sequential reads and write operations.

6.​ MemStore:​

○​ The MemStore is an in-memory buffer used for writing data to HBase. When
a write request (like a put operation) is made, data is first stored in the
MemStore, which exists in memory.
○​ When the MemStore reaches a certain threshold (configured by the
hbase.hregion.memstore.flush.size property), its contents are flushed to disk
as an HFile in HDFS. This process is called memstore flushing.


○​ After flushing, the MemStore is cleared, and the new data is available for
future writes or reads.
7.​ Write-Ahead Log (WAL):​

○​ HBase uses a Write-Ahead Log (WAL) to ensure durability and fault


tolerance for write operations. When a write request is received, the data is
first written to the WAL on disk before being stored in the MemStore.
○​ The WAL acts as a transaction log to recover data in case of a failure. If a
region server crashes, the data in the WAL can be replayed to restore the data
that was in the MemStore before the crash.
8.​ Zookeeper:​

○​ Zookeeper is a critical part of the HBase architecture. It is used for


distributed coordination and management within the HBase cluster.
○​ Zookeeper is responsible for:
■​ Managing HBase Master Election: In a multi-master environment,
Zookeeper ensures that only one master is active at any time.
■​ Region Server Coordination: It tracks which region servers are up
and running, and it helps with the assignment of regions to region
servers.
■​ Cluster Metadata: Zookeeper also stores important metadata related
to HBase, such as the locations of region servers and regions.
9.​ HBase Meta Table (hbase:meta):​

○​ The hbase:meta table is a special system table in HBase that contains


metadata about all the regions in the cluster. It holds information about the
location of all regions (which region server they are assigned to), their row
key range, and other necessary details.
○​ When a client makes a request, it first looks up the hbase:meta table (whose
location is obtained from ZooKeeper) to determine which region server hosts the
region for the requested row key; the result is cached for later requests.

10.​Compaction:​

●​ Compaction is the process of merging smaller HFiles into larger ones to optimize
storage and performance. Over time, as data is written and flushed to disk, multiple
small HFiles are created. As more data is inserted, the number of HFiles increases,
which can degrade performance.


●​ Minor Compaction: A background process that merges smaller HFiles in the same
region into a larger file.
●​ Major Compaction: A more intensive process that consolidates HFiles and removes
deleted or outdated data (tombstones).

HBase Data Flow:

1.​ Write Process:​

○​ A client sends a write request (e.g., Put) to a region server.


○​ The region server first appends the change to the WAL for durability.
○​ The data is then written to the MemStore.
○​ Once the MemStore reaches a threshold, it is flushed to an HFile on HDFS.
○​ The WAL is periodically archived to maintain fault tolerance.
2.​ Read Process:​

○​ A client sends a read request (e.g., Get) to a region server.


○​ The region server checks its MemStore for the requested data.
○​ If the data is not in the MemStore, the region server searches for it in the
HFiles stored on HDFS.
○​ If the client does not already know which region holds the row, it consults the
hbase:meta table to locate the correct region server before sending the request.
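
To tie this data flow to the client API, here is a hedged Java sketch of a single write and read. The table name Users, the row key, and the values are illustrative; the snippet assumes a running cluster and sits inside a method that declares IOException. Durability.SYNC_WAL simply makes the WAL step explicit.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("Users"))) {

    // Write path: the change is appended to the WAL, then buffered in the MemStore
    Put put = new Put(Bytes.toBytes("u1001"));
    put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Asha Verma"));
    put.setDurability(Durability.SYNC_WAL);   // sync the WAL entry before acknowledging the write
    table.put(put);

    // Read path: served from the MemStore if present, otherwise from HFiles on HDFS
    Result result = table.get(new Get(Bytes.toBytes("u1001")));
    System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"))));
}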

The architecture of HBase is designed to be distributed, scalable, and fault-tolerant. It


follows a master-slave model, where the HBase Master coordinates the system, and the
Region Servers handle the read and write operations. Data is stored in HFiles in HDFS and
is managed through MemStore and WAL for high throughput and durability. Zookeeper
ensures that the system remains highly available and properly coordinated.


3.4 What Is a Column-Family Data Store - Features - Consistency

In HBase, consistency refers to the system’s ability to ensure that data remains consistent
across all nodes, especially in a distributed environment. HBase primarily offers strong
consistency for single-row operations but can provide various consistency guarantees for
multi-row operations, depending on how the system is configured and the consistency level
chosen.

Types of Consistency in HBase:


1.​ Strong Consistency (Single-row operations)​

○​ HBase provides strong consistency for read and write operations on a single
row. This means that once a write is acknowledged, any subsequent read
from the same row will reflect the latest data, regardless of which region
server is accessed.
2.​ Eventual Consistency (Multi-row operations)​

○​ For multi-row operations, HBase may not guarantee immediate consistency


across multiple regions. This is typically the case with multi-region scans or
operations spanning across multiple rows.

HBase Consistency Example Queries:

1. Strong Consistency (Single-Row Consistency)

For single-row operations, HBase guarantees strong consistency. Once a write to a row is
successful, any subsequent reads will return the most recent data for that row.

Example: Put and Get Operations (Strong Consistency)

// Create a Put instance to insert data

Put put = new Put(Bytes.toBytes("row1"));

put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("column1"),
Bytes.toBytes("value1"));

// HBase table reference (assuming 'table' is an HTable instance)

table.put(put);

// Retrieve the data from HBase table

Get get = new Get(Bytes.toBytes("row1"));

Result result = table.get(get);

// Reading the value of "column1" from the row "row1"

byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("column1"));

System.out.println("Value for column1: " + Bytes.toString(value));


In this example:

●​ A Put operation writes the value value1 to row1 and column1.


●​ The Get operation retrieves the value for row1 and column1. If this is the first
request after the Put, the value returned will be value1, ensuring strong
consistency.

After the write, all subsequent reads (on the same row) will see the most recent value (e.g.,
"value1") and will reflect the same data consistently across different regions or region
servers.

2. Eventual Consistency (Multi-Region Scans or Multi-Row Operations)

HBase operates with strong consistency for single-row operations, but when it comes to
multi-row or cross-region queries (like scans spanning across regions), consistency can
become eventual. This means the system may return stale data for rows that were recently
updated or involve multiple replicas.

Example: Scan Operation with Potential Eventual Consistency

Scan scan = new Scan();

scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("column1"));

// Assuming 'table' is an HTable instance
ResultScanner scanner = table.getScanner(scan);

// Iterating over the results
for (Result result : scanner) {
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("column1"));
    System.out.println("Value: " + Bytes.toString(value));
}

scanner.close();

In this example:

●​ The Scan operation retrieves multiple rows and columns from HBase.


●​ If these rows are distributed across different regions or servers, there may be slight
delays in replicating data across the cluster.
●​ If recent updates were made to some of the rows, you may experience eventual
consistency, where the data might not be up-to-date on all replicas immediately.

3. Consistency During Row Mutation Operations (Transactional Consistency)

While HBase does not support full ACID transactions (like relational databases), it ensures
atomicity for operations within a single row. This means that any updates to a row
(whether Put, Delete, or other operations) will be applied atomically to that row.

Example: Atomic Update with Put

// Create a Put instance for row "row2"

Put put = new Put(Bytes.toBytes("row2"));

put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("column2"),
Bytes.toBytes("newValue2"));

// HBase table reference (assuming 'table' is an HTable instance)

table.put(put);

// Perform a Get operation on row "row2"

Get get = new Get(Bytes.toBytes("row2"));

Result result = table.get(get);

byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("column2"));

System.out.println("Updated value: " + Bytes.toString(value));

In this example:

●​ A Put operation is performed on row2, updating column2 with the value


"newValue2".
●​ If another Put or Delete operation is performed within the same row, HBase
ensures atomicity for that row, meaning either all updates will happen, or none will.
This gives a form of row-level consistency.


4. Consistency During Multi-row Scans (Eventual Consistency)

When performing scans or operations across multiple rows, HBase does not guarantee
immediate consistency. In cases of network partitions or node failures, different region
servers might have inconsistent versions of the data, which can result in eventual
consistency.

Example: Multi-row Scan with Eventual Consistency

Scan scan = new Scan();

scan.addFamily(Bytes.toBytes("cf")); // Scanning the entire column family

ResultScanner scanner = table.getScanner(scan);

for (Result result : scanner) {
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("column1"));
    System.out.println("Column value: " + Bytes.toString(value));
}

scanner.close();

In this example:

●​ A Scan operation is done across multiple rows within the column family "cf".
●​ Since this operation spans multiple region servers or regions, the data read might be
eventually consistent, meaning it may not reflect the most recent updates for some
rows, especially if those rows are being replicated or have just been written.

How to Control Consistency in HBase:

●​ Single-Row Operations: For single-row reads and writes, HBase ensures strong
consistency. Therefore, any Put, Get, or Delete operations on a single row will
always be consistent.​

●​ Multi-Row Scans/Queries: While HBase provides strong consistency for single-row


operations, multi-row operations, especially when they span multiple regions, may
experience eventual consistency due to data replication delays, network partitions,
or node failures.​


●​ Write-Ahead Log (WAL): For durability and consistency, writes are first logged in
the WAL before being applied to the MemStore and HFiles. This ensures that even if
a region server crashes, data can be recovered to the last consistent state from the
WAL.​

●​ Versioning: HBase supports versioning of cells. When a row is updated, the older
versions are retained based on the configured versioning policy. This allows you to
query the row's previous versions, ensuring that the system maintains consistency
in historical data.
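
For illustration, versioning can be exercised from the HBase shell; the table name 'mytable' below is assumed, while 'cf', 'row1', and 'column1' follow the earlier snippets.

# Keep up to 3 versions of every cell in column family 'cf'
alter 'mytable', {NAME => 'cf', VERSIONS => 3}

# Read the three most recent versions of one cell
get 'mytable', 'row1', {COLUMN => 'cf:column1', VERSIONS => 3}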

3.5 Transactions- Availability- Query Features- Scaling- Suitable Use Cases

Here is a summary of the query features in HBase:

●​ Row Key-based Queries: queries are optimized for retrieving data based on the row key and are efficient for single-row lookups. Example: retrieving a user's profile based on user ID.
●​ Range Queries: supports range scans based on row keys, efficient for querying a range of rows sorted by row key. Example: fetching all rows for a particular time period (e.g., logs from a specific date range).
●​ Column Family-based Queries: can query specific column families, or even specific columns within a family, which can optimize scans. Example: fetching only certain columns, such as the name or email column, from a user profile.
●​ Column-based Queries: querying specific columns in a column family reduces the amount of data returned. Example: fetching only the "email" column from a user profile.
●​ Scan Operations: allows scanning of the entire table or a subset of rows based on conditions like start/stop row key. Example: scanning rows for a specific range, e.g., all products from row1 to row100.
●​ No Joins: HBase does not support joins like relational databases; queries must be issued separately or handled by external tools such as Apache Phoenix or Apache Hive. Example: for complex relationships, handle joins outside HBase (e.g., in a data processing layer).
●​ No Aggregations: aggregations like COUNT, SUM, and AVG are not supported natively. Example: aggregation must be handled at the application level or by integrating external tools like Apache Hive.
●​ Single-row Consistency: HBase provides strong consistency for single-row operations; once a write is confirmed, reads reflect the latest data. Example: fetching the most up-to-date data from a single row, such as user preferences or session data.
●​ Partial Scans: allows partial scans that filter specific rows or columns within a range, improving scan efficiency. Example: scanning only a subset of rows within a specific range of row keys.
●​ Time-based Queries: supports retrieving historical versions of data using the versioning feature of HBase. Example: fetching older versions of a row to analyze data changes over time.
●​ Filters: HBase supports filters to limit the results of queries or scans based on specific criteria (e.g., value filters, prefix filters). Example: filtering rows based on specific conditions, such as a column value greater than a certain threshold.
●​ Throttling Queries: supports limiting the amount of data returned in a query to avoid overwhelming the client or server. Example: limiting the number of rows retrieved in a scan or query to avoid system overload.

Here is a more realistic example of querying in HBase, using an Indian student database.

HBase Table: "students"


Row Key    personal:name    personal:age
201        Rajesh Kumar     15
202        Priya Sharma     16
203        Arjun Reddy      14

(The academics column family, holding grade and marks, is also populated for these rows, as the query outputs below show.)

1. Querying a Specific Student

Let’s get all details of student Rajesh Kumar (Row Key: 201):

get 'students', '201'

Output:

COLUMN CELL
personal:name timestamp=..., value=Rajesh Kumar
personal:age timestamp=..., value=15
academics:grade timestamp=..., value=10
academics:marks timestamp=..., value=88

2. Querying a Specific Column (Marks)

To retrieve only marks of Priya Sharma (Row Key: 202):

get 'students', '202', 'academics:marks'

Output:

COLUMN CELL
academics:marks timestamp=..., value=92

3. Scanning Multiple Rows

To get all students' details:

scan 'students'

Output:


ROW COLUMN+CELL
201 column=personal:name, timestamp=..., value=Rajesh Kumar
201 column=personal:age, timestamp=..., value=15
201 column=academics:grade, timestamp=..., value=10
201 column=academics:marks, timestamp=..., value=88
...

4. Filtering Students Whose Row Key Starts with "20"

If we want to fetch all students whose Row Key starts with "20" (e.g., 201, 202, 203):

scan 'students', {FILTER => "PrefixFilter('20')"}

Output:

ROW COLUMN+CELL
201 column=personal:name, timestamp=..., value=Rajesh Kumar
202 column=personal:name, timestamp=..., value=Priya Sharma
203 column=personal:name, timestamp=..., value=Arjun Reddy
...

5. Using Value Filters (Students with Marks > 90)

To get students who scored more than 90 marks:

scan 'students', {FILTER => "SingleColumnValueFilter('academics', 'marks', >, 'binary:90')"}

Output:

ROW COLUMN+CELL
202 column=personal:name, timestamp=..., value=Priya Sharma

●​ HBase is column-oriented—it efficiently retrieves specific columns rather than


scanning entire rows.
●​ Row Key-based lookups are super fast using get.
●​ Filters help narrow down results, making queries efficient.

Transaction Control in HBase

HBase does not support traditional ACID transactions like relational databases (e.g.,
MySQL, PostgreSQL). However, it provides atomicity at the row level and some
mechanisms to control data consistency. Here are the key transaction control mechanisms:


1.​ Row-Level Atomicity: All updates to a single row are atomic.


2.​ Check-and-Put: Ensures conditional updates.
3.​ Check-and-Delete: Deletes a row only if a condition is met.
4.​ Increment Operations: Atomic counters for numeric values.
5.​ Batch Operations: Perform multiple mutations (Put/Delete) in a single call.

1. Row-Level Atomicity Example

HBase ensures that all column families of a row are updated atomically.

Scenario: Updating Marks of a Student

Let’s assume we have an HBase table "students":

Row Key Personal:name Academics:marks

101 Rajesh Kumar 85

If we update marks and grade in the same row, HBase applies each mutation to that row atomically.

put 'students', '101', 'academics:marks', '90'
put 'students', '101', 'academics:grade', 'A'

Each of these writes is atomic on its own. To have both columns updated as one all-or-nothing change, they must be combined into a single Put (or a RowMutations batch) through the client API; such a combined mutation is never partially written.

2. Conditional Updates using "Check-and-Put"

HBase supports Check-and-Put, which updates a value only if a condition is met.

Scenario: Update marks only if the current marks are 85


checkAndPut 'students', '101', 'academics', 'marks', '85', '95'

(Illustrative notation: check-and-put is exposed through the HBase client API rather than as a standard shell command; a Java sketch follows this example.)

✅ If marks = 85, update to 95.
❌ If marks ≠ 85, no update happens.

This is useful for ensuring consistency in concurrent environments.
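
Below is a hedged Java sketch of the same conditional update using the client API's checkAndPut (an older call; newer releases expose the equivalent checkAndMutate). It assumes 'table' is a Table instance, as in the earlier snippets, and reuses the row and values from the example above.

// Conditional update: set academics:marks to '95' only if it is currently '85'
Put put = new Put(Bytes.toBytes("101"));
put.addColumn(Bytes.toBytes("academics"), Bytes.toBytes("marks"), Bytes.toBytes("95"));

boolean applied = table.checkAndPut(
        Bytes.toBytes("101"),        // row key
        Bytes.toBytes("academics"),  // column family to check
        Bytes.toBytes("marks"),      // qualifier to check
        Bytes.toBytes("85"),         // expected current value
        put);                        // mutation applied only if the check passes

System.out.println(applied ? "Marks updated to 95" : "Current marks were not 85; no update made");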


3. Conditional Deletion using "Check-and-Delete"

Deletes a row only if a specific condition is met.

Scenario: Delete Student Record if Marks = 95


checkAndDelete 'students', '101', 'academics', 'marks', '95'

(Illustrative notation: like check-and-put, this maps to the client API's check-and-delete/check-and-mutate calls.)

✅ If marks = 95, delete the row.
❌ If marks ≠ 95, do nothing.

4. Atomic Increment Operations

HBase provides an atomic increment method to increase numerical values without race
conditions.

Scenario: Increment Marks by 5


incr 'students', '101', 'academics:marks', 5

If the current marks were 95, they become 100. (Note: the shell command is incr, and it works on counter cells that HBase stores as 8-byte binary longs; a value originally written as a text string with put cannot be incremented directly.)

5. Batch Mutations (Multiple Updates in One Call)

HBase allows batch operations for better performance.

Scenario: Update multiple students' marks in a single batch


In the HBase shell the puts below run one at a time; true batching is done through the client API, which sends several mutations in a single call (see the Java sketch after this example).

put 'students', '101', 'academics:marks', '90'
put 'students', '102', 'academics:marks', '88'
put 'students', '103', 'academics:marks', '92'

This reduces network overhead and improves efficiency.
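
For completeness, a hedged Java sketch of client-side batching with Table.batch. It assumes 'table' is a Table instance as in the earlier snippets and sits in a method that declares IOException and InterruptedException; note that each row is atomic on its own, but the batch as a whole is not.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.util.Bytes;

// Build several Puts for different rows and send them in one call
List<Row> actions = new ArrayList<>();
actions.add(new Put(Bytes.toBytes("101"))
        .addColumn(Bytes.toBytes("academics"), Bytes.toBytes("marks"), Bytes.toBytes("90")));
actions.add(new Put(Bytes.toBytes("102"))
        .addColumn(Bytes.toBytes("academics"), Bytes.toBytes("marks"), Bytes.toBytes("88")));
actions.add(new Put(Bytes.toBytes("103"))
        .addColumn(Bytes.toBytes("academics"), Bytes.toBytes("marks"), Bytes.toBytes("92")));

Object[] results = new Object[actions.size()];
table.batch(actions, results);   // one round trip; results holds the per-action outcomes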

Key Takeaways

●​ HBase does not support full ACID transactions but provides atomic operations at
the row level.
●​ Check-and-Put/Delete ensures conditional updates, preventing race conditions.
●​ Increment is atomic, useful for counters.
●​ Batch mutations improve performance by reducing network calls.


Suitable use cases for HBase, along with why HBase fits each:

●​ Real-time Big Data Analytics: HBase handles massive data ingestion and fast queries over petabytes of data.
●​ Log Data Storage and Analysis: suitable for storing and analyzing logs from servers, applications, and IoT devices.
●​ Time-Series Data: efficient for handling sensor data, stock market prices, or system performance metrics.
●​ Fraud Detection: processes large transaction datasets in real time for anomaly detection.
●​ Social Media & Messaging: stores user interactions, chat messages, and feed data for scalability.
●​ Recommendation Systems: powers personalized recommendations in e-commerce, OTT platforms, and search engines.
●​ Geospatial Data Processing: stores and queries large-scale geospatial datasets (e.g., GPS tracking).
●​ Genomic Data Analysis: used in bioinformatics to store and process genetic sequencing data.
●​ IoT Data Storage: handles continuous data streams from IoT devices and sensors.
●​ Enterprise Data Lakes: works as a NoSQL backend for massive-scale data lakes.

HBase is ideal for write-heavy workloads, real-time read/write access, and large-scale
distributed applications. However, it’s not suitable for transactional systems requiring
strong ACID guarantees.

Here are some HBase query examples using an Employee database with Indian data.
We'll cover basic operations like creating a table, inserting data, retrieving data, and
filtering.


1. Creating an Employee Table

An employee table typically has columns like emp_id, name, designation, department,
salary, and location.

create 'employee', 'personal', 'professional'

●​ personal column family: Stores name, location.


●​ professional column family: Stores designation, department, salary.

2. Inserting Data

Adding employee records with Indian data.

put 'employee', '101', 'personal:name', 'Rajesh Kumar'

put 'employee', '101', 'personal:location', 'Chennai'

put 'employee', '101', 'professional:designation', 'Software Engineer'

put 'employee', '101', 'professional:department', 'IT'

put 'employee', '101', 'professional:salary', '800000'

put 'employee', '102', 'personal:name', 'Priya Sharma'

put 'employee', '102', 'personal:location', 'Mumbai'

put 'employee', '102', 'professional:designation', 'HR Manager'

put 'employee', '102', 'professional:department', 'HR'

put 'employee', '102', 'professional:salary', '900000'

put 'employee', '103', 'personal:name', 'Amit Singh'

put 'employee', '103', 'personal:location', 'Bangalore'

put 'employee', '103', 'professional:designation', 'Data Scientist'

put 'employee', '103', 'professional:department', 'AI & ML'

put 'employee', '103', 'professional:salary', '1200000'


3. Retrieving Data

a) Get details of a specific employee

get 'employee', '101'

COLUMN CELL

personal:name timestamp=xxx, value=Rajesh Kumar

personal:location timestamp=xxx, value=Chennai

professional:designation timestamp=xxx, value=Software Engineer

professional:department timestamp=xxx, value=IT

professional:salary timestamp=xxx, value=800000

b) Get only name and location of employee 102

get 'employee', '102', 'personal:name'

get 'employee', '102', 'personal:location'

c) Scan all employee records

scan 'employee'

d) Retrieve employees from a specific location (e.g., Chennai)

scan 'employee', { FILTER => "ValueFilter(=, 'binary:Chennai')" }

e) Retrieve employees earning more than ₹10,00,000

scan 'employee', { FILTER => "SingleColumnValueFilter('professional', 'salary', >, 'binary:1000000')" }

(Note: the binary comparator compares values as byte strings, so numeric comparisons on text values are reliable only when all values have the same number of digits; for true numeric filtering, store salaries as fixed-width binary numbers.)

4. Updating an Employee’s Salary

put 'employee', '103', 'professional:salary', '1400000'

5. Deleting an Employee Record

deleteall 'employee', '102'

6. Deleting a Specific Column


delete 'employee', '101', 'personal:location'

7. Count Total Employees

count 'employee'

3.6 Event Logging - Content Management Systems

Event Logging in Content Management Systems (CMS) Using HBase

HBase, a distributed, scalable, and NoSQL database that runs on Hadoop, is widely used for
handling large datasets and real-time analytics. When integrated with Content Management
Systems (CMS), HBase provides an efficient solution for event logging, ensuring high
availability and fault tolerance.

In a CMS, event logging is essential for tracking user activities, monitoring system
performance, and maintaining security. HBase's capability to handle massive amounts of
structured and semi-structured data makes it an ideal choice for storing event logs.

Why HBase for Event Logging?

1.​ Scalability: HBase can handle petabytes of data across distributed clusters, making it
suitable for large-scale CMS platforms with high user activity.
2.​ Real-time Data Ingestion: HBase supports real-time write operations, allowing
instant logging of events like user logins, content updates, and access control
changes.
3.​ Schema Flexibility: HBase allows dynamic schema evolution, which is crucial for
logging different types of events with varying attributes.
4.​ Fault Tolerance: Built on Hadoop's HDFS, HBase ensures data durability and fault
tolerance, preventing data loss during failures.

Use Cases in CMS Event Logging:

●​ Tracking content modifications (create, edit, delete).


●​ Monitoring user access patterns and login attempts.
●​ Auditing changes in user roles and permissions.
●​ Analyzing system errors and performance metrics.
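
As a hedged sketch, a CMS event log could be modeled with a single column family and a composite row key; all names below (cms_events, user42, the column qualifiers) are hypothetical.

create 'cms_events', 'event'

# Row key = user id + event timestamp, so one user's events group together
put 'cms_events', 'user42_20250101T101500', 'event:type',    'content_update'
put 'cms_events', 'user42_20250101T101500', 'event:page_id', 'post_1081'
put 'cms_events', 'user42_20250101T101500', 'event:status',  'success'

# All logged events for one user
scan 'cms_events', {FILTER => "PrefixFilter('user42_')"}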

Advantages:

●​ High-speed data retrieval for real-time analytics.


●​ Efficient handling of structured and unstructured log data.


●​ Seamless integration with Hadoop ecosystem tools like Apache Spark and Hive for
advanced analytics.

In conclusion, HBase provides a robust and scalable solution for managing event logs in a
CMS, enhancing security, performance monitoring, and compliance.

3.7 Blogging Platforms - Counters - Expiring Usage

Counters and Expiring Usage in Blogging Platforms Using HBase

HBase, a NoSQL distributed database built on Hadoop, is highly effective for managing
counters and expiring usage data in blogging platforms. Counters are essential for tracking
metrics like post views, likes, and comments, while expiring usage helps manage temporary
data like session activity and user engagement.

1. HBase Counters for Blogging Platforms

HBase provides an increment operation that allows atomic counter updates without
read-modify-write cycles. This is particularly useful for tracking:

●​ Post views
●​ Number of likes or comments
●​ User visits and engagement time

How it Works:

●​ HBase's Increment class allows updating counters efficiently.


●​ Each counter is stored as a column in an HBase table.
●​ For every new event (like a post view), the counter is incremented without locking
the row, ensuring high concurrency.

Example Schema:
Table: blog_metrics
Row Key: post_id
Column Family: counters
Column: views, likes, comments

Increment Command:


Increment increment = new Increment(Bytes.toBytes("post123"));


increment.addColumn(Bytes.toBytes("counters"), Bytes.toBytes("views"), 1);
hTable.increment(increment);
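
The same counter can also be driven from the HBase shell; the table, row key, and column names follow the blog_metrics schema sketched above.

incr 'blog_metrics', 'post123', 'counters:views', 1
get_counter 'blog_metrics', 'post123', 'counters:views'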

2. Expiring Usage in HBase

In a blogging platform, certain data (like session logs, temporary comments, or user
activity) should expire after a certain period. HBase provides Time-to-Live (TTL)
functionality to handle this.

How it Works:

●​ The TTL property can be set at the column family level.


●​ Data older than the specified TTL value is automatically deleted by HBase.

Setting TTL in HBase:


alter 'blog_metrics', {NAME => 'session_data', TTL => '86400'}

This command sets the TTL to 24 hours (86400 seconds). Any data older than this will be
automatically deleted.
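
TTL can also be set when the table is first created; a hedged sketch using the blog_metrics schema above, with session data expiring after 24 hours while counters are kept indefinitely.

create 'blog_metrics', {NAME => 'counters'}, {NAME => 'session_data', TTL => 86400}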

Use Cases in Blogging Platforms

●​ Tracking post views, likes, and comments using counters.


●​ Managing session data and temporary logs with TTL.
●​ Monitoring user engagement without manual cleanup of old data.

Conclusion

HBase's atomic counters and TTL-based expiring usage make it a powerful choice for
real-time analytics and efficient data management in blogging platforms. This approach
ensures scalability while automatically handling data expiration to optimize storage.


3.8 Visible surface detection methods

Visible Surface Detection Methods in HBase

In the context of Visible Surface Detection (VSD), which is a critical concept in computer
graphics for determining which surfaces are visible to the viewer, integrating HBase can
provide a scalable and distributed approach for managing large datasets and performing
computations efficiently.

HBase itself is not directly used for rendering graphics, but it plays a vital role in storing,
managing, and analyzing surface data and depth information when handling large 3D
models or scenes.

Why HBase for Visible Surface Detection?

●​ HBase can store large-scale 3D mesh data, depth maps, and z-buffer information.
●​ It supports real-time updates on surface visibility changes.
●​ It allows parallel processing of surface data for faster computations.

1. Storing Depth Information (Z-Buffer Method) in HBase

The Z-Buffer algorithm, a popular method for visible surface detection, relies on storing
depth values for each pixel.

HBase Schema for Z-Buffer Data:


Table: surface_detection
Row Key: pixel_id
Column Family: depth_info
Columns: depth_value, surface_id

Inserting Depth Data in HBase:


Put put = new Put(Bytes.toBytes("pixel_101"));
put.addColumn(Bytes.toBytes("depth_info"), Bytes.toBytes("depth_value"),
Bytes.toBytes(0.75));
put.addColumn(Bytes.toBytes("depth_info"), Bytes.toBytes("surface_id"),
Bytes.toBytes("surface_1"));
hTable.put(put);

2. Back-Face Culling in HBase


Back-face culling is a technique to eliminate surfaces that are not visible to the camera.
HBase can be used to store surface normals and view vectors and perform filtering on
the backend.

Schema for Surface Normal Data:


Table: surface_data
Row Key: surface_id
Column Family: normals
Columns: normal_vector, view_vector

3. Ray-Casting Algorithm Using HBase

In ray-casting, rays are traced from the camera to intersect with surfaces. HBase can store
ray intersection data and help in parallel processing for large 3D environments.

4. Expiring Surface Visibility Data (Using TTL)

HBase's TTL (Time-To-Live) feature can automatically remove outdated visibility data. For
example, temporary visibility data in dynamic environments can be set to expire after a
certain time.

alter 'surface_detection', {NAME => 'depth_info', TTL => '3600'}

Conclusion

HBase provides a scalable and distributed platform for handling large-scale Visible
Surface Detection (VSD) data. It can manage depth buffers, back-face culling data, and
ray intersection information, making it suitable for real-time rendering and 3D
graphics analysis in cloud environments.
