Unit III
Topics:
3.1 Column-Oriented NoSQL Databases Using Apache HBase
3.2 Column-Oriented NoSQL Databases Using Apache Cassandra
3.3 Architecture of HBase
3.4 What Is a Column-Family Data Store: Features and Consistency
3.5 Transactions, Availability, Query Features, Scaling, and Suitable Use Cases
3.6 Event Logging and Content Management Systems
3.7 Blogging Platforms, Counters, and Expiring Usage
3.8 Visible Surface Detection Methods

3.1 Column-Oriented NoSQL Databases Using Apache HBase
Key Features of HBase:
1. Column-Oriented Storage:
○ HBase stores data in column families, where each column is stored separately
on disk, allowing for more efficient read and write operations for certain
types of queries. This column-oriented storage model is particularly suited
for analytical workloads where access to specific columns in wide tables is
frequent.
2. Scalability:
○ HBase is designed to scale horizontally by adding nodes. The data is partitioned into regions, and these regions can be distributed across multiple servers (region servers) as the data grows.
3. Real-Time Access:
○ HBase provides real-time read/write access to data. This makes it suitable for
applications requiring low-latency access to large amounts of data.
4. Schema-less:
○ While HBase tables do have a schema, they are more flexible than traditional
relational databases. Each row can have a different set of columns, and the
columns are organized into families. This is particularly useful when the data
model changes frequently or when data comes in varied forms.
5. Strong Consistency:
○ HBase offers strong consistency guarantees for data, meaning that once a
write operation is successful, all subsequent reads will reflect that write. This
is important for applications requiring immediate consistency.
Advantages of HBase:
1. High Throughput:
○ HBase can handle high throughput for read and write operations, making it suitable for real-time applications, such as web analytics, sensor data, and user activity logs.
2. Flexible Schema Design:
○ HBase allows you to define flexible schema designs. Each row can have
different columns, and new columns can be added without downtime.
3. Scalable:
○ It scales linearly as data grows. When more data is added, you can add more
region servers, and HBase will distribute the load across the servers
automatically.
4. Integration with Hadoop Ecosystem:
○ Since HBase is built on top of Hadoop, it integrates well with other tools in
the Hadoop ecosystem, like Apache Hive, Apache Spark, and Apache Flume,
for big data processing, analytics, and stream processing.
5. Distributed and Fault Tolerant:
○ Data is replicated through HDFS, and when a region server fails its regions are automatically reassigned to other servers, so the cluster tolerates node failures without data loss.
Use Cases of HBase:
● Time-Series Data: HBase is suitable for storing and querying time-series data, such
as logs, sensor data, or event data that needs to be stored over long periods and
accessed efficiently.
● Real-Time Analytics: HBase can be used for real-time analytics of large datasets,
especially when queries need to access specific columns rather than entire rows.
● Recommendation Systems: Large-scale recommendation engines, like those used
in e-commerce or social media, can benefit from HBase's scalability and real-time
data access.
● Big Data Applications: Applications requiring the storage of large volumes of data,
such as those used in telecommunications, finance, and social media, are prime
candidates for HBase.
HBase Architecture:
1. Region Servers:
○ The region server is responsible for managing regions and serving requests. Each region is a subset of the table's data.
2. HMaster:
○ The HMaster is responsible for managing the overall cluster, including region
assignments, balancing load, and handling failures.
3. ZooKeeper:
○ ZooKeeper coordinates the cluster: it tracks which servers are alive, stores cluster metadata, and helps clients locate the region servers that hold the data they need.
Example:
Consider a table for storing user information with the following columns:
● Table: Users
○ Row Key: user_id
○ Column Family: profile
■ Columns: name, email, phone_number
○ Column Family: preferences
■ Columns: theme, language
Each user would have a row with a unique user_id, and each row would have data in the
profile and preferences column families.
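A quick sketch of creating this table and inserting one user in the HBase shell (row key and values hypothetical):
create 'Users', 'profile', 'preferences'
put 'Users', 'user_1001', 'profile:name', 'Asha Nair'
put 'Users', 'user_1001', 'profile:email', 'asha@example.com'
put 'Users', 'user_1001', 'preferences:theme', 'dark'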
Conclusion:
HBase is a scalable, column-oriented NoSQL database that provides real-time read/write access to very large, sparse datasets, and it integrates tightly with the Hadoop ecosystem.

3.2 Column-Oriented NoSQL Databases Using Apache Cassandra
Key Features of Cassandra:
1. Column-Oriented Storage:
○ Cassandra organizes data into tables (column families); on disk, data is stored by column rather than by row, which allows for more efficient queries on specific columns.
2. Distributed and Decentralized:
○ Cassandra uses a peer-to-peer architecture in which every node plays the same role; there is no master node and therefore no single point of failure.
3. Linear Scalability:
○ Cassandra provides linear scalability, meaning you can add more nodes to a Cassandra cluster to handle more data and requests without affecting performance. It is optimized for very large datasets and can scale out horizontally across thousands of nodes.
4. High Availability and Fault Tolerance:
○ One of Cassandra's core features is its ability to offer high availability. It does
this by replicating data across multiple nodes and data centers. If a node fails,
Cassandra can still serve requests using other replicas of the data. This is
crucial for applications that cannot afford downtime.
5. Eventual Consistency:
○ By default, Cassandra favors availability over immediate consistency: replicas converge over time, and the consistency level of each read and write can be tuned per operation.
Cassandra Data Model:
1. Keyspace:
○ A keyspace is the outermost container (similar to a database in relational systems); it groups tables and defines replication settings.
2. Primary Key:
○ Partition Key: The partition key is used to distribute data across different
nodes in the cluster. All rows with the same partition key will be stored
together on the same node.
○ Clustering Key: The clustering key determines the order in which rows are
stored within the same partition, which allows for efficient range queries.
Keyspace:
CREATE KEYSPACE user_data WITH REPLICATION = {'class' : 'SimpleStrategy',
'replication_factor' : 3};
Table:
CREATE TABLE user_data.users (
user_id UUID PRIMARY KEY,
first_name TEXT,
last_name TEXT,
email TEXT,
phone_number TEXT,
preferences MAP<TEXT, TEXT>
);
In this table:
● user_id is the partition key, so each user’s data is stored together on a single node.
● preferences is a MAP collection, which allows flexible key-value attributes per user without schema changes.
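A minimal CQL sketch of inserting and reading a row (values hypothetical):
INSERT INTO user_data.users (user_id, first_name, last_name, email, phone_number, preferences)
VALUES (uuid(), 'Asha', 'Nair', 'asha@example.com', '9876543210', {'theme': 'dark', 'language': 'en'});
SELECT first_name, email FROM user_data.users LIMIT 10;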
Benefits of Cassandra:
1. Scalability:
○ Cassandra scales linearly: adding nodes increases capacity and throughput without downtime or manual resharding.
5. Flexible Schema:
○ While Cassandra has a schema, it allows flexibility for data modeling. Each row can have different columns, and columns can be added dynamically. This is beneficial when the data structure evolves over time.
6. Distributed Architecture:
○ Data is distributed across all nodes using consistent hashing and replicated according to the keyspace's replication settings, so the cluster has no single point of failure.
Use Cases of Cassandra:
● Real-Time Big Data Applications: Due to its ability to handle massive data volumes
and provide low-latency read/write operations, Cassandra is ideal for real-time big
data use cases like social media analytics, recommendation engines, and sensor data.
● Time-Series Data: Cassandra’s ability to scale horizontally and handle high-write
loads makes it suitable for time-series data applications such as IoT devices or stock
market data.
● Event Logging and Monitoring: Applications that need to store and analyze large
volumes of event or log data benefit from Cassandra’s write optimization and high
availability.
● Content Management Systems: Systems that store large, distributed datasets, like
those used for managing digital content (videos, articles, images), can benefit from
Cassandra's ability to handle large, unstructured data.
Cassandra vs HBase:
● Data Model: While both HBase and Cassandra use a column-family model,
Cassandra is more flexible in terms of schema design and can handle a wider variety
of workloads. HBase typically requires more upfront schema definition, while
Cassandra allows for more dynamic changes.
● Consistency Model: HBase offers strong consistency by default, while Cassandra
uses eventual consistency but allows the user to tune consistency levels.
● Cluster Management: HBase relies on a master-slave architecture, while
Cassandra’s peer-to-peer model is fully decentralized.
● Performance: Cassandra is often more suited for high-throughput write-heavy
applications, while HBase is more focused on large-scale read-heavy workloads.
Conclusion:
Cassandra is a highly available, write-optimized, peer-to-peer column-family store. Its tunable consistency and linear scalability make it well suited to large, always-on applications.

3.3 Architecture of HBase
Apache HBase is a distributed, column-oriented NoSQL database that is part of the Hadoop
ecosystem. It is designed to handle large-scale, sparse datasets across a cluster of
commodity servers. HBase is inspired by Google Bigtable and provides real-time random
access to large datasets, scaling horizontally to meet the needs of big data applications.
Components of HBase:
1. HBase Client:
○ The HBase client is used by applications to interact with the HBase cluster.
Clients can perform CRUD (Create, Read, Update, Delete) operations on
HBase tables. Clients can directly interact with the region servers or use the
HBase API to connect and query data from the cluster.
2. HBase Master:
○ The HBase Master monitors region servers, assigns regions to them, balances load across the cluster, and handles administrative operations such as schema changes and failover.
5. HFiles:
○ Data in HBase is stored in HFiles, which are essentially files stored in HDFS.
○ HFiles are immutable files that store the data of a region. Once data is
written to an HFile, it is not modified. New data is written to a MemStore,
and periodically, the contents of the MemStore are flushed to a new HFile on
HDFS.
○ HFiles are efficient for large, sequential read and write operations.
6. MemStore:
○ The MemStore is an in-memory buffer used for writing data to HBase. When
a write request (like a put operation) is made, data is first stored in the
MemStore, which exists in memory.
○ When the MemStore reaches a certain threshold (configured by the
hbase.hregion.memstore.flush.size property), its contents are flushed to disk
as an HFile in HDFS. This process is called memstore flushing.
○ After flushing, the MemStore is cleared, and the new data is available for
future writes or reads.
7. Write-Ahead Log (WAL):
○ Every write is first recorded in the WAL (stored in HDFS) before it is applied to the MemStore. If a region server crashes, the WAL is replayed to recover writes that had not yet been flushed to HFiles.
10. Compaction:
● Compaction is the process of merging smaller HFiles into larger ones to optimize
storage and performance. Over time, as data is written and flushed to disk, multiple
small HFiles are created. As more data is inserted, the number of HFiles increases,
which can degrade performance.
● Minor Compaction: A background process that merges smaller HFiles in the same
region into a larger file.
● Major Compaction: A more intensive process that consolidates HFiles and removes
deleted or outdated data (tombstones).
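Compactions can also be triggered manually from the HBase shell; a minimal sketch (table name hypothetical):
compact 'mytable'         # request a minor compaction
major_compact 'mytable'   # request a major compaction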
Consistency in HBase:
In HBase, consistency refers to the system’s ability to ensure that data remains consistent
across all nodes, especially in a distributed environment. HBase primarily offers strong
consistency for single-row operations but can provide various consistency guarantees for
multi-row operations, depending on how the system is configured and the consistency level
chosen.
1. Strong Consistency (Single-row operations)
○ HBase provides strong consistency for read and write operations on a single
row. This means that once a write is acknowledged, any subsequent read
from the same row will reflect the latest data, regardless of which region
server is accessed.
2. Eventual Consistency (Multi-row operations)
○ For operations that span multiple rows or regions, HBase does not give the same guarantee; reads may briefly return stale data, as discussed in the scan example below.
Example: Strong Consistency (Single Row)
For single-row operations, HBase guarantees strong consistency. Once a write to a row is successful, any subsequent reads will return the most recent data for that row.
// Minimal sketch; table is an open org.apache.hadoop.hbase.client.Table instance.
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("column1"), Bytes.toBytes("value1"));
table.put(put);
In this example:
● A Put writes "value1" to column cf:column1 of row "row1"; the write is acknowledged only after it is durably recorded.
After the write, all subsequent reads (on the same row) will see the most recent value (e.g.,
"value1") and will reflect the same data consistently across different regions or region
servers.
HBase operates with strong consistency for single-row operations, but when it comes to
multi-row or cross-region queries (like scans spanning across regions), consistency can
become eventual. This means the system may return stale data for rows that were recently
updated or involve multiple replicas.
Example: Eventual Consistency (Multi-row Scan)
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("column1"));
ResultScanner scanner = table.getScanner(scan); // iterate over results here
scanner.close();
In this example:
● The Scan operation retrieves multiple rows and columns from HBase.
● If these rows are distributed across different regions or servers, there may be slight
delays in replicating data across the cluster.
● If recent updates were made to some of the rows, you may experience eventual
consistency, where the data might not be up-to-date on all replicas immediately.
Transactions in HBase:
While HBase does not support full ACID transactions (like relational databases), it ensures
atomicity for operations within a single row. This means that any updates to a row
(whether Put, Delete, or other operations) will be applied atomically to that row.
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("column1"), Bytes.toBytes("newValue1"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("column2"), Bytes.toBytes("newValue2"));
table.put(put);
In this example:
● Both columns of row "row1" are updated through a single Put, so the changes are applied atomically: either both columns are written or neither is.
When performing scans or operations across multiple rows, HBase does not guarantee
immediate consistency. In cases of network partitions or node failures, different region
servers might have inconsistent versions of the data, which can result in eventual
consistency.
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("cf")); // scan all rows in column family "cf"
ResultScanner scanner = table.getScanner(scan);
scanner.close();
In this example:
● A Scan operation is done across multiple rows within the column family "cf".
● Since this operation spans multiple region servers or regions, the data read might be
eventually consistent, meaning it may not reflect the most recent updates for some
rows, especially if those rows are being replicated or have just been written.
Key Points on HBase Consistency:
● Single-Row Operations: For single-row reads and writes, HBase ensures strong
consistency. Therefore, any Put, Get, or Delete operations on a single row will
always be consistent.
● Write-Ahead Log (WAL): For durability and consistency, writes are first logged in
the WAL before being applied to the MemStore and HFiles. This ensures that even if
a region server crashes, data can be recovered to the last consistent state from the
WAL.
● Versioning: HBase supports versioning of cells. When a row is updated, the older
versions are retained based on the configured versioning policy. This allows you to
query the row's previous versions, ensuring that the system maintains consistency
in historical data.
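Versioning can be configured and inspected from the HBase shell; a minimal sketch (table and column hypothetical):
create 'students', {NAME => 'academics', VERSIONS => 3}
get 'students', '201', {COLUMN => 'academics:marks', VERSIONS => 3}   # returns up to 3 versions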
Query Features in HBase:
● Row Key-based Queries: Optimized for retrieving data based on the row key; efficient for single-row lookups. Example: retrieving a user’s profile based on user ID.
● Range Queries: Supports range scans based on row keys; efficient for querying a range of rows sorted by row key. Example: fetching all rows for a particular time period (e.g., logs from a specific date range).
● Scan Operations: Allows scanning of the entire table or a subset of rows based on conditions like start/stop row key. Example: scanning rows for a specific range, e.g., all products from row1 to row100.
● No Joins: HBase does not support joins like relational databases; queries must be done separately or handled by external tools like Apache Phoenix or Apache Hive. Example: for complex relationships, handle joins outside HBase (e.g., in a data processing layer).
● Partial Scans: Allows partial scans to filter specific rows or columns within a range, improving scan efficiency. Example: scanning only a subset of rows within a specific range of row keys.
● Throttling Queries: Supports limiting the amount of data returned in a query to avoid overwhelming the client or server. Example: limiting the number of rows retrieved in a scan to avoid system overload.
Here is a realistic example of querying in HBase using an Indian student database.
Table: students
Row Key: student ID (e.g., 201)
Column Family: personal → columns: name, age
Column Family: academics → columns: grade, marks
Let’s get all details of student Rajesh Kumar (Row Key: 201):
get 'students', '201'
Output:
COLUMN CELL
personal:name timestamp=..., value=Rajesh Kumar
personal:age timestamp=..., value=15
academics:grade timestamp=..., value=10
academics:marks timestamp=..., value=88
To fetch only a single column, for example the marks of student 202:
get 'students', '202', {COLUMN => 'academics:marks'}
Output:
COLUMN CELL
academics:marks timestamp=..., value=92
To scan the entire table:
scan 'students'
Output:
ROW COLUMN+CELL
201 column=personal:name, timestamp=..., value=Rajesh Kumar
201 column=personal:age, timestamp=..., value=15
201 column=academics:grade, timestamp=..., value=10
201 column=academics:marks, timestamp=..., value=88
...
If we want to fetch all students whose Row Key starts with "20" (e.g., 201, 202, 203):
scan 'students', {ROWPREFIXFILTER => '20'}
Output:
ROW COLUMN+CELL
201 column=personal:name, timestamp=..., value=Rajesh Kumar
202 column=personal:name, timestamp=..., value=Priya Sharma
203 column=personal:name, timestamp=..., value=Arjun Reddy
...
To fetch only the name of student 202, a bounded scan can be used:
scan 'students', {STARTROW => '202', STOPROW => '203', COLUMNS => 'personal:name'}
Output:
ROW COLUMN+CELL
202 column=personal:name, timestamp=..., value=Priya Sharma
HBase does not support traditional ACID transactions like relational databases (e.g.,
MySQL, PostgreSQL). However, it provides atomicity at the row level and some
mechanisms to control data consistency. Here are the key transaction control mechanisms:
Row-Level Atomicity:
HBase ensures that all column families of a row are updated atomically.
If we update marks and grade in the same row, HBase ensures atomicity.
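A minimal Java sketch of such an atomic row update (values hypothetical; table is an open Table instance):
Put put = new Put(Bytes.toBytes("201"));
put.addColumn(Bytes.toBytes("academics"), Bytes.toBytes("marks"), Bytes.toBytes("95"));
put.addColumn(Bytes.toBytes("academics"), Bytes.toBytes("grade"), Bytes.toBytes("11"));
table.put(put); // both columns are written in one atomic row mutation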
HBase guarantees that both operations will not be partially written—either both succeed
or none.
Atomic Increment:
HBase provides an atomic increment method to increase numerical values without race conditions.
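For example, in the HBase shell (counter column hypothetical):
incr 'students', '201', 'academics:logins', 1        # atomically add 1
get_counter 'students', '201', 'academics:logins'    # read the current counter value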
Key Takeaways
● HBase does not support full ACID transactions but provides atomic operations at
the row level.
● Check-and-Put/Delete ensures conditional updates, preventing race conditions.
● Increment is atomic, useful for counters.
● Batch mutations improve performance by reducing network calls.
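A minimal Java sketch of a batch mutation (row keys and values hypothetical; table is an open Table instance):
List<Row> actions = new ArrayList<>();
Put put = new Put(Bytes.toBytes("201"));
put.addColumn(Bytes.toBytes("academics"), Bytes.toBytes("marks"), Bytes.toBytes("90"));
actions.add(put);
actions.add(new Delete(Bytes.toBytes("202")));
Object[] results = new Object[actions.size()];
table.batch(actions, results); // one round trip for several mutations; note: not atomic across rows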
Suitable Use Cases for HBase:
● Real-time Big Data Analytics: HBase handles massive data ingestion and fast queries over petabytes of data.
● Log Data Storage and Analysis: Suitable for storing and analyzing logs from servers, applications, and IoT devices.
● Time-Series Data: Efficient for handling sensor data, stock market prices, or system performance metrics.
● Social Media & Messaging: Stores user interactions, chat messages, and feed data for scalability.
● Geospatial Data Processing: Stores and queries large-scale geospatial datasets (e.g., GPS tracking).
● IoT Data Storage: Handles continuous data streams from IoT devices and sensors.
● Enterprise Data Lakes: Works as a NoSQL backend for massive-scale data lakes.
HBase is ideal for write-heavy workloads, real-time read/write access, and large-scale
distributed applications. However, it’s not suitable for transactional systems requiring
strong ACID guarantees.
Here are some HBase query examples using an Employee database with Indian data.
We'll cover basic operations like creating a table, inserting data, retrieving data, and
filtering.
1. Creating the Table
An employee table typically has columns like emp_id, name, designation, department, salary, and location.
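A possible layout in the HBase shell (column-family names assumed for illustration):
create 'employee', 'personal', 'professional'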
2. Inserting Data
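For example, inserting one employee (row key and values hypothetical):
put 'employee', 'E001', 'personal:name', 'Amit Sharma'
put 'employee', 'E001', 'personal:location', 'Chennai'
put 'employee', 'E001', 'professional:designation', 'Software Engineer'
put 'employee', 'E001', 'professional:salary', '75000'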
3. Retrieving Data
To fetch a single employee’s record (row key as inserted above):
get 'employee', 'E001'
Output:
COLUMN CELL
...
To scan the entire table:
scan 'employee'
To count the number of rows in the table:
count 'employee'
3.6 Event Logging in Content Management Systems
HBase, a distributed, scalable NoSQL database that runs on Hadoop, is widely used for
handling large datasets and real-time analytics. When integrated with Content Management
Systems (CMS), HBase provides an efficient solution for event logging, ensuring high
availability and fault tolerance.
In a CMS, event logging is essential for tracking user activities, monitoring system
performance, and maintaining security. HBase's capability to handle massive amounts of
structured and semi-structured data makes it an ideal choice for storing event logs.
1. Scalability: HBase can handle petabytes of data across distributed clusters, making it
suitable for large-scale CMS platforms with high user activity.
2. Real-time Data Ingestion: HBase supports real-time write operations, allowing
instant logging of events like user logins, content updates, and access control
changes.
3. Schema Flexibility: HBase allows dynamic schema evolution, which is crucial for
logging different types of events with varying attributes.
4. Fault Tolerance: Built on Hadoop's HDFS, HBase ensures data durability and fault
tolerance, preventing data loss during failures.
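Example Schema (hypothetical) for CMS event logs:
Table: cms_events
Row Key: user_id + reversed timestamp (so the latest events scan first)
Column Family: event → columns: type, resource, ip_address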
Advantages: scalable storage for high-volume logs, real-time ingestion, flexible event schemas, and durable, fault-tolerant retention.
In conclusion, HBase provides a robust and scalable solution for managing event logs in a
CMS, enhancing security, performance monitoring, and compliance.
3.7 Blogging Platforms: Counters and Expiring Usage
HBase, a NoSQL distributed database built on Hadoop, is highly effective for managing
counters and expiring usage data in blogging platforms. Counters are essential for tracking
metrics like post views, likes, and comments, while expiring usage helps manage temporary
data like session activity and user engagement.
HBase provides an increment operation that allows atomic counter updates without
read-modify-write cycles. This is particularly useful for tracking:
● Post views
● Number of likes or comments
● User visits and engagement time
How it Works:
The increment operation updates a counter on the server side in a single atomic step, so many concurrent clients can update the same counter without overwriting each other.
Example Schema:
Table: blog_metrics
Row Key: post_id
Column Family: counters
Column: views, likes, comments
Increment Command (row key hypothetical):
incr 'blog_metrics', 'post_101', 'counters:views', 1
Expiring Usage with TTL:
In a blogging platform, certain data (like session logs, temporary comments, or user
activity) should expire after a certain period. HBase provides Time-to-Live (TTL)
functionality to handle this.
How it Works:
A TTL (in seconds) is set on a column family; for example:
alter 'blog_metrics', {NAME => 'counters', TTL => 86400}
This command sets the TTL to 24 hours (86400 seconds). Any data older than this will be
automatically deleted.
Conclusion
HBase's atomic counters and TTL-based expiring usage make it a powerful choice for
real-time analytics and efficient data management in blogging platforms. This approach
ensures scalability while automatically handling data expiration to optimize storage.
3.8 Visible Surface Detection Methods
In the context of Visible Surface Detection (VSD), which is a critical concept in computer
graphics for determining which surfaces are visible to the viewer, integrating HBase can
provide a scalable and distributed approach for managing large datasets and performing
computations efficiently.
HBase itself is not directly used for rendering graphics, but it plays a vital role in storing,
managing, and analyzing surface data and depth information when handling large 3D
models or scenes.
● HBase can store large-scale 3D mesh data, depth maps, and z-buffer information.
● It supports real-time updates on surface visibility changes.
● It allows parallel processing of surface data for faster computations.
The Z-Buffer algorithm, a popular method for visible surface detection, relies on storing
depth values for each pixel.
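A possible (hypothetical) HBase layout for storing z-buffer data:
Table: zbuffer
Row Key: frame_id + pixel coordinates (e.g., 'f001_x0100_y0200')
Column Family: depth → columns: z_value, surface_id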
Back-face culling is a technique to eliminate surfaces that are not visible to the camera.
HBase can be used to store surface normals and view vectors and perform filtering on
the backend.
In ray-casting, rays are traced from the camera to intersect with surfaces. HBase can store
ray intersection data and help in parallel processing for large 3D environments.
HBase's TTL (Time-To-Live) feature can automatically remove outdated visibility data. For
example, temporary visibility data in dynamic environments can be set to expire after a
certain time.
Conclusion
HBase provides a scalable and distributed platform for handling large-scale Visible
Surface Detection (VSD) data. It can manage depth buffers, back-face culling data, and
ray intersection information, making it suitable for real-time rendering and 3D
graphics analysis in cloud environments.