Unit III
Topics:
3.1 Column-Oriented NoSQL Databases Using Apache HBase
3.2 Column-Oriented NoSQL Databases Using Apache Cassandra
3.3 Architecture of HBase
3.4 What Is a Column-Family Data Store: Features and Consistency
3.5 Transactions, Availability, Query Features, Scaling, and Suitable Use Cases
3.6 Event Logging and Content Management Systems
3.7 Blogging Platforms, Counters, and Expiring Usage
3.8 Visible Surface Detection Methods

3.1 Column-Oriented NoSQL Databases Using Apache HBase
Key Features of HBase:
1. Column-Oriented Storage:
○ HBase stores data in column families, where each column is stored separately
on disk, allowing for more efficient read and write operations for certain
types of queries. This column-oriented storage model is particularly suited
for analytical workloads where access to specific columns in wide tables is
frequent.
2. Scalability:
○ HBase is designed to scale horizontally by adding nodes. The data is partitioned into regions, and these regions can be distributed across multiple servers (region servers) as the data grows.
3. Real-Time Access:
○ HBase provides real-time read/write access to data. This makes it suitable for
applications requiring low-latency access to large amounts of data.
4. Schema-less:
○ While HBase tables do have a schema, they are more flexible than traditional
relational databases. Each row can have a different set of columns, and the
columns are organized into families. This is particularly useful when the data
model changes frequently or when data comes in varied forms.
5. Strong Consistency:
○ HBase offers strong consistency guarantees for data, meaning that once a
write operation is successful, all subsequent reads will reflect that write. This
is important for applications requiring immediate consistency.
Advantages of HBase:
1. High Throughput:
○ HBase can handle high throughput for read and write operations, making it suitable for real-time applications, such as web analytics, sensor data, and user activity logs.
2. Flexible Schema Design:
○ HBase allows you to define flexible schema designs. Each row can have
different columns, and new columns can be added without downtime.
3. Scalable:
○ It scales linearly as data grows. When more data is added, you can add more
region servers, and HBase will distribute the load across the servers
automatically.
4. Integration with Hadoop Ecosystem:
○ Since HBase is built on top of Hadoop, it integrates well with other tools in
the Hadoop ecosystem, like Apache Hive, Apache Spark, and Apache Flume,
for big data processing, analytics, and stream processing.
5. Distributed and Fault Tolerant:
○ Data is replicated through HDFS, and when a region server fails its regions are automatically reassigned to other servers, so the cluster tolerates node failures without data loss.
Use Cases of HBase:
● Time-Series Data: HBase is suitable for storing and querying time-series data, such
as logs, sensor data, or event data that needs to be stored over long periods and
accessed efficiently.
● Real-Time Analytics: HBase can be used for real-time analytics of large datasets,
especially when queries need to access specific columns rather than entire rows.
● Recommendation Systems: Large-scale recommendation engines, like those used
in e-commerce or social media, can benefit from HBase's scalability and real-time
data access.
● Big Data Applications: Applications requiring the storage of large volumes of data,
such as those used in telecommunications, finance, and social media, are prime
candidates for HBase.
HBase Architecture:
1. Region Servers:
○ The region server is responsible for managing regions and serving requests. Each region is a subset of the table's data.
2. HMaster:
○ The HMaster is responsible for managing the overall cluster, including region
assignments, balancing load, and handling failures.
3. ZooKeeper:
○ ZooKeeper coordinates the cluster: it tracks which servers are alive, stores cluster metadata, and helps clients locate the region servers that hold the data they need.
Example:
Consider a table for storing user information with the following columns:
● Table: Users
○ Row Key: user_id
○ Column Family: profile
■ Columns: name, email, phone_number
○ Column Family: preferences
■ Columns: theme, language
Each user would have a row with a unique user_id, and each row would have data in the
profile and preferences column families.
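A quick sketch of creating this table and inserting one user in the HBase shell (row key and values hypothetical):
create 'Users', 'profile', 'preferences'
put 'Users', 'user_1001', 'profile:name', 'Asha Nair'
put 'Users', 'user_1001', 'profile:email', 'asha@example.com'
put 'Users', 'user_1001', 'preferences:theme', 'dark'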
Conclusion:
HBase is a scalable, column-oriented NoSQL database that provides real-time read/write access to very large, sparse datasets, and it integrates tightly with the Hadoop ecosystem.

3.2 Column-Oriented NoSQL Databases Using Apache Cassandra
Key Features of Cassandra:
1. Column-Oriented Storage:
○ Cassandra organizes data into tables (column families); on disk, data is stored by column rather than by row, which allows for more efficient queries on specific columns.
2. Distributed and Decentralized:
○ Cassandra uses a peer-to-peer architecture in which every node plays the same role; there is no master node and therefore no single point of failure.
3. Linear Scalability:
○ Cassandra provides linear scalability, meaning you can add more nodes to a Cassandra cluster to handle more data and requests without affecting performance. It is optimized for very large datasets and can scale out horizontally across thousands of nodes.
4. High Availability and Fault Tolerance:
○ One of Cassandra's core features is its ability to offer high availability. It does
this by replicating data across multiple nodes and data centers. If a node fails,
Cassandra can still serve requests using other replicas of the data. This is
crucial for applications that cannot afford downtime.
5. Eventual Consistency:
○ By default, Cassandra favors availability over immediate consistency: replicas converge over time, and the consistency level of each read and write can be tuned per operation.
Cassandra Data Model:
1. Keyspace:
○ A keyspace is the outermost container (similar to a database in relational systems); it groups tables and defines replication settings.
2. Primary Key:
○ Partition Key: The partition key is used to distribute data across different
nodes in the cluster. All rows with the same partition key will be stored
together on the same node.
○ Clustering Key: The clustering key determines the order in which rows are
stored within the same partition, which allows for efficient range queries.
Keyspace:
CREATE KEYSPACE user_data WITH REPLICATION = {'class' : 'SimpleStrategy',
'replication_factor' : 3};
Table:
CREATE TABLE user_data.users (
user_id UUID PRIMARY KEY,
first_name TEXT,
last_name TEXT,
email TEXT,
phone_number TEXT,
preferences MAP<TEXT, TEXT>
);
In this table:
● user_id is the partition key, so each user’s data is stored together on a single node.
● preferences is a MAP collection, which allows flexible key-value attributes per user without schema changes.
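A minimal CQL sketch of inserting and reading a row (values hypothetical):
INSERT INTO user_data.users (user_id, first_name, last_name, email, phone_number, preferences)
VALUES (uuid(), 'Asha', 'Nair', 'asha@example.com', '9876543210', {'theme': 'dark', 'language': 'en'});
SELECT first_name, email FROM user_data.users LIMIT 10;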
Benefits of Cassandra:
1. Scalability:
○ Cassandra scales linearly: adding nodes increases capacity and throughput without downtime or manual resharding.
5. Flexible Schema:
○ While Cassandra has a schema, it allows flexibility for data modeling. Each row can have different columns, and columns can be added dynamically. This is beneficial when the data structure evolves over time.
6. Distributed Architecture:
○ Data is distributed across all nodes using consistent hashing and replicated according to the keyspace's replication settings, so the cluster has no single point of failure.
Use Cases of Cassandra:
● Real-Time Big Data Applications: Due to its ability to handle massive data volumes
and provide low-latency read/write operations, Cassandra is ideal for real-time big
data use cases like social media analytics, recommendation engines, and sensor data.
● Time-Series Data: Cassandra’s ability to scale horizontally and handle high-write
loads makes it suitable for time-series data applications such as IoT devices or stock
market data.
● Event Logging and Monitoring: Applications that need to store and analyze large
volumes of event or log data benefit from Cassandra’s write optimization and high
availability.
● Content Management Systems: Systems that store large, distributed datasets, like
those used for managing digital content (videos, articles, images), can benefit from
Cassandra's ability to handle large, unstructured data.
Cassandra vs HBase:
● Data Model: While both HBase and Cassandra use a column-family model,
Cassandra is more flexible in terms of schema design and can handle a wider variety
of workloads. HBase typically requires more upfront schema definition, while
Cassandra allows for more dynamic changes.
● Consistency Model: HBase offers strong consistency by default, while Cassandra
uses eventual consistency but allows the user to tune consistency levels.
● Cluster Management: HBase relies on a master-slave architecture, while
Cassandra’s peer-to-peer model is fully decentralized.
● Performance: Cassandra is often more suited for high-throughput write-heavy
applications, while HBase is more focused on large-scale read-heavy workloads.
Conclusion:
Cassandra is a highly available, write-optimized, peer-to-peer column-family store. Its tunable consistency and linear scalability make it well suited to large, always-on applications.

3.3 Architecture of HBase
Apache HBase is a distributed, column-oriented NoSQL database that is part of the Hadoop
ecosystem. It is designed to handle large-scale, sparse datasets across a cluster of
commodity servers. HBase is inspired by Google Bigtable and provides real-time random
access to large datasets, scaling horizontally to meet the needs of big data applications.
Components of HBase:
1. HBase Client:
○ The HBase client is used by applications to interact with the HBase cluster.
Clients can perform CRUD (Create, Read, Update, Delete) operations on
HBase tables. Clients can directly interact with the region servers or use the
HBase API to connect and query data from the cluster.
2. HBase Master:
○ The HBase Master monitors region servers, assigns regions to them, balances load across the cluster, and handles administrative operations such as schema changes and failover.
5. HFiles:
○ Data in HBase is stored in HFiles, which are essentially files stored in HDFS.
○ HFiles are immutable files that store the data of a region. Once data is
written to an HFile, it is not modified. New data is written to a MemStore,
and periodically, the contents of the MemStore are flushed to a new HFile on
HDFS.
○ HFiles are efficient for large, sequential read and write operations.
6. MemStore:
○ The MemStore is an in-memory buffer used for writing data to HBase. When
a write request (like a put operation) is made, data is first stored in the
MemStore, which exists in memory.
○ When the MemStore reaches a certain threshold (configured by the
hbase.hregion.memstore.flush.size property), its contents are flushed to disk
as an HFile in HDFS. This process is called memstore flushing.
○ After flushing, the MemStore is cleared, and the new data is available for
future writes or reads.
7. Write-Ahead Log (WAL):
○ Every write is first recorded in the WAL (stored in HDFS) before it is applied to the MemStore. If a region server crashes, the WAL is replayed to recover writes that had not yet been flushed to HFiles.
10. Compaction:
● Compaction is the process of merging smaller HFiles into larger ones to optimize
storage and performance. Over time, as data is written and flushed to disk, multiple
small HFiles are created. As more data is inserted, the number of HFiles increases,
which can degrade performance.
● Minor Compaction: A background process that merges smaller HFiles in the same
region into a larger file.
● Major Compaction: A more intensive process that consolidates HFiles and removes
deleted or outdated data (tombstones).
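Compactions can also be triggered manually from the HBase shell; a minimal sketch (table name hypothetical):
compact 'mytable'         # request a minor compaction
major_compact 'mytable'   # request a major compaction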
Consistency in HBase:
In HBase, consistency refers to the system’s ability to ensure that data remains consistent
across all nodes, especially in a distributed environment. HBase primarily offers strong
consistency for single-row operations but can provide various consistency guarantees for
multi-row operations, depending on how the system is configured and the consistency level
chosen.
1. Strong Consistency (Single-row operations)
○ HBase provides strong consistency for read and write operations on a single
row. This means that once a write is acknowledged, any subsequent read
from the same row will reflect the latest data, regardless of which region
server is accessed.
2. Eventual Consistency (Multi-row operations)
○ For operations that span multiple rows or regions, HBase does not give the same guarantee; reads may briefly return stale data, as discussed in the scan example below.
Example: Strong Consistency (Single Row)
For single-row operations, HBase guarantees strong consistency. Once a write to a row is successful, any subsequent reads will return the most recent data for that row.
// Minimal sketch; table is an open org.apache.hadoop.hbase.client.Table instance.
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("column1"), Bytes.toBytes("value1"));
table.put(put);
In this example:
● A Put writes "value1" to column cf:column1 of row "row1"; the write is acknowledged only after it is durably recorded.
After the write, all subsequent reads (on the same row) will see the most recent value (e.g.,
"value1") and will reflect the same data consistently across different regions or region
servers.
HBase operates with strong consistency for single-row operations, but when it comes to
multi-row or cross-region queries (like scans spanning across regions), consistency can
become eventual. This means the system may return stale data for rows that were recently
updated or involve multiple replicas.
Example: Eventual Consistency (Multi-row Scan)
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("column1"));
ResultScanner scanner = table.getScanner(scan); // iterate over results here
scanner.close();
In this example:
● The Scan operation retrieves multiple rows and columns from HBase.
● If these rows are distributed across different regions or servers, there may be slight
delays in replicating data across the cluster.
● If recent updates were made to some of the rows, you may experience eventual
consistency, where the data might not be up-to-date on all replicas immediately.
Transactions in HBase:
While HBase does not support full ACID transactions (like relational databases), it ensures
atomicity for operations within a single row. This means that any updates to a row
(whether Put, Delete, or other operations) will be applied atomically to that row.
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("column1"), Bytes.toBytes("newValue1"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("column2"), Bytes.toBytes("newValue2"));
table.put(put);
In this example:
● Both columns of row "row1" are updated through a single Put, so the changes are applied atomically: either both columns are written or neither is.
When performing scans or operations across multiple rows, HBase does not guarantee
immediate consistency. In cases of network partitions or node failures, different region
servers might have inconsistent versions of the data, which can result in eventual
consistency.
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("cf")); // scan all rows in column family "cf"
ResultScanner scanner = table.getScanner(scan);
scanner.close();
In this example:
● A Scan operation is done across multiple rows within the column family "cf".
● Since this operation spans multiple region servers or regions, the data read might be
eventually consistent, meaning it may not reflect the most recent updates for some
rows, especially if those rows are being replicated or have just been written.
Key Points on HBase Consistency:
● Single-Row Operations: For single-row reads and writes, HBase ensures strong
consistency. Therefore, any Put, Get, or Delete operations on a single row will
always be consistent.
● Write-Ahead Log (WAL): For durability and consistency, writes are first logged in
the WAL before being applied to the MemStore and HFiles. This ensures that even if
a region server crashes, data can be recovered to the last consistent state from the
WAL.
● Versioning: HBase supports versioning of cells. When a row is updated, the older
versions are retained based on the configured versioning policy. This allows you to
query the row's previous versions, ensuring that the system maintains consistency
in historical data.
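Versioning can be configured and inspected from the HBase shell; a minimal sketch (table and column hypothetical):
create 'students', {NAME => 'academics', VERSIONS => 3}
get 'students', '201', {COLUMN => 'academics:marks', VERSIONS => 3}   # returns up to 3 versions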
Query Features in HBase:
● Row Key-based Queries: Optimized for retrieving data based on the row key; efficient for single-row lookups. Example: retrieving a user’s profile based on user ID.
● Range Queries: Supports range scans based on row keys; efficient for querying a range of rows sorted by row key. Example: fetching all rows for a particular time period (e.g., logs from a specific date range).
● Scan Operations: Allows scanning of the entire table or a subset of rows based on conditions like start/stop row key. Example: scanning rows for a specific range, e.g., all products from row1 to row100.
● No Joins: HBase does not support joins like relational databases; queries must be done separately or handled by external tools like Apache Phoenix or Apache Hive. Example: for complex relationships, handle joins outside HBase (e.g., in a data processing layer).
● Partial Scans: Allows partial scans to filter specific rows or columns within a range, improving scan efficiency. Example: scanning only a subset of rows within a specific range of row keys.
● Throttling Queries: Supports limiting the amount of data returned in a query to avoid overwhelming the client or server. Example: limiting the number of rows retrieved in a scan to avoid system overload.
Here is a realistic example of querying in HBase using an Indian student database.
Table: students
Row Key: student ID (e.g., 201)
Column Family: personal → columns: name, age
Column Family: academics → columns: grade, marks
Let’s get all details of student Rajesh Kumar (Row Key: 201):
get 'students', '201'
Output:
COLUMN CELL
personal:name timestamp=..., value=Rajesh Kumar
personal:age timestamp=..., value=15
academics:grade timestamp=..., value=10
academics:marks timestamp=..., value=88
To fetch only a single column, for example the marks of student 202:
get 'students', '202', {COLUMN => 'academics:marks'}
Output:
COLUMN CELL
academics:marks timestamp=..., value=92
To scan the entire table:
scan 'students'
Output:
ROW COLUMN+CELL
201 column=personal:name, timestamp=..., value=Rajesh Kumar
201 column=personal:age, timestamp=..., value=15
201 column=academics:grade, timestamp=..., value=10
201 column=academics:marks, timestamp=..., value=88
...
If we want to fetch all students whose Row Key starts with "20" (e.g., 201, 202, 203):
scan 'students', {ROWPREFIXFILTER => '20'}
Output:
ROW COLUMN+CELL
201 column=personal:name, timestamp=..., value=Rajesh Kumar
202 column=personal:name, timestamp=..., value=Priya Sharma
203 column=personal:name, timestamp=..., value=Arjun Reddy
...
To fetch only the name of student 202, a bounded scan can be used:
scan 'students', {STARTROW => '202', STOPROW => '203', COLUMNS => 'personal:name'}
Output:
ROW COLUMN+CELL
202 column=personal:name, timestamp=..., value=Priya Sharma
HBase does not support traditional ACID transactions like relational databases (e.g.,
MySQL, PostgreSQL). However, it provides atomicity at the row level and some
mechanisms to control data consistency. Here are the key transaction control mechanisms:
Row-Level Atomicity:
HBase ensures that all column families of a row are updated atomically.
If we update marks and grade in the same row, HBase ensures atomicity.
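A minimal Java sketch of such an atomic row update (values hypothetical; table is an open Table instance):
Put put = new Put(Bytes.toBytes("201"));
put.addColumn(Bytes.toBytes("academics"), Bytes.toBytes("marks"), Bytes.toBytes("95"));
put.addColumn(Bytes.toBytes("academics"), Bytes.toBytes("grade"), Bytes.toBytes("11"));
table.put(put); // both columns are written in one atomic row mutation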
HBase guarantees that both operations will not be partially written—either both succeed
or none.
Atomic Increment:
HBase provides an atomic increment method to increase numerical values without race conditions.
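For example, in the HBase shell (counter column hypothetical):
incr 'students', '201', 'academics:logins', 1        # atomically add 1
get_counter 'students', '201', 'academics:logins'    # read the current counter value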
Key Takeaways
● HBase does not support full ACID transactions but provides atomic operations at
the row level.
● Check-and-Put/Delete ensures conditional updates, preventing race conditions.
● Increment is atomic, useful for counters.
● Batch mutations improve performance by reducing network calls.
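A minimal Java sketch of a batch mutation (row keys and values hypothetical; table is an open Table instance):
List<Row> actions = new ArrayList<>();
Put put = new Put(Bytes.toBytes("201"));
put.addColumn(Bytes.toBytes("academics"), Bytes.toBytes("marks"), Bytes.toBytes("90"));
actions.add(put);
actions.add(new Delete(Bytes.toBytes("202")));
Object[] results = new Object[actions.size()];
table.batch(actions, results); // one round trip for several mutations; note: not atomic across rows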
Suitable Use Cases for HBase:
● Real-time Big Data Analytics: HBase handles massive data ingestion and fast queries over petabytes of data.
● Log Data Storage and Analysis: Suitable for storing and analyzing logs from servers, applications, and IoT devices.
● Time-Series Data: Efficient for handling sensor data, stock market prices, or system performance metrics.
● Social Media & Messaging: Stores user interactions, chat messages, and feed data for scalability.
● Geospatial Data Processing: Stores and queries large-scale geospatial datasets (e.g., GPS tracking).
● IoT Data Storage: Handles continuous data streams from IoT devices and sensors.
● Enterprise Data Lakes: Works as a NoSQL backend for massive-scale data lakes.
HBase is ideal for write-heavy workloads, real-time read/write access, and large-scale
distributed applications. However, it’s not suitable for transactional systems requiring
strong ACID guarantees.
Here are some HBase query examples using an Employee database with Indian data.
We'll cover basic operations like creating a table, inserting data, retrieving data, and
filtering.
1. Creating the Table
An employee table typically has columns like emp_id, name, designation, department, salary, and location.
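A possible layout in the HBase shell (column-family names assumed for illustration):
create 'employee', 'personal', 'professional'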
2. Inserting Data
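For example, inserting one employee (row key and values hypothetical):
put 'employee', 'E001', 'personal:name', 'Amit Sharma'
put 'employee', 'E001', 'personal:location', 'Chennai'
put 'employee', 'E001', 'professional:designation', 'Software Engineer'
put 'employee', 'E001', 'professional:salary', '75000'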
3. Retrieving Data
To fetch a single employee’s record (row key as inserted above):
get 'employee', 'E001'
Output:
COLUMN CELL
...
To scan the entire table:
scan 'employee'
To count the number of rows in the table:
count 'employee'
3.6 Event Logging in Content Management Systems
HBase, a distributed, scalable NoSQL database that runs on Hadoop, is widely used for
handling large datasets and real-time analytics. When integrated with Content Management
Systems (CMS), HBase provides an efficient solution for event logging, ensuring high
availability and fault tolerance.
In a CMS, event logging is essential for tracking user activities, monitoring system
performance, and maintaining security. HBase's capability to handle massive amounts of
structured and semi-structured data makes it an ideal choice for storing event logs.
1. Scalability: HBase can handle petabytes of data across distributed clusters, making it
suitable for large-scale CMS platforms with high user activity.
2. Real-time Data Ingestion: HBase supports real-time write operations, allowing
instant logging of events like user logins, content updates, and access control
changes.
3. Schema Flexibility: HBase allows dynamic schema evolution, which is crucial for
logging different types of events with varying attributes.
4. Fault Tolerance: Built on Hadoop's HDFS, HBase ensures data durability and fault
tolerance, preventing data loss during failures.
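Example Schema (hypothetical) for CMS event logs:
Table: cms_events
Row Key: user_id + reversed timestamp (so the latest events scan first)
Column Family: event → columns: type, resource, ip_address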
Advantages: scalable storage for high-volume logs, real-time ingestion, flexible event schemas, and durable, fault-tolerant retention.
In conclusion, HBase provides a robust and scalable solution for managing event logs in a
CMS, enhancing security, performance monitoring, and compliance.
3.7 Blogging Platforms: Counters and Expiring Usage
HBase, a NoSQL distributed database built on Hadoop, is highly effective for managing
counters and expiring usage data in blogging platforms. Counters are essential for tracking
metrics like post views, likes, and comments, while expiring usage helps manage temporary
data like session activity and user engagement.
HBase provides an increment operation that allows atomic counter updates without
read-modify-write cycles. This is particularly useful for tracking:
● Post views
● Number of likes or comments
● User visits and engagement time
How it Works:
The increment operation updates a counter on the server side in a single atomic step, so many concurrent clients can update the same counter without overwriting each other.
Example Schema:
Table: blog_metrics
Row Key: post_id
Column Family: counters
Column: views, likes, comments
Increment Command (row key hypothetical):
incr 'blog_metrics', 'post_101', 'counters:views', 1
Expiring Usage with TTL:
In a blogging platform, certain data (like session logs, temporary comments, or user
activity) should expire after a certain period. HBase provides Time-to-Live (TTL)
functionality to handle this.
How it Works:
A TTL (in seconds) is set on a column family; for example:
alter 'blog_metrics', {NAME => 'counters', TTL => 86400}
This command sets the TTL to 24 hours (86400 seconds). Any data older than this will be
automatically deleted.
Conclusion
HBase's atomic counters and TTL-based expiring usage make it a powerful choice for
real-time analytics and efficient data management in blogging platforms. This approach
ensures scalability while automatically handling data expiration to optimize storage.
3.8 Visible Surface Detection Methods
In the context of Visible Surface Detection (VSD), which is a critical concept in computer
graphics for determining which surfaces are visible to the viewer, integrating HBase can
provide a scalable and distributed approach for managing large datasets and performing
computations efficiently.
HBase itself is not directly used for rendering graphics, but it plays a vital role in storing,
managing, and analyzing surface data and depth information when handling large 3D
models or scenes.
● HBase can store large-scale 3D mesh data, depth maps, and z-buffer information.
● It supports real-time updates on surface visibility changes.
● It allows parallel processing of surface data for faster computations.
The Z-Buffer algorithm, a popular method for visible surface detection, relies on storing
depth values for each pixel.
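A possible (hypothetical) HBase layout for storing z-buffer data:
Table: zbuffer
Row Key: frame_id + pixel coordinates (e.g., 'f001_x0100_y0200')
Column Family: depth → columns: z_value, surface_id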
Back-face culling is a technique to eliminate surfaces that are not visible to the camera.
HBase can be used to store surface normals and view vectors and perform filtering on
the backend.
In ray-casting, rays are traced from the camera to intersect with surfaces. HBase can store
ray intersection data and help in parallel processing for large 3D environments.
HBase's TTL (Time-To-Live) feature can automatically remove outdated visibility data. For
example, temporary visibility data in dynamic environments can be set to expire after a
certain time.
Conclusion
HBase provides a scalable and distributed platform for handling large-scale Visible
Surface Detection (VSD) data. It can manage depth buffers, back-face culling data, and
ray intersection information, making it suitable for real-time rendering and 3D
graphics analysis in cloud environments.