
COMPREHENSIVE GUIDE TO NOSQL DATABASES AND APPLICATIONS


UNIT 1: INTRODUCTION TO NOSQL
WHAT IS UNSTRUCTURED DATA? ADVANTAGES AND
DISADVANTAGES

Unstructured data refers to information that does not reside in a traditional
row-column database or follow a fixed schema. It lacks a predefined data
model, making it difficult to organize in traditional relational databases which
require rigid structures like tables and predefined data types. Examples of
unstructured data include text documents, emails, social media posts,
images, audio files, video content, sensor data, web pages, and logs.

While unstructured data is pervasive and represents the vast majority of data
generated today, extracting meaningful insights from it requires different
tools and techniques compared to structured data.

Advantages of Unstructured Data:

• Flexibility: It accommodates diverse types of information without
requiring upfront schema definition or modification. This is particularly
useful in scenarios where data formats are constantly evolving or are
highly varied.
• Richness: Unstructured data often contains more detailed and nuanced
information than structured data, capturing context, sentiment, and
complex relationships that are difficult to fit into predefined fields.
• Volume: The sheer volume of unstructured data available from sources
like the internet, social media, and sensors provides a massive source of
potential insights for analytics, machine learning, and business
intelligence.
• Real-world Representation: Many real-world data sources, such as
documents, emails, and multimedia, are inherently unstructured,
making this format a natural way to capture and store such information.
Disadvantages of Unstructured Data:

• Difficulty in Processing and Analysis: Due to the lack of a fixed schema,
analyzing unstructured data is significantly more complex than
analyzing structured data. Traditional SQL queries are not directly
applicable. It often requires advanced techniques like natural language
processing (NLP), machine learning, and complex pattern matching.
• Storage and Management Challenges: While flexible, storing and
managing vast amounts of unstructured data efficiently can be
challenging. Indexing and searching within unstructured data require
specialized technologies and are often less precise or efficient than with
structured data.
• Higher Cost and Complexity: Extracting value from unstructured data
typically involves higher costs and more complex infrastructure (e.g.,
data lakes, specialized processing engines) and expertise compared to
relational databases.
• Data Quality Issues: Ensuring data quality, consistency, and accuracy is
harder in unstructured data because there are no predefined rules or
constraints enforced by a schema.

The rise of big data and the need to process massive volumes of diverse
information have highlighted both the importance and the challenges of
working with unstructured data, driving the development of technologies like
NoSQL databases that are better suited to handle its inherent variability.

FEATURES OF NOSQL DATABASES

NoSQL (Not Only SQL) databases emerged as an alternative to traditional
relational databases to address the limitations faced when handling large
volumes of unstructured or semi-structured data, high velocity, and high
variability, particularly in distributed environments. Key features of NoSQL
databases include:

• Schema-less or Flexible Schema: Unlike relational databases that
require a predefined and often rigid schema, NoSQL databases offer
flexible schemas. This means data can be added or modified without
changing the entire database structure, making development cycles
faster and accommodating evolving data requirements.
• Horizontal Scalability: NoSQL databases are designed for horizontal
scaling, meaning they can handle increasing amounts of data and traffic
by adding more servers or nodes to the database cluster. This is often
simpler and less expensive than the vertical scaling (upgrading existing
hardware) typically used for relational databases.
• High Availability: Many NoSQL databases are built with replication and
distribution in mind, ensuring that data is available even if some nodes
in the cluster fail.
• Eventual Consistency (often): While some NoSQL databases offer strong
consistency, many prioritize availability and partition tolerance over
immediate consistency (following the CAP theorem). This often leads to
eventual consistency, where data changes propagate through the
system over time, and queries might not immediately reflect the latest
writes across all nodes.
• Optimized for Specific Data Models: Different types of NoSQL databases
are optimized for particular data structures and access patterns (e.g.,
key-value pairs, documents, column families, graphs), making them
highly performant for specific use cases.
• BASE Properties (often): Many NoSQL systems adhere to the BASE
(Basically Available, Soft state, Eventually consistent) properties,
contrasting with the ACID (Atomicity, Consistency, Isolation, Durability)
properties common in traditional relational databases. BASE emphasizes
availability and flexibility over strict consistency.

These features make NoSQL databases well-suited for modern web
applications, mobile apps, big data analytics, content management systems,
and real-time systems that demand high scalability, flexibility, and
performance with varying data structures.

TYPES OF NOSQL DATABASES, DATA STORAGE MECHANISMS, AND WORKING PRINCIPLES

NoSQL databases are broadly categorized based on their data model and
storage mechanism. The four primary types are Key-Value Stores, Document
Databases, Column-Family Stores, and Graph Databases; this guide focuses on
the first three, together with the umbrella term "aggregate-oriented stores"
that encompasses them.

Key-Value Stores

Data Model: The simplest NoSQL data model. Data is stored as a collection of
key-value pairs. The key is a unique identifier, and the value is typically an
opaque blob of data (string, number, complex object, etc.) that the database
does not inspect or understand. The value can be any data type.
Storage Mechanism: Data is typically stored on disk or in memory, often
using hash tables or similar structures for quick lookup based on the key.
Distribution is commonly achieved by hashing keys to determine which node
stores the corresponding value.

Working Principles: Operations are straightforward:

• PUT(key, value): Store or update the value associated with a given key.
• GET(key): Retrieve the value associated with a given key.
• DELETE(key): Remove the key-value pair.

Queries are limited to retrieving data based on the key. There are typically no
complex query capabilities or relationships between values. This simplicity
allows for extremely high performance and scalability.
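
As a concrete illustration, here is a minimal Python sketch using the redis-py client against a hypothetical local Redis server (a popular key-value store); the key names and the JSON serialization are illustrative assumptions, since the store itself treats the value as an opaque blob.

    import json
    import redis  # pip install redis; assumes a Redis server on localhost:6379

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # PUT: the application serializes the value (here, to JSON) because the
    # database does not inspect or understand its contents.
    r.set("user:123", json.dumps({"name": "Alice", "cart": ["sku-1", "sku-2"]}))

    # GET: retrieval is by exact key only.
    user = json.loads(r.get("user:123"))

    # DELETE: removes the key-value pair.
    r.delete("user:123")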

Diagram Concept: Imagine a simple lookup table or dictionary. On the left, a
column of unique keys (e.g., "user:123", "product:abc"). On the right, a
column of corresponding values (e.g., a JSON string representing user data, a
serialized object representing product info). Arrows show lookup from key to
value. This structure is distributed across multiple servers, with keys mapped
to specific servers.

Document Databases

Data Model: Data is stored in documents, typically in formats like JSON,
BSON, or XML. Each document is self-contained and can have a nested
structure. Documents within the same collection can have different fields,
offering flexibility (though usually sharing a similar overall structure). Each
document has a unique identifier.

Storage Mechanism: Documents are typically stored on disk. Indexes can be
created on fields within documents to facilitate faster querying beyond just
the document ID. Sharding is common, distributing collections or documents
across multiple nodes.

Working Principles: Operations include:

• Inserting, updating, and deleting documents based on their ID.
• Querying documents based on the values of fields within the
documents. This allows for richer querying than key-value stores,
including range queries, equality matches, and sometimes basic
aggregation.
• Indexes help speed up these queries.

This model is well-suited for content management, catalogs, and user profiles
where data structures can vary slightly from one entry to the next.
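
A minimal sketch of these operations with the pymongo driver is shown below; the database, collection, and field names are illustrative assumptions, not part of any particular application.

    from pymongo import MongoClient  # pip install pymongo; assumes a local mongod

    profiles = MongoClient("mongodb://localhost:27017")["appdb"]["profiles"]

    # Documents in the same collection may carry different fields.
    profiles.insert_one({"_id": "doc1", "name": "Alice", "age": 30})
    profiles.insert_one({"_id": "doc2", "name": "Bob", "city": "New York"})

    # Queries address field values, not just the document ID; an index helps.
    profiles.create_index("age")
    alice = profiles.find_one({"name": "Alice"})
    adults = list(profiles.find({"age": {"$gte": 18}}))

    # Update by ID.
    profiles.update_one({"_id": "doc2"}, {"$set": {"city": "Boston"}})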

Diagram Concept: Picture collections (like folders) containing individual
documents (like files). Each document is represented as a box with fields and
their values listed inside (e.g., {"_id": "doc1", "name": "Alice", "age": 30}). Show
different documents in the same collection having potentially different fields
(e.g., another document: {"_id": "doc2", "name": "Bob", "city": "New York"}).
Show how queries can select documents based on field values across the
collection.

Column-Family Stores (Wide-Column Stores)

Data Model: Data is organized into rows, but instead of fixed columns like in
relational databases, rows contain "column families." Within a column family,
data is stored as columns which can vary from row to row. This structure
allows for efficient storage and retrieval of sparse data and wide rows.

Storage Mechanism: Data is stored column by column rather than row by row
within a column family on disk. This is different from traditional row-oriented
storage. Keys identify rows, and column families group related columns. Data
within a column family for a specific row is typically stored together.

Working Principles:

• Data is accessed by row key and then by column family and column
name.
• Efficient for reading and writing specific columns or sets of columns
across many rows.
• Well-suited for time-series data, event logging, and large analytical
datasets where queries often focus on subsets of attributes across many
records.

Examples include Apache HBase and Apache Cassandra.
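
For illustration, the sketch below uses the Python cassandra-driver and CQL, through which Cassandra exposes column families as tables; the keyspace, table, and column names are assumptions made for this example, and rows may populate only a subset of the defined columns.

    from cassandra.cluster import Cluster  # pip install cassandra-driver; assumes a local node

    session = Cluster(["127.0.0.1"]).connect()
    session.execute(
        "CREATE KEYSPACE IF NOT EXISTS demo "
        "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
    )
    session.set_keyspace("demo")

    # One row per user; columns within the family can be sparse.
    session.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id text PRIMARY KEY, name text, age int, email text)"
    )
    session.execute("INSERT INTO users (user_id, name, age) VALUES ('user:1', 'Alice', 30)")
    session.execute("INSERT INTO users (user_id, name, email) VALUES ('user:2', 'Bob', 'b@x.io')")

    # Access by row key, then by the specific columns of interest.
    row = session.execute("SELECT name, email FROM users WHERE user_id = 'user:2'").one()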

Diagram Concept: Represent a table structure. Each row has a unique Row
Key. Columns are grouped into Column Families (e.g., "Personal Info",
"Contact Info"). Within a Column Family for a given Row Key, show individual
columns (e.g., "Personal Info: Name", "Personal Info: Age", "Contact Info:
Email"). Crucially, emphasize that not all rows need to have the same columns
within a family, and values for columns are stored contiguously within their
column family.

Note: Aggregate-Oriented Databases is often used as a term encompassing
Key-Value, Document, and Column-Family stores because they store data in
aggregates (like documents or rows with their associated columns) that can
be retrieved efficiently, contrasting with graph databases which focus on
relationships between individual entities.

THE TERM "ATTACK OF CLUSTERS" AND ITS APPLICABILITY IN


NOSQL DATABASES

The term "Attack of Clusters" (sometimes referred to in contexts of distributed


systems or network partitions) can be interpreted in several ways depending
on the specific context, but in the context of distributed NoSQL databases, it
most likely refers to the challenges and complexities that arise when
managing and maintaining a large, distributed cluster of database nodes. It
highlights potential failure modes and management overheads inherent in
such systems.

Specifically, in a large NoSQL cluster designed for horizontal scalability and
high availability across many nodes, potential issues include:

1. Network Partitions: When the network splits the cluster into multiple
segments, and nodes in different segments cannot communicate. This is
a core challenge addressed by the CAP theorem. How the database
behaves during a partition (prioritizing availability or consistency) is
critical.
2. Node Failures: Individual nodes can fail (hardware issues, software
crashes). The cluster needs mechanisms (like replication) to detect
failures, failover to replicas, and recover gracefully without losing data
or becoming unavailable.
3. Split-Brain Scenarios: A specific type of network partition where
different parts of the cluster believe they are the authoritative source for
certain data, leading to conflicting updates and data inconsistencies if
not handled properly.
4. Operational Complexity: Managing hundreds or thousands of nodes,
monitoring their health, performing upgrades, and rebalancing data
across the cluster adds significant operational overhead.
5. Data Inconsistency: In systems prioritizing availability during partitions
(AP systems under CAP), different parts of the cluster might serve stale
data, leading to temporary inconsistencies that must eventually be
resolved.

The "Attack of Clusters" metaphor emphasizes that while clustering provides


scalability and resilience benefits, it also introduces new, complex failure
modes and management challenges that don't exist in single-instance
databases. It requires careful design, robust fault tolerance mechanisms, and
sophisticated operational practices.

Diagram Concept: A diagram illustrating this could show multiple
interconnected nodes representing a cluster. Some nodes or network links
could be highlighted as failing or partitioned. Arrows could show data flow
and replication under normal conditions vs. blocked or conflicting flows
during failures or partitions. An "X" or a broken line over communication
paths or nodes would visually represent the "attack" or failure points within
the cluster environment.

IMPEDANCE MISMATCH IN NOSQL DATABASES: WORKING AND REQUIREMENT

Object-Relational Impedance Mismatch is a long-standing problem in
software development that arises when attempting to map an object-oriented
domain model (used in application code) to a relational database schema.
Objects have complex relationships, inheritance, and methods, while
relational databases store data in flat tables with fixed columns. Mapping
objects to rows and tables requires complex Object-Relational Mapping (ORM)
layers, which can be cumbersome, perform poorly, and fail to perfectly
represent the object model's nuances.

NoSQL databases, particularly Document and Graph databases, can
significantly reduce or eliminate this impedance mismatch.

Working Principle in NoSQL:

Document databases, for instance, store data in hierarchical documents (like
JSON). This structure often aligns much more closely with the structure of
objects in object-oriented programming languages. An object can often be
directly serialized into a document, and a document can be deserialized into
an object, with minimal transformation. Nested objects within the application
model can be represented as nested structures within the document. This
makes writing and reading data simpler and more direct compared to
decomposing an object structure into multiple related tables in an RDBMS.
Graph databases, on the other hand, store data as nodes and relationships,
which maps naturally to object models where objects are nodes and their
references to other objects are relationships. This is particularly effective for
highly connected data.
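
The contrast can be sketched in a few lines of Python; the User, Address, and Order classes are hypothetical stand-ins for an application's domain model, and the point is only how directly a nested object graph serializes into a single document.

    from dataclasses import dataclass, asdict, field
    import json

    @dataclass
    class Address:
        city: str
        postcode: str

    @dataclass
    class Order:
        order_id: str
        total: float

    @dataclass
    class User:
        user_id: str
        name: str
        address: Address
        orders: list = field(default_factory=list)

    user = User("u1", "Alice", Address("London", "E1 6AN"),
                [Order("o1", 42.00), Order("o2", 13.50)])

    # Document mapping: the nested object graph becomes one self-contained document.
    document = asdict(user)         # nested dicts mirror the nested objects
    payload = json.dumps(document)  # ready to store as-is in a document database

    # A relational mapping of the same object would decompose it into users,
    # addresses, and orders tables linked by foreign keys, and reassemble it
    # with joins (or an ORM) on every read.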

Requirement/Necessity of Addressing Impedance Mismatch:

Reducing impedance mismatch is important for several reasons:

• Development Speed: Developers spend less time writing complex
mapping code (ORMs or manual SQL mapping) and more time on core
application logic. This accelerates the development process.
• Code Simplicity: The application code interacting with the database
becomes simpler and more intuitive as the database structure more
closely mirrors the application's object model.
• Performance: Retrieving an entire object often involves fetching a single
document or traversing a graph structure directly, potentially avoiding
the need for complex JOIN operations across multiple tables that are
common and often performance bottlenecks in RDBMS when retrieving
complex object graphs.
• Maintainability: Simpler code is easier to understand and maintain.
Changes to the object model are often easier to accommodate in a
flexible-schema NoSQL database than in a rigid relational schema.

Diagram Concept: The diagram should visually contrast the mapping complexity.

1. RDBMS Mapping (High Mismatch): Show an object model (e.g., a "User"
object with nested "Address" and a list of "Orders"). Next to it, show a
relational schema with separate "Users", "Addresses", and "Orders"
tables, linked by foreign keys. Arrows would show the complex process
of decomposing the object into rows across tables and reconstructing it
via joins. Indicate the "mismatch" or "barrier" between the object layer
and the relational layer.
2. NoSQL (Document DB) Mapping (Low Mismatch): Show the same "User"
object model. Next to it, show a single JSON document representing the
User, with nested Address object and an array of Order objects within
the same document. Arrows would show a much more direct, almost
one-to-one mapping between the object and the document structure.
Indicate a much smaller or non-existent "mismatch" barrier.
The diagram should visually convey that NoSQL data models like document
stores often align more naturally with application object models than
relational tables do.

SHORT NOTES ON THE HISTORY OF NOSQL

The term "NoSQL" was first used in 2007 by Johan Oskarsson to name a
relational database that did not expose a SQL interface. However, the term
gained prominence around 2009 as a way to categorize a growing number of
non-relational, distributed data stores that were being developed to meet the
needs of web-scale applications.

Several factors contributed to the rise of NoSQL:

• The Internet and Web-Scale Applications: Companies like Google,
Amazon, and Facebook faced unprecedented challenges with data
volume, velocity, and variety that traditional relational databases
struggled to handle efficiently or affordably at massive scale. They
needed databases that could scale horizontally across cheap commodity
hardware.
• The Need for Flexibility: Rapid development cycles and evolving data
structures in web applications demanded databases with more flexible
schemas than rigid RDBMS.
• Open Source Movement: Many early NoSQL databases were developed
as open-source projects, fostering innovation and adoption.
• Specific Data Needs: Certain types of data and access patterns (e.g., key-
value lookups, document storage, graph traversal) were more naturally
and efficiently handled by specialized data models than by fitting them
into the relational model.

Early examples and influences include:

• Google's Bigtable (2004): A distributed storage system for structured
data, influencing later column-family stores like HBase and Cassandra.
• Amazon's Dynamo (2007): A highly available key-value store that
influenced Riak and Cassandra's design principles (e.g., eventual
consistency, consistent hashing).
• IBM's Lotus Notes (pre-web): Though not considered "NoSQL" at the
time, its document-oriented approach influenced later document
databases.
• Open Source Projects: Projects like Memcached (key-value cache),
CouchDB (document database), MongoDB (document database),
Cassandra (column-family store), and HBase (column-family store built
on Hadoop) gained popularity in the late 2000s and early 2010s,
solidifying the NoSQL movement.

Initially, NoSQL was seen by some as a replacement for RDBMS, but the
prevailing view now is that NoSQL databases are complementary to relational
databases. They are best suited for specific use cases where their strengths in
scalability, flexibility, and performance for certain data models outweigh the
benefits of the relational model (like strong consistency and complex
transaction support).

UNIT 2: ADVANCED CONCEPTS AND COMPARISONS IN NOSQL
DIFFERENCE BETWEEN RELATIONAL DATABASES AND NOSQL
DATABASES

Relational Databases (RDBMS) and NoSQL databases represent
fundamentally different approaches to data storage, organization, and
retrieval. RDBMS, based on the relational model introduced by E.F. Codd in
1970, store data in tables with predefined schemas and enforce relationships
through foreign keys. NoSQL databases, conversely, encompass a variety of
database technologies designed to address the limitations of RDBMS for
specific use cases, particularly those involving large volumes of unstructured
or semi-structured data, high throughput, and horizontal scalability.

Here is a comparison of key differences:

Feature: Data Model
  RDBMS: Stores data in tables with rows and columns; relationships defined by foreign keys. Rigid schema required.
  NoSQL: Diverse models: Key-Value, Document, Column-Family, Graph. Flexible or schema-less design common.

Feature: Schema
  RDBMS: Strict, predefined schema; changes require altering table structure (often impacting availability).
  NoSQL: Flexible or dynamic schema; easier to adapt to evolving data structures.

Feature: Scalability
  RDBMS: Primarily vertical scaling (scaling up by adding more resources to a single server). Horizontal scaling (scaling out by adding more servers) is complex and expensive.
  NoSQL: Primarily horizontal scaling (scaling out). Designed to distribute data and load across many servers easily.

Feature: ACID Transactions
  RDBMS: Strong support for ACID (Atomicity, Consistency, Isolation, Durability) properties, ensuring data integrity for complex transactions across multiple tables.
  NoSQL: Varies by type and implementation. Many prioritize availability (BASE - Basically Available, Soft state, Eventually consistent). ACID transactions across multiple operations or documents are often limited or non-existent, though some support ACID within a single record/document.

Feature: Query Language
  RDBMS: SQL (Structured Query Language), a powerful and standardized query language for complex joins and aggregations.
  NoSQL: Querying methods are database-specific; can be API-based, query languages like MapReduce, or specific languages (e.g., MongoDB Query Language, CQL for Cassandra, Gremlin/Cypher for Graph DBs). Joins are typically handled in the application layer.

Feature: Data Structure
  RDBMS: Best suited for structured data with well-defined relationships.
  NoSQL: Excellent for unstructured, semi-structured, and rapidly changing data.

Feature: Availability & Consistency
  RDBMS: Typically prioritize consistency and durability (CP or CA under the CAP theorem). Can be less available during partitions or failures without complex setups.
  NoSQL: Often prioritize availability and partition tolerance (AP under the CAP theorem), leading to eventual consistency. Strong consistency is available in some systems or configurations but may impact availability.

Feature: Maturity & Ecosystem
  RDBMS: Mature technology with established tools, skilled workforce, and strong community support.
  NoSQL: Rapidly evolving; tooling and support ecosystem vary greatly between specific databases. Requires specialized skills.

Feature: Cost
  RDBMS: Can be expensive to scale vertically, especially with proprietary systems.
  NoSQL: Often more cost-effective for large-scale distribution using commodity hardware.


In essence, RDBMS excels where data structure is stable, data integrity is
paramount, and complex transactional consistency is required. NoSQL
databases are preferred for modern applications requiring massive scale,
flexibility, handling diverse data types, and high availability, often accepting
eventual consistency trade-offs.

COMPARISON OF POPULAR NOSQL DATABASES: MONGODB, CASSANDRA, AND HBASE

MongoDB, Apache Cassandra, and Apache HBase are three prominent NoSQL
databases, each representing a different category (Document, Column-
Family, and Column-Family/Key-Value hybrid built on Hadoop) and optimized
for different workloads and use cases.

Feature: Type/Data Model
  MongoDB: Document Store (JSON/BSON documents).
  Apache Cassandra: Wide-Column Store (Column Families).
  Apache HBase: Column-Family Store (built on HDFS).

Feature: Primary Use Cases
  MongoDB: Content management, catalogs, user profiles, mobile apps, real-time analytics, applications needing flexible schemas.
  Apache Cassandra: Time series data, event logging, IoT data, messaging systems, applications needing high write throughput and continuous availability across data centers.
  Apache HBase: Big Data analytics, sparse datasets, time-series data, operational data store for the Hadoop ecosystem, applications needing strong consistency within rows.

Feature: Consistency Model
  MongoDB: Tunable consistency (strong consistency by default for single operations, eventual consistency for reads from replicas unless specified). Supports ACID transactions across multiple documents within a replica set.
  Apache Cassandra: Eventual consistency (tunable consistency levels per operation). Prioritizes availability and partition tolerance.
  Apache HBase: Strong consistency within a row. Prioritizes consistency and partition tolerance over availability during network partitions (CP under CAP).

Feature: Scalability
  MongoDB: Horizontal scaling via sharding (distributes data ranges across shards/servers).
  Apache Cassandra: Horizontal scaling via consistent hashing (distributes data across nodes in a ring). Masterless architecture.
  Apache HBase: Horizontal scaling via RegionServers managing data Regions on HDFS. Master-based architecture (HMaster).

Feature: Querying
  MongoDB: Rich query language (MongoDB Query Language) allowing queries on fields, ranges, text search, geospatial data, and an aggregation framework. Secondary indexes supported.
  Apache Cassandra: Cassandra Query Language (CQL), similar to SQL for basic operations. Optimized for row-key lookups and range scans. Limited complex querying and secondary indexes.
  Apache HBase: API-based access (Java, etc.). Optimized for row-key lookups and column-family/column scans within a row. Filter-based scans supported.

Feature: Architecture
  MongoDB: Master-replica sets for high availability; config servers and Mongos routers for sharding.
  Apache Cassandra: Masterless peer-to-peer distributed system. Data is replicated across multiple nodes based on the replication factor.
  Apache HBase: Master-slave architecture with HMaster managing RegionServers, which store data on HDFS.

Feature: Strengths
  MongoDB: Flexible schema, developer-friendly (JSON documents map well to objects), rich querying capabilities, good for diverse data.
  Apache Cassandra: Extreme write throughput, continuous availability, designed for multi-data center replication, no single point of failure (masterless).
  Apache HBase: Strong consistency per row, tight integration with the Hadoop/HDFS ecosystem, good for sparse data and random reads/writes on large datasets.

Feature: Limitations
  MongoDB: Can consume more storage due to document overhead; sharding setup/management can be complex; eventual consistency for cross-shard transactions without complex setups.
  Apache Cassandra: Eventual consistency by default may not suit all applications; limited querying capabilities; complex data modeling compared to documents.
  Apache HBase: Relies on HDFS (adds complexity); higher latency than in-memory or pure disk-based stores for simple lookups; limited querying outside row/column access.

Choosing between these depends heavily on the specific application's
requirements: MongoDB for flexibility and rich queries on document-like
data, Cassandra for massive write scaling and always-on availability across
locations, and HBase for strong consistency within rows and integration with
the Hadoop Big Data ecosystem.

CHALLENGES OF NOSQL DATABASES

While NoSQL databases offer significant advantages in scalability, flexibility,
and performance for certain workloads, they also present several challenges
that need to be considered:

• Data Consistency Issues: Many NoSQL databases prioritize availability
and partition tolerance, often resulting in eventual consistency. This
means that after a write, different parts of the system may temporarily
have different values for the same data item. Developers must design
applications to cope with potential inconsistencies and understand the
implications for data accuracy, which can be complex. Achieving strong
consistency often sacrifices availability or performance.
• Lack of Standardization: Unlike SQL, which is a widely adopted standard
for relational databases, there is no single query language or standard
interface across all NoSQL databases. Each database type and even
specific products have their own APIs and query methods (e.g.,
MongoDB's query language, Cassandra's CQL, HBase's API). This leads to
vendor lock-in and requires developers to learn new paradigms for each
database.
• Limited Transaction Support: Full, multi-operation ACID transactions
across multiple records or documents (especially across distributed
nodes) are often not supported in many NoSQL databases, or are
implemented with significant limitations compared to RDBMS.
Applications requiring complex transactions involving multiple data
entities may face challenges in data integrity without careful design at
the application level.
• Maturity and Tooling: While many popular NoSQL databases are
mature, the overall ecosystem (monitoring tools, backup solutions,
administration tools, data analysis tools) is often less mature or
standardized compared to the extensive tooling available for RDBMS.
• Skill Gap: The principles and operational aspects of distributed systems
and specific NoSQL data models require specialized knowledge. Finding
experienced developers and administrators proficient in specific NoSQL
technologies can be challenging.
• Data Modeling Complexity: While the flexible schema is an advantage,
designing effective data models in NoSQL, particularly for handling
relationships that would be simple foreign keys in RDBMS, often
requires a different approach and can be complex, potentially leading to
data duplication or challenges in querying related data.
• Reporting and Analytics: Complex reporting and ad-hoc queries
involving joins across multiple aggregates (documents, rows in different
column families) are often not straightforward or performant in NoSQL
databases compared to SQL. Extracting data for traditional business
intelligence tools can be harder, often requiring ETL (Extract, Transform,
Load) processes into a separate data warehouse or using specialized
analytical tools built for NoSQL.
These challenges highlight that NoSQL is not a universal replacement for
RDBMS but rather a powerful set of tools best applied when their specific
strengths align with application requirements, and where the trade-offs are
acceptable or can be mitigated through application design.

SHORT NOTES ON NEO4J DATABASE

Neo4j is a leading open-source Graph Database. Unlike the aggregate-
oriented NoSQL types (Key-Value, Document, Column-Family) which store
data primarily for efficient retrieval of individual items or aggregates, Graph
databases are specifically designed to store and navigate relationships
between entities (nodes) efficiently. Neo4j is often referred to as a native
graph database because its storage engine is optimized for storing and
traversing graph structures directly.

Data Model:

Neo4j uses a property graph model, which consists of:

• Nodes: Represent entities (e.g., a Person, a Product, a Location). Nodes
can have labels (like "Person", "Product") to group similar entities.
• Relationships: Connect nodes (e.g., "FRIENDS_WITH", "PURCHASED",
"LOCATED_IN"). Relationships always have a direction, a type, and can
have properties themselves.
• Properties: Key-value pairs stored on both nodes and relationships (e.g.,
a Person node might have properties like name, age; a PURCHASED
relationship might have properties like date, quantity).

This model maps very naturally to domains where complex interconnections
and relationships are central to the data.

Key Features:

• Native Graph Storage: Optimized disk structure that makes traversing
relationships extremely fast, regardless of the total size of the database.
• ACID Transactions: Supports full ACID transactions, ensuring data
integrity even in complex scenarios.
• Cypher Query Language: An intuitive, declarative graph query language
designed for pattern matching and traversing graphs. Its syntax is often
described as ASCII-art like, visually representing graph patterns.
• Scalability: Scales vertically by adding resources and horizontally via
clustering (though typically focused more on reads than writes
compared to other NoSQL types).
• High Availability: Clustering provides failover and read replicas.

Use Cases:

Neo4j is particularly well-suited for use cases involving complex relationships,
such as:

• Social Networks (connecting users, posts, groups)
• Recommendation Engines (finding similar products, content, or people
based on relationships)
• Fraud Detection (identifying suspicious patterns and connections)
• Knowledge Graphs and Semantic Web
• Network and IT Operations (mapping dependencies)
• Supply Chain and Logistics

While Neo4j is technically a NoSQL database (as it's non-relational), its focus
on relationships makes it distinct from the aggregate-oriented types. It fills a
niche where the connections between data points are as important, or more
important, than the individual data points themselves.
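
As a small, hedged illustration of the property graph model and Cypher, the sketch below uses the official Neo4j Python driver; the connection URI, credentials, node labels, and relationship type are assumptions chosen for the example.

    from neo4j import GraphDatabase  # pip install neo4j; assumes a local Neo4j instance

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Two Person nodes joined by a FRIENDS_WITH relationship carrying a property.
        session.run(
            "MERGE (a:Person {name: $a}) "
            "MERGE (b:Person {name: $b}) "
            "MERGE (a)-[:FRIENDS_WITH {since: 2021}]->(b)",
            a="Alice", b="Bob",
        )
        # Pattern matching: friends and friends-of-friends, the kind of
        # traversal a native graph store is optimized for.
        result = session.run(
            "MATCH (:Person {name: $a})-[:FRIENDS_WITH*1..2]->(other) "
            "RETURN DISTINCT other.name AS name",
            a="Alice",
        )
        names = [record["name"] for record in result]

    driver.close()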

WORKING PRINCIPLE AND STORAGE MECHANISM OF AGGREGATE-ORIENTED NOSQL DATABASES

Aggregate-oriented databases are a category of NoSQL databases that store
data as collections of aggregates. An aggregate is a collection of related data
that is treated as a unit. This unit is typically retrieved and updated together.
Key-Value, Document, and Column-Family stores fall into this category. Their
working principles and storage mechanisms are optimized for fast access to
these aggregates, often sacrificing the ability to perform complex
relationships between aggregates at the database level.

a) Key-Value Databases

Working Principle: The most basic model. Data is accessed solely via a unique
key. Operations are simple: put (add/update), get (retrieve), delete. The
database treats the value as an opaque blob; it does not understand the
structure or content of the value. This simplicity allows for extremely high
performance and scalability for read and write operations based on the key.

Storage Mechanism: Data is physically stored as key-value pairs. Common
underlying storage structures include hash tables, B-trees, or Log-Structured
Merge Trees (LSM-Trees). Data is often partitioned across multiple servers
using hashing or range-based partitioning on the key. Values are stored
contiguously with their keys. Replication involves copying these key-value pairs
to other nodes.

Diagram Concept: Imagine a large distributed hash map. Boxes represent
nodes in a cluster. Inside each box, show a list of key-value pairs (e.g., `KeyA ->
ValueA`, `KeyB -> ValueB`). Arrows show a request entering the system with a
key, being routed to the correct node based on the key (e.g., via a hash
function), and retrieving the corresponding value.

Figure 2.1: Conceptual Key-Value Store Distribution

    +--------------+   +--------------+   +--------------+
    |    Node 1    |   |    Node 2    |   |    Node 3    |
    |--------------|   |--------------|   |--------------|
    | Key1 -> Val1 |   | Key3 -> Val3 |   | Key5 -> Val5 |
    | Key2 -> Val2 |   | Key4 -> Val4 |   | Key6 -> Val6 |
    +--------------+   +--------------+   +--------------+
           ^                  ^                  ^
           |                  |                  |
           +----------- Hash Function -----------+
                              |
           Request (GET Key4) --> System routes to Node 2

(Note: This is a conceptual representation. Actual storage and routing are more complex.)

Example: Storing user sessions. Key = session ID (e.g., `session:user123`),
Value = serialized session data (user preferences, cart contents). Operations
are simply getting or setting the session data based on the ID.
b) Document Databases

Working Principle: Data is stored in self-contained documents (e.g., JSON,
BSON, XML). Each document has a unique ID. Documents can have nested
structures and arrays. Querying can be done not only by document ID but
also by fields within the document, offering more flexibility than key-value
stores. The database understands the structure of the document and can
index specific fields.

Storage Mechanism: Documents are stored on disk. The physical storage
might be in a format optimized for read/write, like BSON for MongoDB.
Indexes are built on fields within documents to speed up queries. Sharding
distributes collections of documents or ranges of documents across multiple
servers. Within a server, documents in a collection might be stored
contiguously or linked via indexes.

Diagram Concept: Illustrate a collection like a folder. Inside the folder, show
several boxes representing documents. Each box contains key-value pairs and
potentially nested structures representing the document's content (e.g., `_id:
"doc1", name: "Alice", address: { city: "London" }`). Show an index pointing
from a field value (e.g., `city = "London"`) to the relevant documents. Show
how documents can reside on different nodes in a cluster.

Figure 2.2: Conceptual Document Store (Collection Distribution)

    +----------------------+   +----------------------+
    | Collection (Users)   |   | Collection (Users)   |
    | Node 1               |   | Node 2               |
    |----------------------|   |----------------------|
    | Doc1 {_id: "A", ...} |   | Doc3 {_id: "C", ...} |
    | Doc2 {_id: "B", ...} |   | Doc4 {_id: "D", ...} |
    | (Index on "name")    |   | (Index on "name")    |
    +----------------------+   +----------------------+

    Query: Find users where name = "Alice"
    -> System checks index, routes to Node 1, finds Doc1

(Note: Sharding splits collections; indexes allow field-based queries.)

Example: Storing product catalogs. Each product is a document containing
details like name, description, price, list of tags, nested vendor information.
Querying can find products by tag, price range, or vendor name.
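
A hedged pymongo sketch of such a catalog is shown below; the collection layout, field names, and aggregation pipeline are assumptions for illustration only.

    from pymongo import MongoClient  # pip install pymongo; assumes a local mongod

    products = MongoClient("mongodb://localhost:27017")["catalog"]["products"]
    products.insert_many([
        {"name": "Desk Lamp", "price": 35.0, "tags": ["lighting", "office"],
         "vendor": {"name": "Lumina", "country": "DE"}},
        {"name": "Monitor Arm", "price": 120.0, "tags": ["office"],
         "vendor": {"name": "ErgoWorks", "country": "US"}},
    ])

    # Field, range, and array-membership queries against document contents.
    office_under_100 = list(products.find({"tags": "office", "price": {"$lte": 100}}))

    # Aggregation computed inside the database: average price per vendor.
    by_vendor = list(products.aggregate([
        {"$group": {"_id": "$vendor.name", "avg_price": {"$avg": "$price"}}},
    ]))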

c) Column-Family Stores (Wide-Column Stores)

Working Principle: Data is organized by row keys and column families. Within
a row, columns are grouped into column families. Unlike RDBMS, columns
within a column family for a given row do not need to be predefined; they can
be added dynamically. Data is typically accessed by row key, then filtered by
column family and optionally specific columns. This model is highly efficient
for storing and querying sparse data or data where queries focus on subsets
of columns across many rows.

Storage Mechanism: Data is stored column-by-column within each column
family on disk, rather than row-by-row. This means that all values for a
specific column (within a column family, for all rows that have that column)
are stored together, or at least contiguously for a given row key within that
family. This is distinct from how document or key-value stores might store
data. Replication ensures copies of column families or rows are kept on
multiple nodes. Data is partitioned by row key across the cluster.

Diagram Concept: Show a table structure with Row Keys. Columns are
explicitly grouped into Column Families (e.g., `CF:PersonalInfo`,
`CF:ContactInfo`). Within a row key, show columns belonging to a CF (e.g.,
`Row1 -> CF:PersonalInfo {Name: "Bob", Age: 42}, CF:ContactInfo {Email:
"[email protected]"}`). Emphasize that another row (`Row2`) might only have
`CF:PersonalInfo {Name: "Charlie"}` and no `CF:ContactInfo` columns, or
entirely different columns within `CF:ContactInfo`. Show how data for a
specific column family/column is stored together on disk or in memory.

Figure 2.3: Conceptual Column-Family Store Structure

    Row Key | CF:PersonalInfo              | CF:ContactInfo
    --------|------------------------------|------------------------------------------
    user:1  | Name: "Alice", Age: 30       | Email: "[email protected]"
    user:2  | Name: "Bob"                  | Phone: "555-1234"
    user:3  | Name: "Charlie", City: "NYC" | Email: "[email protected]", Twitter: "@c"

    Physical Storage (Conceptual - within a partition/node):
    /data/CF:PersonalInfo/Name/user:1:"Alice"/user:2:"Bob"/user:3:"Charlie"
    /data/CF:PersonalInfo/Age/user:1:30/
    /data/CF:PersonalInfo/City/user:3:"NYC"/
    /data/CF:ContactInfo/Email/user:1:"[email protected]"/user:3:"[email protected]"/
    /data/CF:ContactInfo/Phone/user:2:"555-1234"/
    /data/CF:ContactInfo/Twitter/user:3:"@c"/

(Note: Data for columns within a CF is grouped, facilitating scans across a specific column.)

Example: Storing time-series data or user activity feeds. Row key could be
user ID + timestamp. Column families could be 'clicks', 'views', 'purchases'.
Columns within 'clicks' could be URL, timestamp, duration. This allows
efficient storage of sparse event data and querying events of a specific type
for a user within a time range.
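
A CQL sketch of such a layout, issued here through the Python cassandra-driver, is shown below; the keyspace, table definition, and column names are assumptions, and the key design (partition by user, cluster by event time) is one common pattern for this kind of workload.

    from datetime import datetime
    from cassandra.cluster import Cluster  # pip install cassandra-driver; assumes keyspace "demo" exists

    session = Cluster(["127.0.0.1"]).connect("demo")

    # Partition by user, cluster by event time: a user's events are stored
    # together and ordered, so time-range scans within one row key are cheap.
    session.execute(
        "CREATE TABLE IF NOT EXISTS clicks ("
        " user_id text, event_time timestamp, url text, duration_ms int,"
        " PRIMARY KEY ((user_id), event_time)"
        ") WITH CLUSTERING ORDER BY (event_time DESC)"
    )
    session.execute(
        "INSERT INTO clicks (user_id, event_time, url, duration_ms) VALUES (%s, %s, %s, %s)",
        ("user:1", datetime.now(), "/home", 1200),
    )

    # All clicks for one user since a given time: a single-partition range scan.
    rows = session.execute(
        "SELECT event_time, url FROM clicks WHERE user_id = %s AND event_time >= %s",
        ("user:1", datetime(2024, 1, 1)),
    )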

d) Aggregate Oriented Databases

Definition: As noted previously in Unit 1, this term serves as an umbrella
category encompassing Key-Value, Document, and Column-Family stores. The
defining characteristic is that data is conceptually and often physically
grouped into aggregates (like key-value pairs, documents, or rows with their
column families) that are stored and retrieved as single units. The focus is on
optimized access to these aggregates rather than complex, join-based
relationships between different aggregates.
Working Principle: Operations primarily involve retrieving, storing, or
updating entire aggregates based on their identifier (key, document ID, row
key). While secondary indexes might allow querying based on contents within
the aggregate (especially in document and column-family stores), the primary
interaction is aggregate-centric. Relationships between aggregates are
typically managed by storing identifiers within an aggregate (e.g., a
document containing an array of IDs referencing other documents) or
handled at the application level.
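
The sketch below illustrates that aggregate-centric style with a hypothetical order document: line items are embedded so they travel with the aggregate, while the customer is referenced by identifier and resolved by the application (shown here with pymongo-style lookups and assumed collection names).

    # An "order" aggregate: line items are embedded; the customer is referenced by ID.
    order = {
        "_id": "order:1001",
        "customer_id": "customer:42",            # reference to another aggregate
        "items": [                                # embedded, retrieved with the order
            {"product_id": "product:abc", "qty": 2, "price": 9.99},
            {"product_id": "product:xyz", "qty": 1, "price": 24.50},
        ],
        "total": 44.48,
    }

    def load_order_with_customer(db, order_id):
        """Fetch the aggregate, then follow the reference with a second lookup."""
        order_doc = db["orders"].find_one({"_id": order_id})
        customer = db["customers"].find_one({"_id": order_doc["customer_id"]})
        return order_doc, customer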

Mechanism: The specific storage and distribution mechanisms vary by type
(as detailed above), but the common theme is partitioning and replication of
these aggregates across a cluster to achieve scalability and availability.

This category contrasts mainly with Graph databases, which are connection-
oriented, focusing on the relationships between small entities (nodes) rather
than grouping data into larger aggregates.

DISCUSS THE TERM MAP-REDUCE ON DATABASES

MapReduce is a programming model and an associated implementation for
processing large datasets with a parallel, distributed algorithm on a cluster.
While originally popularized by Google and strongly associated with the
Hadoop ecosystem (which includes HBase), the concept is applicable and
sometimes implemented or simulated within other distributed NoSQL
databases for analytical workloads.

The core idea revolves around two primary functions:

1. Map: This function takes an input (e.g., a key-value pair, a document, a
row) and processes it to produce a set of intermediate key-value pairs.
The Map function is applied independently and in parallel across chunks
of the input data distributed across many nodes in the cluster.
2. Reduce: This function takes the intermediate key-value pairs generated
by the Map phase, grouped by the intermediate key, and processes
them to produce a final output. The Reduce function combines or
aggregates values associated with the same intermediate key.

How it applies to Databases (especially NoSQL):


Many NoSQL databases store massive amounts of data distributed across
clusters, making them ideal candidates for parallel processing. MapReduce
provides a framework to perform analytical tasks over this data:

• Offline Analytics: Historically, MapReduce (often via Hadoop/Spark on
data stored in HDFS or HBase) has been used for batch processing large
historical datasets stored in or exported from NoSQL systems for tasks
like counting events, aggregating statistics, or building indexes.
• Built-in Features: Some NoSQL databases (like earlier versions of
MongoDB or CouchDB) offered MapReduce-like features natively for
performing server-side aggregations and queries that couldn't be
efficiently handled by simple lookups or secondary indexes. This allows
processing data locally on the database nodes before returning results,
reducing network traffic.
• Integration with Analytics Platforms: More commonly today, data from
NoSQL databases might be integrated with distributed processing
frameworks like Apache Spark, which often supersedes traditional
MapReduce for many workloads due to performance advantages but
follows similar parallel processing principles.

MapReduce is effective for batch processing workloads where the
computation can be broken down into independent tasks (Map) followed by a
consolidation step (Reduce). It leverages the distributed nature of NoSQL
clusters to achieve high throughput for analytical queries on large datasets.
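
The sketch below shows the shape of the model with an in-process Python word count; real frameworks run the map and reduce functions in parallel across cluster nodes, but the data flow (map, shuffle and sort, reduce) is the same.

    from collections import defaultdict
    from itertools import chain

    documents = [
        "nosql databases scale horizontally",
        "relational databases scale vertically",
    ]

    # Map: emit intermediate (key, value) pairs independently per input chunk.
    def map_phase(doc):
        return [(word, 1) for word in doc.split()]

    # Shuffle & sort: group intermediate values by key.
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(map_phase(d) for d in documents):
        grouped[key].append(value)

    # Reduce: combine the values for each key into the final result.
    word_counts = {key: sum(values) for key, values in grouped.items()}
    # e.g. {"databases": 2, "scale": 2, "nosql": 1, ...}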

Diagram Concept: Illustrate input data chunks distributed across several
"Map" nodes. Show each Map node processing its chunk and emitting
intermediate key-value pairs. Then, show a "Shuffle & Sort" step where
intermediate data is grouped by key and sent to "Reduce" nodes. Finally,
show each Reduce node processing the grouped data to produce final output.
Arrows should clearly show the flow from Input -> Map -> Shuffle/Sort ->
Reduce -> Output.

Figure 2.4: Conceptual MapReduce Flow

Input Data Chunks
+---+ +---+ +---+
| D1| | D2| | D3| ...
+---+ +---+ +---+
| | |
v v v
Map Phase (Parallel on Nodes)
+-----+ +-----+ +-----+
| Map1| | Map2| | Map3| ...
| Gen KV| | Gen KV| | Gen KV|
+-----+ +-----+ +-----+
| | |
v v v
Intermediate KV Pairs
[k1, v1], [k2, v2], [k1, v3], ...
|
v
Shuffle & Sort (Group by Key)
+-----------------+
| [k1, [v1, v3]] |
| [k2, [v2]] | ...
+-----------------+
|
v
Reduce Phase (Parallel)
+-------+ +-------+
| Red1 | | Red2 | ...
| Output| | Output|
+-------+ +-------+
| |
v v
Final Output
+-----------+
| Aggregate |
| Result |
+-----------+

(Note: This shows the processing stages across a distributed system.)

ELABORATE REPLICATION AND SHARDING WITH NEAT SKETCH

Replication and sharding are two fundamental techniques used in distributed
database systems, including NoSQL databases, to achieve scalability, high
availability, and fault tolerance. While both involve distributing data across
multiple servers, they serve different primary purposes and operate
differently.

Replication

Definition: Replication is the process of creating and maintaining multiple
copies of data across different servers or nodes in a database cluster. The
primary goal is to ensure high availability (data remains accessible even if a
node fails) and fault tolerance (the system can withstand failures). Replication
can also improve read performance by distributing read requests across
multiple replicas.

Types of Replication:

• Master-Replica (Primary-Secondary): One node is designated as the
master (primary) and handles all write operations. Replicas (secondaries)
asynchronously or synchronously copy data changes from the master.
Read operations can be distributed across the master and replicas. This
is relatively simple to manage but the master is a single point of failure
for writes (unless automatic failover is configured).
• Multi-Master: Multiple nodes can accept write operations concurrently.
Data changes are then synchronized between the masters. This offers
higher write availability but introduces potential conflicts if the same
data item is modified on different masters simultaneously, requiring
conflict resolution mechanisms.
• Peer-to-Peer (or Masterless): All nodes are peers and can accept both
read and write operations. Data changes are propagated between nodes
using protocols like gossip. This provides high availability and horizontal
write scalability but relies heavily on robust conflict resolution (often
eventual consistency) and distributed coordination. Apache Cassandra
uses this model.

Diagram Concept: Show multiple server nodes. One node is labelled
"Master" (or Primary). Other nodes are labelled "Replicas" (or Secondaries).
Arrows show write operations going only to the Master, and then data
changes flowing from the Master to the Replicas (one-way arrows for master-
replica). Show read operations being handled by multiple nodes (including
replicas). For multi-master or peer-to-peer, show write operations going to
multiple nodes, and data changes flowing between all nodes in a more mesh-
like pattern.

Figure 2.5: Conceptual Master-Replica Replication

                     Writes                      Reads
    App Clients --------------> +---------+ <-------------- App Clients
                                | Master  |
                                |  Node   |
                                |---------|
                                | Data A  |
                                | Data B  |
                                | Data C  |
                                +---------+
                               /     |     \
                    Data Flow (Sync/Async replication)
                             /       |       \
                            v        v        v
                    +---------+ +---------+ +---------+
                    | Replica | | Replica | | Replica |
                    | Node 1  | | Node 2  | | Node 3  |
                    |---------| |---------| |---------|
                    | Data A  | | Data A  | | Data A  |
                    | Data B  | | Data B  | | Data B  |
                    | Data C  | | Data C  | | Data C  |
                    +---------+ +---------+ +---------+
                         ^           ^           ^
                         +--- Reads also handled by Replicas ---+

(Note: Writes go to the Master; data is copied to the Replicas; reads can go to the Master or the Replicas.)
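
In practice a driver hides most of this; the pymongo sketch below connects to a hypothetical three-member replica set (hostnames and replica set name are assumptions) and uses a read preference so that reads can be served by secondaries.

    from pymongo import MongoClient  # pip install pymongo

    # Hostnames and replica set name are illustrative.
    client = MongoClient(
        "mongodb://db1.example:27017,db2.example:27017,db3.example:27017"
        "/?replicaSet=rs0&readPreference=secondaryPreferred"
    )
    orders = client["shop"]["orders"]

    # Writes go to the primary; if it fails, the set elects a new primary and
    # the driver fails over automatically.
    orders.insert_one({"_id": "order:1", "total": 42.0})

    # With secondaryPreferred, reads are spread across replicas when available,
    # at the cost of possibly reading slightly stale data.
    doc = orders.find_one({"_id": "order:1"})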

Sharding (Horizontal Partitioning)

Definition: Sharding is the process of partitioning a large database into
smaller, more manageable pieces called "shards". Each shard is an
independent database segment (often running on a separate server or group
of servers) that contains a subset of the total data. The primary goal is to
achieve horizontal scalability for both read and write operations and to
manage datasets that are too large to fit on a single server.

Types of Sharding:

• Range-Based Sharding: Data is partitioned based on a range of values
of a "shard key" (e.g., records with IDs 1-1000 go to Shard 1, 1001-2000
to Shard 2). Simple to implement for ordered data but can lead to
hotspots if data access is concentrated in certain ranges.
• Hash-Based Sharding: Data is partitioned based on the hash value of
the shard key. This distributes data more evenly across shards, reducing
hotspots. However, retrieving a range of data requires querying all
shards.
• Directory-Based Sharding: A lookup service (or directory) maintains a
map of which data (or which range/hash of keys) resides on which shard.
This adds complexity but offers flexibility, allowing shards to be moved
or rebalanced dynamically by updating the directory.

Diagram Concept: Show the total dataset being divided into distinct partitions
(Shards). Each Shard resides on a separate server node. Show a routing layer
or a client knowing how to direct a read/write request for a specific data item
(identified by the shard key) to the correct Shard.

Figure 2.6: Conceptual Sharding (Hash-Based Example)

    App Clients
    +-----------------+
    | Client          |
    | (requests data) |
    +-----------------+
        | Request for KeyX
        v
    +------------------------------+
    | Routing/Query Layer          |
    | (e.g., Mongos)               |
    |------------------------------|
    | Determines Shard for KeyX    |
    | (e.g., hash(KeyX) % N)       |
    +------------------------------+
        | Request routed to Shard 2
        v
    +------------+   +------------+   +------------+
    |  Shard 1   |   |  Shard 2   |   |  Shard 3   |  ...
    | (Server A) |   | (Server B) |   | (Server C) |
    |------------|   |------------|   |------------|
    | Data K-V   |   | Data L-P   |   | Data Q-T   |
    +------------+   +------------+   +------------+
                     (Contains KeyX)

(Note: Data is split across Shards; a router directs requests to the correct Shard.)
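
The routing step in the figure can be sketched in a few lines of Python; the modulo-hash scheme below mirrors the hash(KeyX) % N idea and is only a toy (production systems typically use consistent hashing or range maps so that shards can be rebalanced).

    import hashlib

    SHARDS = [{}, {}, {}]  # stand-ins for three shard servers

    def shard_for(key: str) -> dict:
        # Deterministic hash of the shard key, modulo the number of shards.
        digest = hashlib.md5(key.encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    def put(key: str, value) -> None:
        shard_for(key)[key] = value

    def get(key: str):
        return shard_for(key).get(key)

    put("user:123", {"name": "Alice"})
    assert get("user:123") == {"name": "Alice"}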

DISCUSS HOW SHARDING AND REPLICATION, IF COMBINED, CAN IMPROVE SCALABILITY OF THE DB SYSTEM?

Combining sharding and replication is a common and powerful strategy in
distributed NoSQL databases to achieve both high availability and massive
horizontal scalability.

How they work together:

In a combined system, the dataset is first partitioned into multiple shards
(using range, hash, or directory-based methods). Then, each individual shard
is replicated across several nodes. So, instead of having a single, non-
replicated shard on a server, each shard becomes a replicated set of data. A
cluster consists of multiple replicated shards.

Benefits for Scalability and Availability:

1. Enhanced Write Scalability: Sharding distributes write operations across
multiple servers (the primary replica of each shard). This allows the
system to handle a much higher volume of write traffic in parallel
compared to a single server or a replicated system without sharding
(where all writes might bottleneck at the master).
2. Enhanced Read Scalability: Read operations are first directed to the
correct shard based on the shard key. Within that shard's replicated set,
read requests can be distributed among the replicas. This multiplies the
read capacity – not only are reads distributed across different shards,
but reads for data within a specific shard are distributed across its
replicas.
3. Improved Availability and Fault Tolerance: If a node hosting a primary
shard replica fails, a secondary replica from the same shard on another
node can be promoted to primary. If an entire node or a significant
portion of the cluster goes down, only the data on the failed shards
hosted *only* on those nodes is affected (if not sufficiently replicated).
With replication on each shard, the system can tolerate the loss of nodes
hosting replicas without losing data or becoming unavailable, as long as
at least one replica for each shard remains accessible.
4. Management of Large Datasets: Sharding allows the total dataset size
to exceed the capacity of any single node, while replication within each
shard ensures that even if a node holding a shard fails, the data remains
available elsewhere.
5. Reduced Contention: By distributing both data and read/write load
across many machines and replicas, contention for resources (CPU,
memory, disk I/O) on any single machine is reduced.

In essence, sharding divides the problem of scaling (both data volume and
request load) into smaller, more manageable pieces (shards), and replication
ensures that each of these pieces is highly available and can handle a higher
volume of read requests. The combination provides a robust foundation for
building highly scalable and fault-tolerant database systems capable of
handling web-scale workloads.

Diagram Concept: Show the total dataset conceptually divided into Shard 1,
Shard 2, Shard 3, etc. Then, for *each* Shard, show it being replicated across
multiple nodes, forming replicated sets (e.g., Shard 1 Replicated Set, Shard 2
Replicated Set). Show requests coming in and being routed first to the correct
Shard's replicated set, and then potentially directed to a specific replica within
that set for reads. Show how the failure of a node within one replicated set
does not affect the availability of other replicated sets or even the data within
the same set if other replicas exist.

Figure 2.7: Conceptual Combined Sharding and Replication

    App Clients
    +--------+
    | Client |
    +--------+
        | Request for KeyX
        v
    +----------------------+
    | Routing/Query Layer  |
    | (e.g., Mongos)       |
    +----------------------+
        | Routes request based on KeyX
        | (e.g., to the Shard 2 Replicated Set)
        v
    +----------------------+  +----------------------+  +----------------------+
    | Shard 1 Replicated   |  | Shard 2 Replicated   |  | Shard 3 Replicated   | ...
    | Set (Data A-J)       |  | Set (Data K-P)       |  | Set (Data Q-Z)       |
    |----------------------|  |----------------------|  |----------------------|
    | Node 1: Replica      |  | Node 4: Master       |  | Node 7: Master       |
    | Node 2: Master       |  | Node 5: Replica      |  | Node 8: Replica      |
    | Node 3: Replica      |  | Node 6: Replica      |  | Node 9: Replica      |
    +----------------------+  +----------------------+  +----------------------+

(Note: The dataset is split into Shards. Each Shard has multiple Replicas on different Nodes. Requests are routed to the correct Shard's replica set.)

UNIT 3: MONGODB AND APPLICATIONS OF DOCUMENT DATABASES
HOW MONGODB ADDS VALUE TO MODERN APPLICATIONS

MongoDB, as a leading NoSQL document database, brings significant value to
modern application development, particularly in environments requiring
agility, scalability, and the ability to handle diverse data types. Its design
addresses many limitations faced by traditional relational databases in the
context of today's fast-evolving digital landscape.

Key ways MongoDB adds value include:

• Flexible Schema: Modern applications often have rapidly changing data
requirements. MongoDB's document model allows developers to evolve
the schema without complex, time-consuming migrations that are
typical in relational databases. New fields can be added to documents
easily, accommodating changes in application features and data sources
quickly.
• Developer Productivity: Storing data in JSON-like BSON documents
aligns naturally with object-oriented programming paradigms.
Developers can often map application objects directly to documents,
reducing the need for Object-Relational Mapping (ORM) layers and
simplifying code, thus accelerating development cycles.
• Scalability: MongoDB is designed for horizontal scalability out-of-the-
box using sharding. It can distribute data and load across clusters of
commodity servers, enabling applications to handle massive volumes of
data and traffic growth seamlessly and cost-effectively.
• Performance: Retrieving an entire document often involves a single read
operation, which can be faster than performing multiple joins across
normalized tables in an RDBMS to reconstruct a complex object.
MongoDB also offers various index types (including secondary,
geospatial, text) to optimize query performance.
• Rich Querying Capabilities: Unlike simpler Key-Value stores, MongoDB
provides a powerful query language that supports complex filtering,
range queries, geospatial queries, text search, and a robust aggregation
framework for analytical processing directly within the database.
• High Availability: Replication using replica sets provides automatic
failover and data redundancy, ensuring applications remain available
even if primary nodes fail.
• Support for Unstructured and Semi-structured Data: MongoDB is
inherently suited for handling data that doesn't fit neatly into rigid rows
and columns, such as user profiles, product catalogs with varying
attributes, content, and log data.

These features collectively make MongoDB an excellent choice for a wide
range of modern applications, including web and mobile backends, content
management systems, product catalogs, real-time analytics, and IoT data
platforms.
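
As a small illustration of the flexible-schema and developer-productivity
points above, the following PyMongo sketch (collection and field names are
hypothetical) stores two differently shaped documents in the same collection
and later adds a new field without any migration.

# Minimal sketch (PyMongo, hypothetical names): two products with different
# fields can live in the same collection without any schema migration.
from pymongo import MongoClient

products = MongoClient()["shop"]["products"]

products.insert_one({"_id": "tv-100", "name": "TV", "screen_inches": 55,
                     "resolution": "4K"})
products.insert_one({"_id": "shirt-7", "name": "Shirt", "size": "M",
                     "material": "cotton"})  # different attributes, same collection

# A later feature can simply start writing a new field, e.g. "warranty_months".
products.update_one({"_id": "tv-100"}, {"$set": {"warranty_months": 24}})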

DOCUMENT DATABASES AND THEIR RELATIONSHIPS WITH OTHER DATABASES

What is a Document Database?

A document database is a type of NoSQL database that stores data in a flexible,
semi-structured format known as "documents". These documents are self-
contained units, often using formats like JSON (JavaScript Object Notation),
BSON (Binary JSON, used by MongoDB), or XML. Each document typically has
a unique identifier and can contain fields, arrays, and nested sub-documents.

Key characteristics include:

• Flexible Schema: Documents within the same collection can have
different structures, allowing for easy evolution of the data model.
• Hierarchical Data: Documents can represent complex hierarchical
relationships through nested fields and arrays, naturally mirroring
object structures in programming languages.
• Querying by Content: Databases can index and query based on the
content of fields within the documents, not just the document ID.
• Aggregate-Oriented: Data is stored and retrieved as aggregates (the
documents), contrasting with relational databases that spread an
entity's data across multiple tables.
Relationship with Other Database Types:

• vs. Relational Databases (RDBMS): Document databases differ
fundamentally from RDBMS by using a flexible, non-tabular schema and
storing related data within a single document rather than normalizing it
across multiple tables linked by foreign keys. This reduces impedance
mismatch with object-oriented code and improves performance for
retrieving complex aggregates but can make querying relationships
between documents more complex.
• vs. Key-Value Stores: Document databases can be seen as an evolution
of key-value stores. While a key-value store treats the value as an
opaque blob, a document database understands the structure within
the document (the value) and allows querying and indexing on internal
fields. A document ID acts as the key, and the document content is the
value, but the database provides more functionality based on the value's
structure.
• vs. Column-Family Stores: Column-family stores organize data by rows
and flexible columns within column families, optimized for querying
subsets of columns across vast numbers of rows. Document databases
store self-contained documents, better suited for retrieving and
manipulating entire entities with varying structures.
• vs. Graph Databases: Graph databases are designed specifically to
model and query relationships between entities as nodes and edges.
While document databases can store relationships by embedding IDs or
references, traversing complex, deep relationships is typically much
more performant and natural in a graph database than in a document
database.

Document databases strike a balance between the simplicity and speed of
key-value stores and the query power and structured nature of relational
databases, making them suitable for a wide range of applications where data
structure flexibility is key.

SIX APPLICATIONS OF DOCUMENT DATABASES

Document databases excel in applications where data is semi-structured,
evolves frequently, or where retrieving data as self-contained units is efficient.
Here are six common applications:

1. Content Management Systems (CMS): Websites, blogs, and publishing
platforms manage articles, pages, user comments, and media. Each
piece of content can be stored as a document with varying fields (title,
body, author, publish date, tags, comments array). The flexible schema
handles different content types easily.
2. Product Catalogs and E-commerce: Product information often has
diverse attributes (e.g., electronics vs. clothing). Storing each product as
a document allows for product-specific fields without a sparse RDBMS
schema. Queries can filter products based on any attribute within the
document.
3. User Profiles and Personalization: User profiles can store varied
information like contact details, preferences, activity logs, and social
connections. Document databases handle the optional and evolving
nature of user data well, enabling rich profiles and personalized
experiences.
4. Mobile Applications: Mobile apps often require flexible data models and
easy synchronization. Document databases like MongoDB can serve as a
robust backend, with documents mapping directly to app data
structures, simplifying development.
5. Internet of Things (IoT) Data: Devices generate diverse data streams
(sensor readings, device status). Storing each reading or status update
as a document allows for easy ingestion and querying of large volumes
of time-series or event-based data, even with varying data points per
device.
6. Real-time Analytics and Event Logging: Application logs, clickstream
data, and sensor events can be stored as documents. Their flexibility
handles heterogeneous log formats, and the ability to query fields
allows for fast filtering, aggregation, and real-time dashboards.

CONSISTENCY MODELS USED IN NOSQL DATABASES

Consistency in distributed systems refers to the property that ensures every
read receives the most recent write. The CAP theorem states that a
distributed system cannot simultaneously guarantee Consistency, Availability,
and Partition Tolerance. NoSQL databases often make trade-offs among these
properties, leading to various consistency models.

Common consistency models in NoSQL databases include:

• Strong Consistency: All nodes have the most up-to-date version of the
data. After a write operation completes, any subsequent read operation
is guaranteed to return the latest value. This is the model often
associated with ACID-compliant RDBMS and certain NoSQL systems (like
HBase per row, or MongoDB with specific write/read concerns). It
requires coordination across nodes for every write, which can impact
availability and latency during network partitions.
• Eventual Consistency: If no new writes occur on a data item, eventually
all reads of that item will return the last written value. There might be a
period after a write where different replicas return different values. This
model prioritizes availability and partition tolerance over immediate
consistency. It's common in AP systems under CAP, such as Cassandra
and many Key-Value stores. Applications using this model must be
designed to handle temporary inconsistencies.
• Causal Consistency: If process A has seen an update, process B (which
causally depends on A) will also see the update, and in the same order.
Writes that are not causally related can be seen in different orders on
different nodes. This is stronger than eventual consistency but weaker
than strong consistency.
• Read-Your-Own-Writes Consistency: If a process writes a data item, any
subsequent read by that same process will return the value just written.
This guarantees that a user sees their own updates immediately, even if
other users might not yet see them (due to eventual consistency
propagation).
• Session Consistency: A client within a single 'session' (e.g., a user
session in a web application) will experience Read-Your-Own-Writes
consistency. The system attempts to maintain consistency for that
specific client's sequence of operations, though different sessions might
still see data in different states relative to each other.
• Bounded Staleness: A system with bounded staleness guarantees that
reads are not "too" stale. This can be defined by time (e.g., reads are no
more than 10 seconds behind the master) or by the number of updates
(e.g., reads reflect at least the last 5 updates). It provides a quantifiable
bound on the inconsistency.

Choosing a consistency model involves trade-offs. Strong consistency
simplifies application development but can reduce availability and
performance. Eventual consistency requires more complex application logic
to handle potential staleness but offers higher availability and scalability.
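
Many NoSQL systems expose these trade-offs as per-operation settings rather
than a single global mode. The sketch below is a hedged example using PyMongo
against an assumed replica set (connection string, database, and collection
names are hypothetical); it contrasts a "majority" write/read configuration
that behaves closer to strong consistency with a faster, weaker configuration.

# Minimal sketch (PyMongo, hypothetical names): tuning consistency per collection.
from pymongo import MongoClient, WriteConcern
from pymongo.read_concern import ReadConcern

db = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")["bank"]

# Stronger guarantees: wait for a majority of replicas on writes, and read only
# data acknowledged by a majority (closer to strong consistency, higher latency).
accounts_strong = db.get_collection(
    "accounts",
    write_concern=WriteConcern(w="majority"),
    read_concern=ReadConcern("majority"),
)

# Weaker guarantees: acknowledge after the primary only (lower latency; a write
# may be rolled back after a failover) - closer to eventual consistency.
accounts_fast = db.get_collection("accounts", write_concern=WriteConcern(w=1))

accounts_strong.insert_one({"_id": "acct-1", "balance": 100})
print(accounts_strong.find_one({"_id": "acct-1"}))
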
MONGODB QUERY FEATURES SUMMARY

MongoDB provides a rich set of query features compared to simpler NoSQL
types, allowing applications to retrieve and manipulate documents effectively:

• Document-Based Queries: Queries target collections and use a JSON-
like syntax to specify criteria based on field values, data types, array
elements, and nested documents.
• CRUD Operations: Standard Create, Read, Update, and Delete
operations on documents. Read operations use methods like find()
and findOne() .
• Query Operators: A wide array of operators for comparisons ( $gt ,
$lt , $eq ), logical operations ( $and , $or , $not ), array
manipulation ( $in , $all , $elemMatch ), and more.
• Projections: Ability to select which fields to include or exclude in the
query results, reducing the amount of data transferred.
• Sorting and Pagination: Results can be sorted by one or more fields, and
queries support skipping and limiting results for pagination.
• Indexing: Supports various index types (single field, compound, multi-
key for arrays, geospatial, text, hashed) to accelerate queries.
• Geospatial Queries: Specialized operators for querying geographical
data (e.g., finding points within a radius or polygon).
• Text Search: Ability to perform full-text search queries on string content,
supporting various languages.
• Aggregation Framework: A powerful pipeline-based framework for
performing complex data transformations and aggregations (grouping,
filtering, projecting, joining data from different collections - similar to
JOINs but in the application layer or aggregation pipeline stages).

These features provide developers with robust tools for interacting with data
stored in documents, enabling sophisticated data retrieval and analysis
directly within the database.
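
The following PyMongo sketch (hypothetical collection and field names) touches
several of these features in one place: query operators, projection, sorting
with pagination, and a small aggregation pipeline.

# Minimal sketch (PyMongo, hypothetical collection "articles"): query operators,
# projection, sorting/pagination, and an aggregation pipeline.
from pymongo import MongoClient, DESCENDING

articles = MongoClient()["cms"]["articles"]

# Filter with operators, return only selected fields, sort and paginate.
cursor = (articles
          .find({"status": "published", "views": {"$gt": 100},
                 "tags": {"$in": ["nosql", "mongodb"]}},
                {"title": 1, "views": 1, "_id": 0})
          .sort("views", DESCENDING)
          .skip(0)
          .limit(10))

# Aggregation pipeline: count published articles per tag.
pipeline = [
    {"$match": {"status": "published"}},
    {"$unwind": "$tags"},
    {"$group": {"_id": "$tags", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in articles.aggregate(pipeline):
    print(row)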

SHORT NOTES ON EVENT LOGGING AND CONTENT MANAGEMENT SYSTEMS

a) Event Logging:

Event logging involves capturing and storing a continuous stream of discrete
events generated by applications, systems, or devices (e.g., user clicks, system
errors, sensor readings, server requests). These logs are often high-volume,
time-sensitive, and have varying structures. NoSQL databases, particularly
document databases and column-family stores, are well-suited for event
logging because:

• They can handle the high write throughput required to ingest massive
event streams.
• Their flexible schema accommodates the diverse and evolving nature of
log data from different sources.
• They scale horizontally to store potentially petabytes of data.
• Features like time-based partitioning or indexing allow for efficient
querying and analysis of logs over specific time ranges.

Document databases like MongoDB can store each event as a document, with
fields for timestamp, event type, source, user ID, and arbitrary payload data.
This allows rich querying on specific event attributes. Column-family stores
like Cassandra or HBase can store events with timestamp as part of the row
key and event details in columns, optimized for querying events within a time
window for a specific entity.
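
A minimal PyMongo sketch of this pattern is shown below (database, collection,
and field names are hypothetical): each event is one document, an index
supports time-window queries, and the payload can vary per source.

# Minimal sketch (PyMongo, hypothetical names): storing events as documents
# and querying a time window for one event type.
from datetime import datetime, timedelta
from pymongo import MongoClient, ASCENDING

events = MongoClient()["logs"]["events"]
events.create_index([("ts", ASCENDING), ("event_type", ASCENDING)])

events.insert_one({"ts": datetime.utcnow(), "event_type": "login",
                   "source": "web", "user_id": 42,
                   "payload": {"ip": "10.0.0.1"}})  # arbitrary per-source payload

# All login events from the last hour.
since = datetime.utcnow() - timedelta(hours=1)
recent_logins = events.find({"event_type": "login", "ts": {"$gte": since}})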

b) Content Management Systems (CMS):

CMS platforms manage digital content like articles, blog posts, web pages,
images, and videos. This content is often semi-structured and needs to be
easily created, edited, stored, and retrieved for presentation. Document
databases are a natural fit for CMS backends because:

• Each content item (article, page) can be represented as a document.


• The flexible schema allows different content types (e.g., news article vs.
product page) to have different fields without complex database
alterations.
• Hierarchical structures within documents (e.g., storing comments as an
array within a blog post document) map well to content relationships.
• Indexing on fields like title, author, tags, or publication date enables
efficient content retrieval and search.
• Scalability supports growth in content volume and user traffic.

Using MongoDB for a CMS allows storing articles, user data, categories, tags,
and comments efficiently. An article document could contain fields like title,
body, author ID, publication date, status, and an array of embedded comment
documents. User documents could store profile information and links to
authored articles.
BUILDING A BLOGGING PLATFORM WITH MONGODB

While MongoDB is sometimes presented as a "key-value database" for a
blogging platform, it is fundamentally a document database.
However, its document structure, with a unique _id for each document, can
*function* conceptually like a key-value store where the _id is the key and
the entire document is the value. For a blogging platform, this provides
flexibility beyond a simple opaque value.

A blogging platform involves managing several types of data: blog posts,
users (authors), and potentially comments or tags. Using MongoDB, we can
model these using collections:

• posts collection: Each document represents a blog post.
◦ _id : Unique identifier for the post (e.g., ObjectId or a slug).
◦ title : String (Post title).
◦ slug : String (URL-friendly identifier).
◦ author_id : ObjectId reference to the user who wrote the post.
◦ publish_date : Date (When the post was published).
◦ content : String (HTML or Markdown body of the post).
◦ tags : Array of Strings (Tags for categorization).
◦ status : String (e.g., 'draft', 'published').
◦ comments : Array of embedded comment documents (or
references to a separate comments collection for large comment
volumes).
• users collection: Each document represents an author.
◦ _id : Unique identifier for the user.
◦ username : String.
◦ email : String.
◦ bio : String (Author's biography).
◦ ... other profile information ...
• *(Optional) comments collection:* If comments are large or shared.
◦ _id : Unique identifier for the comment.
◦ post_id : ObjectId reference to the post.
◦ author : String (or ObjectId reference to user).
◦ content : String (Comment body).
◦ timestamp : Date.

Working Principle:

To display a blog post, the application queries the posts collection using
the post's slug or _id (functioning like a key lookup for the document). The
retrieved document contains all necessary post details, including embedded
comments if that model is chosen. To show author information, a separate
query can fetch the corresponding user document using the author_id .

Adding a new post is an insert operation into the posts collection. Adding a
comment is an update operation on the specific post document (if
embedding) or an insert into the comments collection (if separate). Queries
can find posts by tags, author, date range, etc., leveraging MongoDB's
indexing capabilities.

This model leverages the document structure to keep related post data
together, making retrieval efficient, while using separate collections for users
and potentially comments to manage different entity types and relationships.
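
The sketch below shows this model in PyMongo, following the collections
described above; the specific values are hypothetical. It inserts a post,
fetches it by slug, and embeds a new comment with $push.

# Minimal sketch (PyMongo) of the blogging model described above.
from datetime import datetime
from pymongo import MongoClient

db = MongoClient()["blog"]

author_id = db.users.insert_one({"username": "author1", "bio": "..."}).inserted_id

db.posts.insert_one({
    "title": "Intro to NoSQL", "slug": "intro-to-nosql",
    "author_id": author_id, "publish_date": datetime.utcnow(),
    "content": "...", "tags": ["nosql"], "status": "published",
    "comments": [],
})

# Display a post: a single lookup by slug returns the whole aggregate.
post = db.posts.find_one({"slug": "intro-to-nosql"})
author = db.users.find_one({"_id": post["author_id"]})

# Add a comment by embedding it into the post document.
db.posts.update_one(
    {"slug": "intro-to-nosql"},
    {"$push": {"comments": {"user": "guest", "content": "Nice post!",
                            "timestamp": datetime.utcnow()}}})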

Figure 3.1: Conceptual MongoDB Data Model for a Blogging Platform

+----------------------------+       +----------------------------+
| Collection: posts          |       | Collection: users          |
|----------------------------|       |----------------------------|
| Document 1:                |       | Document A:                |
| {                          |       | {                          |
|   _id: ObjectId,           |       |   _id: RefA,               |
|   title: "Post A",         |  +--->|   username: "Author1",     |
|   slug: "post-a",          |  |    |   ...                      |
|   author_id: RefA, --------+--+    | }                          |
|   publish_date: ...,       |       |                            |
|   content: "...",          |       | Document B:                |
|   tags: ["nosql"],         |       | {                          |
|   comments: [              |  +--->|   _id: RefB,               |
|     { user:.., content:.. },  |    |   username: "Author2",     |
|     { user:.., content:.. }|  |    |   ...                      |
|   ]                        |  |    | }                          |
| }                          |  |    +----------------------------+
|                            |  |
| Document 2:                |  |
| {                          |  |
|   _id: ObjectId,           |  |
|   title: "Post B",         |  |
|   slug: "post-b",          |  |
|   author_id: RefB, --------+--+
|   ...                      |
| }                          |
+----------------------------+

// Optional separate comments collection

+----------------------------+
| Collection: comments       |
|----------------------------|
| Document X:                |
| {                          |
|   _id: ObjectId,           |
|   post_id: Ref1, ----------+--> references a post's _id
|   author: "Guest",         |
|   content: "...",          |
|   timestamp: ...           |
| }                          |
|                            |
| Document Y:                |
| {                          |
|   _id: ObjectId,           |
|   post_id: Ref1,           |
|   author_id: RefA,         |
|   content: "...",          |
|   timestamp: ...           |
| }                          |
+----------------------------+

(Note: This shows collections of documents. References are conceptual links,
not strict foreign keys like in RDBMS. Comments can be embedded or in a
separate collection.)

REAL-TIME ANALYTICS WITH NOSQL DATABASES FOR INDUSTRY APPLICATIONS

Real-time analytics involves processing and analyzing data as it is generated
or arrives, providing immediate insights. Industrial applications, such as
manufacturing, logistics, energy, and monitoring systems, generate vast
amounts of data from sensors, machinery, and processes that are often
unstructured or semi-structured and arrive at high velocity. NoSQL databases,
particularly those optimized for handling high write throughput and flexible
data, are crucial for enabling real-time analytics in these scenarios.

How NoSQL facilitates Real-Time Analytics:

• High-Velocity Data Ingestion: NoSQL databases like Cassandra
(optimized for writes) or MongoDB (flexible ingestion of varied
documents) can handle the continuous stream of high-volume data
generated by industrial equipment and sensors without becoming a
bottleneck.
• Flexible Data Models: Industrial data can come in many formats from
different types of sensors or machines. The flexible schema of document
or column-family stores allows ingesting this heterogeneous data
without predefined schemas, enabling faster onboarding of new data
sources.
• Scalability: Horizontal scalability allows the database layer to grow with
the increasing volume of data generated by expanding operations or
adding more sensors.
• Fast Querying for Recent Data: NoSQL databases can be designed and
indexed for rapid access to the most recent data points, crucial for
monitoring current conditions, detecting anomalies, and triggering
immediate actions. Time-series data patterns are particularly well-suited
for column-family stores or document stores with time-based keys/
indexes.
• Integration with Processing Pipelines: NoSQL databases serve as a
landing zone for raw or semi-processed industrial data. This data can
then be fed into real-time processing frameworks (e.g., Apache Kafka,
Spark Streaming, Flink) for complex analysis, anomaly detection,
predictive maintenance, and visualization on dashboards, enabling
immediate operational insights.

For example, in a manufacturing plant, sensor data on temperature, pressure,
and vibration from machines can be ingested into a NoSQL database. Real-
time analytics can monitor this data stream to detect deviations indicating
potential equipment failure, triggering alerts for proactive maintenance and
preventing costly downtime. The NoSQL database stores the raw data for
historical analysis and serves the recent data for immediate monitoring.

SHORT NOTES ON E-COMMERCE AND REAL-LIFE EXAMPLES OF MONGODB

a) E-commerce:

E-commerce platforms involve complex data related to products, customers,
orders, inventory, and sessions. Document databases are a strong fit for
several aspects of e-commerce:

• Product Catalogs: As discussed earlier, products have highly varied
attributes, making the flexible document model ideal.
• Customer Profiles: Storing diverse customer information, preferences,
and browsing history.
• Shopping Carts and Sessions: Key-value or document stores can
efficiently manage transient session data and shopping cart contents,
which are frequently updated.
• Order History: Storing details of each order, which can include lists of
items, shipping information, payment status, etc., often in a nested
structure within the order document.
• Content Management: Managing website content, promotional
material, and product descriptions (linking back to CMS use case).

NoSQL's scalability handles peak traffic loads, and the flexible schema allows
e-commerce platforms to quickly introduce new product types or features.

b) Real Life Examples of MongoDB:


MongoDB is used by a wide range of companies across various industries due
to its versatility and scalability. Examples include:

• Netflix: Uses MongoDB for various purposes, including managing
subscriber data and content metadata, leveraging its scalability and
flexible schema to handle diverse data types and rapid growth.
• Adobe: Employs MongoDB for managing data within its Experience
Cloud products, particularly for handling large volumes of diverse
customer interaction data and providing real-time insights.
• Electronic Arts (EA): Uses MongoDB for handling player data, game
states, and leaderboards in online games, benefiting from its ability to
store complex, evolving player profiles and scale with user load.
• Expedia: Utilizes MongoDB for managing travel data, such as hotel
details and search results, where the data structure can be highly
variable and real-time access is critical.
• The Weather Channel: Uses MongoDB to store and process massive
volumes of real-time weather data, enabling fast retrieval and analysis
for forecasts and applications.

These examples highlight MongoDB's applicability in scenarios requiring high
performance, scalability, flexible data models, and the ability to handle large,
complex datasets in real-world applications.

UNIT 4: COLUMN ORIENTED NOSQL DATABASES
PURPOSE OF APACHE HBASE COLUMN-ORIENTED NOSQL
DATABASE

Apache HBase is an open-source, non-relational, distributed database
modeled after Google's Bigtable. It is designed to provide random, real-time
read/write access to petabytes of data. HBase is not a direct replacement for
SQL databases but is built to handle specific types of workloads that
traditional relational databases struggle with, particularly those involving very
large, sparse datasets and high-volume reads and writes.

As a column-oriented (or more accurately, a wide-column store) database,
HBase stores data in a way that allows for efficient retrieval of data by column
families rather than by row across all potential columns. This makes it highly
suitable for scenarios where you have billions of rows but only need to access
a few columns or column families at a time.
The primary purposes of Apache HBase include:

• Handling Massive Sparse Datasets: It is optimized for tables with
potentially billions of rows and millions of columns, where many column
values for a given row might be empty (sparse). Storing data by column
family makes this efficient.
• Real-time Read/Write Access on Large Data: Unlike Hadoop's HDFS
(which is batch-oriented), HBase provides low-latency random read and
write access to individual records within the massive dataset stored on
HDFS.
• Integration with Hadoop Ecosystem: HBase is typically built on top of
HDFS and integrates tightly with other Hadoop components like
MapReduce (though modern approaches might use Spark for
processing data stored in HBase). This allows for both real-time
operational access and batch analytical processing on the same data.
• Versioning: HBase automatically versions data, storing multiple
timestamped versions of a cell's value. This is useful for tracking
changes over time.
• Strong Consistency within a Row: HBase offers strong consistency for
operations within a single row, which is a key differentiator from some
other NoSQL databases that only offer eventual consistency.

In essence, HBase serves as the operational database layer on top of HDFS,
providing fast lookups and updates on vast amounts of data that would
overwhelm traditional databases.

ARCHITECTURE OF APACHE HBASE

Apache HBase operates as a distributed system designed for scalability and
fault tolerance. Its architecture consists of several key components:

1. HMaster:

• Acts as the master server. There can be multiple HMasters for failover,
but only one is active at a time.
• Responsible for coordinating RegionServers, managing table schemas,
handling region assignments (assigning data partitions to
RegionServers), load balancing regions, and handling region server
failures.
• Does not serve data itself; it's primarily a metadata and coordination
service.
2. RegionServers:

• These are the worker nodes that host and manage data regions.
• Each RegionServer is responsible for a subset of the table's data (one or
more regions).
• Handles read and write requests for the regions it serves directly from
client applications.
• Communicates with HDFS to store and retrieve data (in HFiles) and uses
a Write-Ahead Log (WAL) for durability.

3. Regions:

• A region is a contiguous sorted range of rows in an HBase table.


• Tables are initially created with a single region and are automatically
split into smaller regions as data grows.
• Regions are the basic unit of distribution and load balancing in HBase;
each region is assigned to a specific RegionServer.

4. ZooKeeper:

• HBase relies on ZooKeeper for distributed coordination.


• Used for master election (determining the active HMaster), managing
configuration, tracking the state of RegionServers (which regions are on
which server), and providing a distributed locking mechanism.

5. HDFS (Hadoop Distributed File System):

• HBase stores its data files (HFiles) and Write-Ahead Logs (WALs) on
HDFS.
• HDFS provides the underlying distributed, fault-tolerant storage layer.
• HDFS DataNodes store the actual data blocks, and the NameNode
manages the HDFS metadata.

Data Flow (Simplified):

• Clients interact directly with RegionServers for data reads and writes.
• The client library uses ZooKeeper or HMaster to find the RegionServer
hosting the region for a given row key.
• RegionServers store writes temporarily in a MemStore (in memory) and
also write to a WAL (on HDFS) for durability.
• When a MemStore reaches a certain size, its contents are flushed to disk
as an HFile on HDFS.
• Reads first check the MemStore, then HFiles.
• Periodically, HFiles are merged (compaction) to optimize storage and
read performance.

Figure 4.1: Apache HBase Architecture

+-------------------------+          +-------------------------+
|  Client Applications    |          |        ZooKeeper        |
+-------------------------+          |  (Coordination,         |
            |                        |   Master Election)      |
            v                        +-------------------------+
+-------------------------+               ^              ^
|  HBase Client Library   |<--------------+              |
|  (Finds RegionServer)   |                              |
+-------------------------+                              |
            | Read/Write Requests                        |
            |                                            |
            |     +-----------------------------------+  |
            |     |              HMaster              |--+
            |     |  (Region Assignment, Load         |
            |     |   Balancing, Failover)            |
            |     +-----------------------------------+
            |                    | Manages/Monitors
            v                    v
+-----------------+  +-----------------+  +-----------------+
| RegionServer 1  |  | RegionServer 2  |  | RegionServer N  |
|-----------------|  |-----------------|  |-----------------|
| Region A        |  | Region B        |  | Region C        |
|  +-----------+  |  |  +-----------+  |  |  +-----------+  |
|  | MemStore  |  |  |  | MemStore  |  |  |  | MemStore  |  |
|  +-----------+  |  |  +-----------+  |  |  +-----------+  |
|  | HFiles    |  |  |  | HFiles    |  |  |  | HFiles    |  |
|  +-----------+  |  |  +-----------+  |  |  +-----------+  |
|  | WAL       |  |  |  | WAL       |  |  |  | WAL       |  |
|  +-----------+  |  |  +-----------+  |  |  +-----------+  |
+-----------------+  +-----------------+  +-----------------+
         |                    |                    |
         v                    v                    v
+---------------------------------------------------------+
|         HDFS (Hadoop Distributed File System)           |
|               (Stores HFiles and WALs)                  |
|    +----------+    +----------+    +----------+         |
|    | DataNode |    | DataNode |    | DataNode |  ...    |
|    +----------+    +----------+    +----------+         |
+---------------------------------------------------------+

(Note: HFiles contain the actual data. WAL ensures durability before
data is written to HFiles. MemStore is in-memory cache for writes.)

ADVANTAGES AND DISADVANTAGES OF APACHE HBASE

Advantages:

• Scalability: Horizontally scales to handle billions of rows and millions of
columns across a cluster of commodity servers.
• Random Real-time Access: Provides low-latency reads and writes to
individual records within massive datasets, unlike the batch processing
nature of raw HDFS.
• Integration with Hadoop: Seamlessly integrates with HDFS for storage
and can be easily used with Hadoop ecosystem tools for analytics (e.g.,
MapReduce, Spark).
• Strong Consistency (within a row): Guarantees strong consistency for
operations within a single row, which is important for many applications.
• High Availability: Built on HDFS's fault tolerance and includes features
like HMaster failover and region server recovery to ensure data
availability.
• Versioning: Automatically keeps multiple timestamped versions of cell
values, allowing retrieval of historical data.
• Schema Flexibility: While column families must be defined upfront, new
columns can be added dynamically within a column family without
schema changes.
Disadvantages:

• Complexity: HBase is a complex distributed system requiring expertise
to set up, configure, and manage, especially alongside HDFS and
ZooKeeper.
• No SQL Support: Does not support SQL. Data access is primarily through
its native APIs (Java, etc.) or higher-level abstractions (like Apache
Phoenix for SQL-like access, but this is a layer on top). This makes
migration from RDBMS challenging.
• Limited Querying: Optimized for row key lookups and scans within
column families. Complex queries across multiple rows, joins, or
aggregations are typically performed using integrated processing
frameworks like Spark or MapReduce, not within HBase itself.
• Higher Latency for Simple Lookups: Compared to in-memory caches or
some simpler NoSQL databases, HBase might have higher latency for
very simple key-value lookups due to its architecture built on HDFS and
LSM-tree storage structure.
• Storage Overhead: Can have storage overhead due to its versioning and
storage format.
• Reliance on HDFS: Its performance and availability are tied to the
underlying HDFS cluster.
• Hotspotting: Poorly chosen row keys can lead to "hotspots" where one
RegionServer handles a disproportionate amount of the traffic.

SHORT NOTES ON APPLICATIONS AND FEATURES OF APACHE HBASE

a) Applications of Apache HBase:

HBase is commonly used in scenarios requiring random, real-time access to
very large datasets, often within a Big Data ecosystem:

• Time-Series Data: Storing and querying large volumes of timestamped
data from sensors, monitoring systems, or financial markets. Row keys
can combine identifiers and timestamps for efficient range scans.
• Operational Data Store for Hadoop: Serving data that is also processed
by batch jobs on Hadoop (e.g., storing user activity logs that are later
analyzed).
• Web Analytics: Storing clickstream data, user interactions, and session
information for large websites.
• Internet of Things (IoT): Ingesting and providing access to massive
amounts of sensor data from connected devices.
• Large-Scale Messaging Systems: Storing message queues or event
streams.
• Fraud Detection: Storing transaction data for real-time lookup and
pattern matching.
• Archival and Versioning: Storing historical records where access to
previous versions is needed.

b) Features of Apache HBase:

• Column-Family Oriented: Data is organized into column families,
allowing flexible columns within those families.
• Schema Flexibility: Columns within a column family are dynamic; no
need to predefine all columns.
• Automatic Sharding: Tables are automatically partitioned into regions,
which are distributed across the cluster.
• Automatic Failover: HMaster and RegionServer failures are handled
automatically (via ZooKeeper and HMaster/backup HMasters).
• Consistent Reads and Writes: Strong consistency for operations within a
row.
• Versioned Data: Multiple versions of data for each cell are maintained,
timestamped by default.
• Integration with Hadoop: Leverages HDFS for reliable storage and
integrates with MapReduce/Spark for batch processing.
• Compression: Supports various compression algorithms to reduce
storage space on HDFS.

DIFFERENTIATION BETWEEN APACHE HBASE AND RDBMS

HBase and RDBMS are fundamentally different database systems designed
for different purposes and workloads.

Feature-by-feature comparison of Apache HBase and Relational Databases (RDBMS):

• Data Model:
◦ Apache HBase: Wide-Column Store (Row Key, Column Families, Columns,
Timestamps). Optimized for sparse data and column-family access.
◦ RDBMS: Relational Model (tables with fixed rows and columns). Data is
normalized across multiple tables.
• Schema:
◦ Apache HBase: Schema-on-read for columns within families; column families
are defined upfront. Flexible column addition within families.
◦ RDBMS: Strict, predefined schema (schema-on-write). Changes require
explicit schema alterations.
• Query Language:
◦ Apache HBase: Native API (Java, REST, Thrift), Filters, limited SQL-like
access via layers like Phoenix. Optimized for Row Key lookups and
Column/Column Family scans.
◦ RDBMS: SQL (Structured Query Language). Powerful for complex joins,
aggregations, and ad-hoc queries across tables.
• Consistency:
◦ Apache HBase: Strong consistency within a row. Eventual consistency across
rows/regions during some operations (like splits).
◦ RDBMS: Strong consistency (ACID properties) for transactions, typically
across multiple rows and tables.
• Scalability:
◦ Apache HBase: Horizontal scaling (scaling out) across commodity hardware.
Designed for petabytes of data.
◦ RDBMS: Primarily vertical scaling (scaling up). Horizontal scaling is
complex and often expensive.
• Use Cases:
◦ Apache HBase: Big Data operational store, time-series data, event logging,
applications needing real-time access to very large, sparse datasets.
◦ RDBMS: Transactional applications, complex reporting, applications needing
strong consistency and complex multi-row transactions.
• Underlying Storage:
◦ Apache HBase: Typically HDFS.
◦ RDBMS: Usually local file system or SAN.
• Joins:
◦ Apache HBase: Not supported natively; must be handled at the application
level or via processing frameworks like Spark/MapReduce.
◦ RDBMS: Native support for JOIN operations via SQL.

In summary, RDBMS is ideal for structured data with complex relationships
and transactional integrity requirements, while HBase excels at storing and
providing random, real-time access to massive, sparse datasets, often as part
of a larger Big Data architecture.

HBASE READ AND WRITE OPERATIONS EXPLAINED

HBase's read and write paths are designed for efficiency in a distributed
environment, leveraging the MemStore and HFiles structure on HDFS.

Write Operation Process:

1. Client Request: A client application sends a Put request (containing the
row key, column family, column, value, and optional timestamp) to the
HBase cluster.
2. RegionServer Identification: The HBase client library (or a router)
determines which RegionServer hosts the region responsible for the
specified row key.
3. WAL Write: The RegionServer first writes the Put operation to the Write-
Ahead Log (WAL) on HDFS. This ensures durability; if the RegionServer
crashes before the data is persisted to disk, the WAL can be replayed to
recover the data. This step is critical for data safety.
4. MemStore Write: After successfully writing to the WAL, the data is
written to the in-memory store called the MemStore for the
corresponding region. The data in the MemStore is sorted by row key,
column family, column, and timestamp.
5. Acknowledgement: Once the data is in the MemStore and successfully
written to the WAL, the RegionServer acknowledges the write to the
client.
6. MemStore Flush: As the MemStore fills up (reaching a configured size),
the RegionServer flushes its contents to a new file on HDFS, called an
HFile. This is a sequential write operation and is relatively fast.
7. Compaction: Over time, many small HFiles are created by flushes. HBase
automatically runs compactions (minor and major) to merge these
HFiles into larger, more optimized ones, reducing the number of files to
check during reads and improving performance.

Figure 4.2: HBase Write Path

+----------+ +-----------------------+
| Client |----->| HBase Client Library |
| (Put Req)| | (Finds RegionServer) |
+----------+ +-----------------------+
| | Request for RowK
v v
+---------------------------------+
| RegionServer |
| (Hosts Region for RowK) |
|---------------------------------|
| 1. Write to WAL (on HDFS) ----> | +-------+
| 2. Write to MemStore (In-Memory)| | WAL |
| 3. Acknowledge Client <---------| +-------+
| |
| As MemStore fills: |
| 4. Flush MemStore to HFile ---->| +-------+
| | | HFile |
| Over time: | +-------+
| 5. Compact HFiles ------------->| +-------+
+---------------------------------+ | HFile |
| +-------+
v
+---------------------------------+
| HDFS (Stores WALs and HFiles) |
+---------------------------------+

(Note: Data is written to WAL and MemStore before acknowledging
the client. MemStore flushes create HFiles on HDFS.)

Read Operation Process:

1. Client Request: A client application sends a Get request (containing the
row key and optionally column families, columns, and timestamps) to
the HBase cluster.
2. RegionServer Identification: The client library determines which
RegionServer hosts the region for the row key.
3. Multi-Source Read: The RegionServer handles the read by checking
multiple sources for the requested data:
◦ First, it checks the MemStore for the latest data.
◦ Then, it checks relevant HFiles on HDFS, starting with the most
recent ones.
Data in HBase is immutable within HFiles; updates and deletes are
represented by new versions or special tombstone markers. Reads need
to look across potentially multiple HFiles and the MemStore to find the
latest relevant version(s) based on timestamps and filters.
4. Merge and Filter: The results from the MemStore and relevant HFiles are
merged. Filters specified in the Get request are applied.
5. Return Result: The final, merged, and filtered data is returned to the
client.

Figure 4.3: HBase Read Path

+----------+ +-----------------------+
| Client |----->| HBase Client Library |
| (Get Req)| | (Finds RegionServer) |
+----------+ +-----------------------+
| | Request for RowK
v v
+---------------------------------+
| RegionServer |
| (Hosts Region for RowK) |
|---------------------------------|
| 1. Check MemStore (In-Memory) |
| 2. Check HFiles (on HDFS) |
| (Multiple HFiles potentially)|
| | +-------+
| 3. Merge data from MemStore | | HFile1|
(Newest)
| and HFiles | +-------+
| 4. Apply Filters | +-------+
| 5. Return Result <--------------| | HFile2|
+---------------------------------+ +-------+
^ +-------+
| | HFile3|
| +-------+
|
+---------------------------------+
| HDFS (Stores HFiles) |
+---------------------------------+

(Note: Read checks MemStore first, then HFiles. Multiple HFiles
might hold different versions or parts of the data for a row.)
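
From the client's point of view, both paths above are hidden behind simple
put/get/scan calls. The following sketch uses the Python happybase client
through the HBase Thrift gateway; the table "sensor" with a column family "d",
and the row-key scheme, are hypothetical and assumed to exist already.

# Minimal sketch (Python happybase client via the HBase Thrift server):
# a Put and a Get as seen from the client side of the paths above.
import happybase

conn = happybase.Connection("localhost", port=9090)  # Thrift gateway
table = conn.table("sensor")

# Write: the RegionServer appends to the WAL, updates the MemStore, then acks.
table.put(b"machine42#20240101T120000",
          {b"d:temp": b"71.5", b"d:pressure": b"1.2"})

# Read: merges MemStore and HFiles and returns the latest version of each cell.
row = table.row(b"machine42#20240101T120000")
print(row[b"d:temp"])

# Range scan over one machine's readings (row keys are sorted lexicographically).
for key, data in table.scan(row_prefix=b"machine42#"):
    print(key, data)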

BASIC CHARACTERISTICS OF COLUMN FAMILY STORES (WIDE-COLUMN STORES)

Column Family Stores, like HBase and Cassandra, represent data differently
from relational or document databases. Their key characteristics stem from
their organization around "column families".

Basic characteristics include:

• Data Model based on Rows and Column Families: Data is
fundamentally structured around a unique row key. Within
each row, columns are grouped into named "column
families".
• Flexible Columns within Families: Unlike the fixed columns of
an RDBMS table, the specific columns that exist within a
column family can vary dynamically from row to row. You
don't need to define all possible columns upfront for every
row.
• Data Stored by Column Family: Data for a specific column
family for all rows within a partition (like an HBase region) is
typically stored together on disk. This is different from row-
oriented storage (like RDBMS) or storing entire documents
(like Document DBs). This structure is efficient for scanning or
reading data for specific columns across many rows.
• Sparse Data Handling: Because columns are not fixed, rows
do not incur storage overhead for columns that are not
present. This makes them highly efficient for storing sparse
data (data where many potential column values are null or
empty).
• Versioned Data: Typically, multiple versions of a cell's value
are stored, often timestamped. This allows querying historical
values.
• Atomic Operations on Row Key: Reads and writes for a given
row key are often atomic, providing strong consistency
guarantees at the row level (as seen in HBase).
• Optimized for Writes and Scans on Columns: The storage
model makes them performant for ingesting high volumes of
writes and for reading specific columns or column families
across a range of rows.

Figure 4.4: Conceptual Column Family Store Structure Details

Conceptual View:

Row Key | CF:PersonalInfo              | CF:ContactInfo                        | CF:Employment
--------|------------------------------|---------------------------------------|-------------------------------
user:1  | Name: "Alice", Age: 30       | Email: "[email protected]"                    | Title: "Eng"
user:2  | Name: "Bob"                  | Phone: "555-1234", Email: "[email protected]" | Dept: "Mktg", HireDate: "2020"
user:3  | Name: "Charlie", City: "NYC" | Email: "[email protected]", Twitter: "@c"     |

Physical Storage View (conceptual - within a partition/node):

Directory/File for CF:PersonalInfo:

Row Key | Qualifier | Value     | Timestamp | (Optional: Type/Version)
--------|-----------|-----------|-----------|-------------------------
user:1  | Name      | "Alice"   | ts1       |
user:1  | Age       | 30        | ts1       |
user:1  | Name      | "Alice"   | ts0       | (Older version)
user:2  | Name      | "Bob"     | ts2       |
user:3  | Name      | "Charlie" | ts3       |
user:3  | City      | "NYC"     | ts3       |

Directory/File for CF:ContactInfo:

Row Key | Qualifier | Value       | Timestamp
--------|-----------|-------------|----------
user:1  | Email     | "[email protected]"  | ts1
user:2  | Phone     | "555-1234"  | ts2
user:2  | Email     | "[email protected]"  | ts2
user:3  | Email     | "[email protected]"  | ts3
user:3  | Twitter   | "@c"        | ts3

Directory/File for CF:Employment:

Row Key | Qualifier | Value   | Timestamp
--------|-----------|---------|----------
user:1  | Title     | "Eng"   | ts1
user:2  | Dept      | "Mktg"  | ts2
user:2  | HireDate  | "2020"  | ts2

(Note: Data is grouped by Column Family. Within a CF file, data is sorted by
Row Key, then Column Qualifier (column name), then Timestamp. Rows don't
need to have all columns.)

APACHE CASSANDRA: DEFINITION AND DATA PROCESSING (READS, WRITES, UPDATES)

What is Apache Cassandra?

Apache Cassandra is a free and open-source, distributed, wide-column store
NoSQL database
management system designed to handle large amounts
of data across many commodity servers, providing high
availability with no single point of failure. It was
originally developed at Facebook and is now managed
by the Apache Software Foundation. Cassandra is
designed for linear scalability and proven fault tolerance
on commodity hardware or cloud infrastructure.
It uses a peer-to-peer, masterless architecture where all
nodes are the same, contrasting with master-slave
systems like HBase (HMaster) or traditional RDBMS
replication.

How are Reads, Writes, and Update Requests Processed in Cassandra?

Cassandra's data processing is designed for high write
throughput and continuous availability, often prioritizing
availability over immediate consistency (offering tunable
consistency levels). It uses a Log-Structured Merge Tree
(LSM-Tree) storage engine, similar in principle to HBase
but implemented differently.

Write Operation Process:

1. Client Request: A client sends a write request
(Insert, Update, Delete - treated as Upserts) to any
node in the Cassandra cluster. This node acts as
the "coordinator".
2. Commit Log: The coordinator node first writes the
mutation (the data change) to a commit log on
disk. This ensures durability even if the node
crashes immediately after receiving the write. The
commit log is crucial for data recovery.
3. MemTable Write: The data is then written to an in-
memory structure called a MemTable. The
MemTable is sorted by row key. Writes to the
MemTable are very fast.
4. Replication: The coordinator node forwards the
write request to the appropriate replica nodes
based on the partitioning strategy (consistent
hashing) and replication factor. Replication
happens asynchronously or semi-synchronously
depending on the chosen consistency level for the
write.
5. Acknowledgement: The coordinator node waits for
acknowledgement from a specified number of
replica nodes (determined by the write consistency
level) before confirming the write to the client. For
example, a consistency level of QUORUM requires
acknowledgements from a majority of replicas.
6. MemTable Flush: When a MemTable reaches a
certain size or age, it is flushed to disk as an
immutable file called an SSTable (Sorted String
Table). This is a sequential write.
7. Compaction: SSTables are periodically merged
together through a process called compaction to
combine data, remove tombstone markers (for
deleted data), and improve read performance.

Writes in Cassandra are very fast because they are
primarily sequential appends to the commit log and in-
memory MemTables. Updates and Deletes are treated
as new writes with a later timestamp; the latest
timestamp wins during reads.

Read Operation Process:

1. Client Request: A client sends a read request to
any coordinator node.
2. Coordinator Routing: The coordinator determines
which nodes are replicas for the requested data
based on the row key and partitioning strategy.
3. Query Replicas: The coordinator sends read
requests to one or more replicas based on the
chosen read consistency level.
4. Data Retrieval from Multiple Sources: Each
queried replica node checks its MemTable and
potentially multiple SSTables on disk to find the
relevant data.
5. Merge and Return: Each replica returns the data it
finds (potentially different versions or parts of the
data due to eventual consistency). The coordinator
merges the results from all queried replicas, using
timestamps to reconcile conflicting versions and
determine the most recent data.
6. Consistency Check (Optional): If a consistency level
higher than ONE is used, the coordinator performs
a consistency check on the data received from
replicas. If discrepancies are found (and the
consistency level requires it), a read repair
mechanism might be triggered to update stale
replicas in the background.
7. Return Result: The coordinator returns the final,
consistent (based on the chosen level) result to the
client.

Reads can involve checking multiple SSTables and the
MemTable, which can sometimes make them slower
than writes, especially if many SSTables need to be
examined. Compaction is vital for keeping the number
of SSTables manageable and reads fast.

Update Operation Process:

In Cassandra, there is no explicit "update" operation in
the sense of modifying data in place. Both inserts and
updates are handled as "upserts" (insert or update if
exists). When you "update" a row or cell, Cassandra
writes a *new* value with a *new, later timestamp* to
the commit log and MemTable. The old value is not
immediately overwritten on disk. During reads,
Cassandra returns the value with the latest timestamp
for a given cell. The old values are eventually cleaned up
during the compaction process.
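
The sketch below uses the Python cassandra-driver (the keyspace, table, and
column names are hypothetical) to issue an upsert at QUORUM and a read at ONE,
illustrating how consistency is tuned per request and how an "update" is
simply a newer-timestamped write of the same cells.

# Minimal sketch (Python cassandra-driver, hypothetical keyspace/table):
# an upsert at QUORUM and a read at ONE.
from datetime import datetime
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("iot")  # any node can coordinate

write = SimpleStatement(
    "INSERT INTO readings (device_id, ts, temp) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)   # wait for a majority of replicas
session.execute(write, ("dev-1", datetime(2024, 1, 1, 12, 0), 71.5))

# Running the same INSERT again later simply writes newer-timestamped cells;
# the latest timestamp wins at read time (upsert behaviour).

read = SimpleStatement(
    "SELECT temp FROM readings WHERE device_id = %s",
    consistency_level=ConsistencyLevel.ONE)      # fastest, may be slightly stale
for row in session.execute(read, ("dev-1",)):
    print(row.temp)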

HADOOP AND ITS MAIN COMPONENTS

Apache Hadoop is a collection of open-source software
utilities that facilitates using a network of computers to
solve data-intensive problems involving massive
amounts of data. It provides a software framework for
distributed storage and distributed processing of Big
Data. While HBase is built *on* Hadoop's storage layer
(HDFS), it provides real-time operational access,
complementing Hadoop's traditional strengths in batch
processing and distributed storage.

The main components of Hadoop (often referred to as
the Hadoop ecosystem) historically included:

1. HDFS (Hadoop Distributed File System): The
primary distributed storage layer. It breaks large
files into blocks and distributes them across nodes
in a cluster, replicating blocks for fault tolerance.
It's designed for high-throughput access to large
datasets and streaming reads, not low-latency
random access. (Discussed in more detail below).
2. YARN (Yet Another Resource Negotiator): The
resource management and job scheduling layer. It
is responsible for allocating system resources (CPU,
memory, network) to various applications running
on the Hadoop cluster and scheduling tasks. YARN
allows different processing frameworks (like
MapReduce, Spark, Tez) to run on the same
Hadoop cluster.
3. MapReduce: The original processing framework
for Hadoop. It provides a programming model for
parallel processing of large datasets across a
cluster using Map and Reduce functions. While still
used, it has often been superseded by faster
engines like Spark for many workloads. (Discussed
conceptually in Unit 2).

Beyond these core components, the Hadoop ecosystem
includes many related projects like Hive (data
warehousing/SQL-on-Hadoop), Pig (high-level scripting),
ZooKeeper (coordination), Spark (fast processing), etc.

Figure 4.5: Hadoop Ecosystem (Simplified Core Components)

+---------------------------------------------------+
|                 Hadoop Ecosystem                  |
|---------------------------------------------------|
|               Client / Applications               |
+---------------------------------------------------+
              | Job Submission, Resource Requests
              v
+---------------------------------------------------+
|                       YARN                        |
|        (Resource Management & Scheduling)         |
|---------------------------------------------------|
|  ResourceManager   |  NodeManagers (on each node) |
+---------------------------------------------------+
     | Submit/Schedule Jobs    | Launch Containers/Tasks
     v                         v
+----------------+  +----------------+  +----------------+
|   MapReduce    |  |     Spark      |  |   Other Apps   | ...
|  (Processing   |  |  (Processing   |  |                |
|    Engine)     |  |    Engine)     |  |                |
+----------------+  +----------------+  +----------------+
          \                 |                  /
           v                v                 v
+---------------------------------------------------+
|                       HDFS                        |
|      (Distributed Storage - Data & Metadata)      |
|---------------------------------------------------|
|  NameNode            |  DataNodes (on data        |
|  (Metadata,          |   storage nodes; store     |
|   file tree)         |   data blocks, handle      |
|                      |   reads/writes)            |
+---------------------------------------------------+

(Note: YARN manages resources. Processing engines like MapReduce or Spark run
jobs coordinated by YARN. HDFS provides the distributed storage for the data
being processed.)

SHORT NOTES ON HDFS AND BIG DATA ANALYSIS

1) HDFS (Hadoop Distributed File System):

HDFS is the distributed file system that forms
the backbone of Hadoop storage. It is
designed to reliably store very large files
(terabytes to petabytes) across clusters of
commodity servers and provide high-
throughput access. Key characteristics
include:

• Distributed Storage: Files are broken
into large blocks (typically 128MB or
256MB) and distributed across multiple
nodes in the cluster.
• Replication: Each data block is replicated
across several nodes (default 3 copies)
to ensure fault tolerance. If a DataNode
fails, the data is still available from
replicas, and new replicas are created to
restore the desired replication level.
• High Throughput: Optimized for batch
processing and streaming reads of large
files. Not designed for low-latency
random reads or writes (this is where
HBase complements HDFS).
• Write Once, Read Many: Data files are
typically written once in their entirety
and then read multiple times. Updates
to existing files are not supported
efficiently; modifications usually involve
writing a new version of the file.
• Scalability: Can scale horizontally by
adding more DataNodes.
• Master-Slave Architecture: Consists of a
single NameNode (master) that
manages the file system metadata
(directory tree, file permissions, block
locations) and multiple DataNodes
(slaves) that store the actual data blocks
and handle read/write requests from
clients. The NameNode is a critical
component, though high-availability
configurations exist.

HDFS is the reliable storage layer that enables
other Hadoop components and systems like
HBase to process and access massive
datasets.
2) Big Data Analysis:

Big Data analysis refers to the process of
examining large and complex datasets
(characterized by Volume, Velocity, Variety,
and Veracity - the "4 V's") to uncover hidden
patterns, correlations, market trends,
customer preferences, and other useful
information. Traditional data analysis tools
and techniques are often inadequate for Big
Data.

NoSQL databases and the Hadoop ecosystem
play a crucial role in Big Data analysis
because they provide the necessary
infrastructure to:

• Store Massive Volumes: Systems like
HDFS and NoSQL databases (Cassandra,
HBase) can store petabytes of data.
• Handle High Velocity: NoSQL databases
are designed for high-speed data
ingestion from real-time sources.
• Process Diverse Variety: NoSQL's flexible
schemas can accommodate structured,
semi-structured, and unstructured data.
• Enable Distributed Processing:
Frameworks like MapReduce and Spark,
often running on data stored in HDFS or
accessible via NoSQL, process data in
parallel across clusters, enabling
analyses that would be impossible on a
single machine.
• Support Various Analytical Workloads:
From batch processing (MapReduce,
Spark on HDFS/HBase) to real-time
lookups (HBase, Cassandra) and stream
processing (Spark Streaming, Flink), the
ecosystem supports diverse analytical
needs on Big Data.
Big Data analysis is critical for businesses to
gain insights from their increasingly large and
complex data sources, enabling data-driven
decision-making, personalized experiences,
fraud detection, predictive maintenance, and
more.

DIFFERENTIATION BETWEEN BIG DATA AND APACHE HADOOP

It's important to distinguish between the
concept of "Big Data" and the technology
suite known as "Apache Hadoop".

• Big Data: This refers to datasets that are
so large and complex that traditional
data processing applications are
inadequate. The characteristics of Big
Data are often described by the "4 V's":
◦ Volume: The sheer amount of data.
◦ Velocity: The speed at which data
is generated and needs to be
processed.
◦ Variety: The diverse forms of data
(structured, semi-structured,
unstructured).
◦ Veracity: The quality and accuracy
of the data.
Big Data is the *problem* or the
*phenomenon* of dealing with
increasingly large, fast-moving, and
diverse datasets.
• Apache Hadoop: This is a *specific
open-source framework* designed to
*solve* the problems associated with
processing and storing Big Data. It
provides tools and components (like
HDFS, YARN, MapReduce) for distributed
storage and processing across clusters
of computers.
In essence:

• Big Data is the challenge.


• Hadoop is one of the prominent
solutions (a set of tools and
technologies) used to address the Big
Data challenge.

While Hadoop was one of the earliest and most influential frameworks for Big Data, the field of Big Data processing now includes many other technologies and approaches (e.g., Apache Spark, various NoSQL databases, stream processing systems, cloud-based Big Data services) that can be used independently or alongside Hadoop.

UNIT 5: KEY-VALUE DATABASES USING RIAK


WHAT IS RIAK? FEATURES OF RIAK

Riak is a distributed NoSQL key-value store designed to provide highly available, fault-tolerant, and scalable storage for large-scale applications. Initially developed by Basho Technologies, Riak is inspired by Amazon's Dynamo and emphasizes a masterless, peer-to-peer architecture for continuous availability and operational simplicity.

Key Features of Riak:

• Masterless Architecture: Every node is equal; there is no single point of failure. Data is automatically partitioned and replicated across multiple nodes.
• Consistent Hashing: Data keys are distributed uniformly across the
cluster using consistent hashing, enabling seamless scaling and
rebalancing.
• High Availability & Fault Tolerance: Replication ensures data durability
and system availability even if multiple nodes fail.
• Eventual Consistency: Riak favors availability over strong consistency
but provides tunable consistency and conflict resolution techniques
(e.g., vector clocks, last-write-wins).
• Simple Key-Value Interface: Supports the basic operations PUT, GET, and DELETE, with values treated as opaque blobs (see the client sketch after this list).
• Scalability: Linear horizontal scalability by adding nodes without
downtime or manual sharding configuration.
• Secondary Indexes & Search: Supports secondary indexes for querying
and integration with full-text search engines like Solr.
• Multi-Datacenter Replication: Supports active-active replication between
geographically distributed clusters.
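
The following minimal Python sketch exercises this key-value interface. It assumes the Basho riak client package and a node listening on the default Protocol Buffers port 8087; bucket and key names are illustrative.

# Minimal sketch: basic PUT / GET / DELETE against Riak.
# Assumes: pip install riak; a Riak node on 127.0.0.1:8087 (illustrative).
import riak

client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
bucket = client.bucket('users')

# PUT: store a value (serialized as JSON by default) under a key.
obj = bucket.new('user123', data={'name': 'Alice', 'plan': 'basic'})
obj.store()

# GET: fetch by key; Riak returns the value plus its causal context.
fetched = bucket.get('user123')
print(fetched.data)

# DELETE: remove the key from the cluster.
bucket.delete('user123')
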

FEATURES OF KEY-VALUE STORES RELEVANT TO RIAK

Riak belongs to the key-value store family. General features common to key-
value datastores include:

• Simplicity: Data stored as key-value pairs; the database treats the value
as an opaque byte array or blob.
• Fast Lookups: Operations are optimized for rapid retrieval by key,
enabling low latency.
• High Write Throughput: Suitable for workloads requiring high volumes
of write operations.
• No Complex Querying: Limited querying capabilities other than by key
or secondary indexes if supported.
• Distribution & Partitioning: Employs partitioning (often via consistent
hashing) to distribute data evenly across nodes and facilitate scaling.
• Replication: Data is replicated to multiple nodes to ensure durability and
availability.
• Eventual Consistency: Many key-value stores, including Riak, adopt
eventual consistency to maximize availability and partition tolerance.
• Flexible Data Model: Values can contain any type of data – serialized
objects, JSON, binary files, etc.

USE CASES OF RIAK

Riak’s design suits various real-world applications where high availability, fault
tolerance, and horizontal scalability are paramount. Some common use cases
include:

• Session Management: Storing user sessions for web applications to handle millions of concurrent users with low latency and fault tolerance.
• Shopping Cart Data: Keeping shopping cart states in e-commerce sites,
ensuring data persistence and availability even in case of node failures.
• Social Media and Messaging: Handling large volumes of user-generated
content such as messages, statuses, and notifications.
• Real-Time Analytics & Metrics: Collecting and storing metrics and logs
that require rapid ingestion and availability.
• Content Management: Managing metadata for digital content
repositories where high read/write throughput is needed.
• Distributed Caching Layer: Acting as a highly available cache layer to
offload relational databases.
• Mobile Backend Storage: Providing a scalable and resilient backend for
mobile applications.

ARCHITECTURE OF RIAK AND CLUSTER WORKING MECHANISM

Riak's architecture is based on a fully decentralized, masterless, peer-to-peer clustered system that makes it extremely resilient and scalable. Below is an explanation with a diagram.

Main Architectural Components:

• Nodes: Individual Riak servers/nodes forming a homogeneous cluster.


• Ring/Consistent Hashing: Riak organizes nodes into a ring using
consistent hashing to evenly distribute keys. Each key falls into a
particular segment of the ring.
• Partitions: The ring is logically partitioned into smaller segments (virtual
nodes or vnodes). Each vnode is assigned to a physical node.
• Replication: Each key’s data is replicated to a number of nodes
(replication factor, typically 3) across distinct vnodes to ensure
availability.
• Gossip Protocol: Nodes communicate cluster membership, node health,
and data states via gossip, promoting self-healing and consistency.
• Vector Clocks and Conflict Resolution: Used for detecting concurrent
writes and resolving conflicts with either last-write-wins or application-
level reconciliation.

Figure 5.1: Riak Distributed Cluster Architecture and Data Distribution Ring

+---------------------------+        +---------------------------+
|          Node 1           |        |          Node 2           |
|  +---------------------+  |        |  +---------------------+  |
|  | VNode A             |  |        |  | VNode B             |  |
|  | responsible for K1  |  |        |  | responsible for K2  |  |
|  +---------------------+  |        |  +---------------------+  |
+-------------|-------------+        +-------------|-------------+
              |                                    |
              +---------------+    +---------------+
                              |    |
                       +----------------+
                       |   Riak Ring    |
                       |  (Consistent   |
                       |    Hashing)    |
                       +----------------+
                              |    |
              +---------------+    +---------------+
              |                                    |
+-------------|-------------+        +-------------|-------------+
|          Node 3           |        |          Node 4           |
|  +---------------------+  |        |  +---------------------+  |
|  | VNode C             |  |        |  | VNode D             |  |
|  | responsible for K3  |  |        |  | responsible for K4  |  |
|  +---------------------+  |        |  +---------------------+  |
+---------------------------+        +---------------------------+

Data for a key is stored on multiple nodes (replication factor = 3).
Nodes communicate and replicate data and cluster state via gossip.

Explanation: The Riak ring partitions the entire keyspace across all nodes
using consistent hashing. Each node manages multiple virtual nodes (vnodes)
representing partitions. When a key-value pair is stored, Riak identifies the
vnode responsible based on hashing the key. The data is then replicated to
multiple nodes responsible for the next vnodes on the ring to provide
redundancy.
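
The ring mechanics can be illustrated with a short, self-contained Python sketch. The hash function, partition count, and replication factor below are illustrative simplifications rather than Riak's actual implementation (Riak defaults to 64 partitions and prefers to place replicas on distinct physical nodes).

# Minimal sketch of consistent hashing with virtual nodes (vnodes),
# in the spirit of the Riak ring. All parameters are illustrative.
import hashlib
from bisect import bisect_right

PARTITIONS = 8                      # Riak's default ring size is 64
N_VAL = 3                           # replication factor
NODES = ["node1", "node2", "node3", "node4"]

# Divide the SHA-1 keyspace into equal partitions and assign vnodes
# to physical nodes round-robin.
STEP = 2**160 // PARTITIONS
ring = [(i * STEP, NODES[i % len(NODES)]) for i in range(PARTITIONS)]
starts = [start for start, _ in ring]

def key_hash(key):
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

def preference_list(key, n=N_VAL):
    """Return the n consecutive vnodes (with their owning nodes) for a key."""
    idx = bisect_right(starts, key_hash(key)) % PARTITIONS
    return [ring[(idx + i) % PARTITIONS] for i in range(n)]

for k in ["session:abc", "cart:42"]:
    print(k, "->", [node for _, node in preference_list(k)])
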

STORING SESSION INFORMATION IN KEY-VALUE DATASTORES USING RIAK

Session management is critical for applications requiring consistent user experience, such as web applications. Riak excels at storing session information due to its availability and fault tolerance.

How Session Data is Stored in Riak:

• Key: A unique session identifier (e.g., session ID generated at login).


• Value: Serialized session data, often encoded as JSON, containing user
preferences, authentication tokens, shopping cart contents, and other
temporary data.
• Operations: The application puts the session data into Riak with the
session ID as the key. Subsequent GET operations retrieve session
details.
• Replication: Session data is replicated across multiple nodes, enabling
availability even if one or more nodes fail.
• TTL (Time To Live): Expiration can be configured (for example, via the Bitcask storage backend's TTL setting) so that stale sessions are cleared automatically.

This approach ensures that session information is highly available, scales seamlessly with user load, and survives node failures without loss of session state.

Figure 5.2: Riak Session Management Storage

Client Request: Store User Session
---------------------------------
PUT /riak/sessions/{session_id}
Body: {
"user_id": "user123",
"preferences": {...},
"shopping_cart": [...],
"last_active": "2024-06-01T12:30:00Z"
}

Riak Cluster:
+--------+ +--------+ +--------+
| Node 1 | | Node 2 | | Node 3 |
+--------+ +--------+ +--------+
| | |
| Replicates session data across nodes
for fault tolerance
+--------------------------------------->
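
A minimal Python version of the flow in Figure 5.2 might look as follows, assuming the Basho riak client package and a local node on the default Protocol Buffers port; the session fields and identifiers are illustrative.

# Minimal sketch: storing and fetching a web session in Riak.
# Assumes: pip install riak; a node on 127.0.0.1:8087; illustrative data.
import riak

client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
sessions = client.bucket('sessions')

session_id = 'a1b2c3d4'                      # hypothetical session ID
session = sessions.new(session_id, data={
    'user_id': 'user123',
    'preferences': {'theme': 'dark'},
    'shopping_cart': [],
    'last_active': '2024-06-01T12:30:00Z',
})
session.store()                              # replicated across N nodes (default 3)

# On a later request, the application looks the session up by its key.
fetched = sessions.get(session_id)
print(fetched.data['user_id'])
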

MULTI-OPERATION TRANSACTIONS IN KEY-VALUE DATASTORES

Definition of Multi-operation Transactions

A multi-operation transaction involves executing multiple read/write operations as a single atomic unit, ensuring that either all operations are successfully applied, or none are (atomicity). Traditional relational databases support ACID transactions natively, but many key-value stores, including Riak, provide limited or no support for multi-operation, multi-key transactions.

Riak offers limited transaction capabilities, mostly focusing on single-key operations for atomicity and eventual consistency across replicas.

Performing Multi-operation Transactions in Riak

Since Riak is a distributed, eventually consistent system with a key-value model, it does not support traditional ACID transactions spanning multiple keys or objects. Instead, developers achieve transaction-like behavior using these methods:

• Application-Level Transactions: The application coordinates multiple operations using compensation logic or rollback mechanisms to handle failures.
• Conditional Writes and Comparisons: Riak supports conditional updates
per object using mechanisms like vector clocks or causal contexts to
detect concurrent modifications.
• Conflict Resolution: Riak’s vector clocks help reconcile concurrent
conflicting writes automatically or via application logic.
• Batching via Middleware: Client libraries or middleware layers can batch
multiple operations, but cannot guarantee atomicity across different
keys in a distributed cluster.

Thus, multi-operation transactions in Riak are limited to single-key atomicity and eventual consistency for multi-key operations. Applications needing strict multi-key transactional guarantees require additional coordination mechanisms like distributed locks or external transaction managers (a sketch of the compensation approach follows Figure 5.3).

Diagram: Multi-Operation Transaction Handling in Riak

Figure 5.3: Conceptual Multi-Operation Handling in Riak (Non-Atomic Across Keys)

Application initiates multi-key transaction:
--------------------------------------------
1. PUT Key1 => Value1   ----------------->   (atomic single-key write)
2. PUT Key2 => Value2   ----------------->   (atomic single-key write)
3. GET Key3             <-----------------   (read operation)
4. Application validates data consistency.
5. On failure, the application issues compensating operations to
   revert partial changes.

Note: Each operation is atomic per key, but multi-key atomicity is
not guaranteed.
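
A hedged sketch of the application-level compensation pattern shown above is given below. It uses the Basho riak Python client, treats each single-key store as atomic, and reverts already-applied writes when a later operation fails; it does not provide isolation, and all bucket and key names are illustrative.

# Minimal sketch: application-level compensation around multiple Riak writes.
# Each PUT is atomic for its own key; atomicity across keys is approximated
# by reverting applied writes when a later operation fails.
import riak

client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
bucket = client.bucket('orders')

def multi_put_with_compensation(writes):
    """writes: list of (key, new_value) pairs applied in order."""
    applied = []                                   # (key, previous value or None)
    try:
        for key, value in writes:
            previous = bucket.get(key)
            applied.append((key, previous.data if previous.exists else None))
            obj = previous if previous.exists else bucket.new(key)
            obj.data = value
            obj.store()                            # atomic single-key write
    except Exception:
        # Compensation: best-effort restore of the earlier values.
        for key, old in reversed(applied):
            if old is None:
                bucket.delete(key)
            else:
                restore = bucket.get(key)
                restore.data = old
                restore.store()
        raise

multi_put_with_compensation([
    ('order:1001', {'status': 'placed'}),
    ('stock:sku42', {'reserved': 1}),
])
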

SHORT NOTES ON TRANSACTIONS IN DISTRIBUTED KEY-VALUE STORES

In distributed key-value stores like Riak, transactions exhibit key differences compared to RDBMS:

• Limited Atomicity: Atomic operations are generally limited to single keys.
• Eventual Consistency: Updates propagate asynchronously, leading to
temporary divergent states across replicas.
• Conflict Detection and Resolution: Mechanisms like vector clocks are
used to detect conflicting versions; conflicts may be resolved
automatically or require application intervention.
• No Native Multi-Key Transactions: There is no built-in support for
atomicity across multiple keys or operations.
• Application Coordination Required: Complex transactional logic, if
needed, must be implemented at the application level or with external
coordination protocols.

SHOPPING CART DATA IN NOSQL KEY-VALUE DATABASES USING RIAK

The shopping cart is a classic example illustrating how session state or user-
specific transient data can be stored in a NoSQL key-value store like Riak.

Approach:

• Key: User-specific cart ID or session ID.


• Value: Serialized cart contents—list of product IDs, quantities, prices,
and metadata stored as a JSON document or binary blob.
• Operations: The application performs PUT requests to update the cart
on each item addition/removal and GET requests to retrieve cart
contents on user interactions.
• Concurrency: Vector clocks track modifications to prevent inconsistent
updates from concurrent sessions or devices.
• Replication: Cart data is replicated for durability and availability.
• Eventual Consistency: Updates are eventually consistent across replicas;
applications handle temporary discrepancies gracefully.

Using Riak for shopping cart data enables a scalable and resilient e-
commerce backend that maintains user state even during failover or network
partitions.
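
A minimal read-modify-write sketch for cart updates is shown below, assuming the Basho riak Python client; sibling merging and richer conflict resolution (discussed above) are omitted for brevity, and all identifiers are illustrative.

# Minimal sketch: get-modify-put of a shopping cart in Riak.
# The causal context fetched with GET travels with the subsequent PUT,
# letting Riak detect concurrent updates. Conflict merging is omitted here.
import riak

client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
carts = client.bucket('carts')

def add_item(cart_id, product_id, quantity):
    obj = carts.get(cart_id)
    if obj.exists:
        cart = obj.data
    else:
        obj = carts.new(cart_id, data={'items': []})
        cart = obj.data
    cart.setdefault('items', []).append(
        {'product_id': product_id, 'qty': quantity})
    obj.data = cart
    obj.store()
    return cart

print(add_item('cart:user123', 'sku-42', 2))
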

RELEVANT CASE STUDIES OF NOSQL KEY-VALUE STORE (RIAK)

Riak has been adopted by various organizations to meet demanding availability and scalability requirements. Key case studies include:

• Best Buy: Uses Riak to handle session stores and shopping cart data,
supporting millions of concurrent users with high availability and fault
tolerance.
• Walmart Labs: Adopted Riak for managing large-scale session data,
leveraging its high write availability and simplicity for distributed
environments.
• Mobile & IoT Applications: Multiple IoT platforms utilize Riak for storing
device states and event streams due to its resilient distributed design
and ability to scale.
• Gaming Companies: Employ Riak for storing user profiles, game states,
and leaderboards to maintain real-time responsiveness under heavy
load with fault tolerance.

Each use case benefits from Riak’s eventual consistency model, masterless
architecture, tunable consistency options, and simple key-value interface
suited to storing session-oriented, user-centric, or high-throughput data.

EXPLICIT DEFINITION OF MULTI-OPERATION TRANSACTION IN RIAK CONTEXT

In the context of Riak and similar distributed key-value stores, a multi-operation transaction can be defined as:

A logical grouping of multiple, potentially dependent key-value operations (such as puts, gets, or deletes) intended to occur as a single atomic unit, such that either all operations succeed together or none do, while preserving data consistency and isolation across the involved keys. Due to the distributed, eventually consistent nature of Riak, true multi-key atomic transactions are not natively supported. Instead, applications must implement compensating mechanisms or external coordination to approximate transactional behavior.

This definition acknowledges that Riak guarantees atomicity and consistency per individual key but does not provide native ACID transactions spanning multiple keys.
