NoSQL Unit 3
1. Consistency
2. Transactions
3. Availability
4. Query Features
5. Scaling
1. Consistency in MongoDB:
i) Single-Document Operations and Consistency
• Strong Consistency for Single-Document Operations:
MongoDB ensures strong consistency for operations
on individual documents when reads go to the primary.
Once an insert, update, or delete on a single document
is acknowledged, any subsequent read of that document
from the primary reflects the new state. Reads served
by secondaries can lag behind.
• Atomicity on Single Documents: MongoDB operations
on a single document are atomic, meaning that all
changes to a document will either be fully applied or
not applied at all, ensuring consistency within that
document.
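A minimal sketch of how single-document atomicity is typically used, assuming the official MongoDB Node.js driver. The accounts parameter is assumed to be a driver Collection; the balance field and the withdraw helper are hypothetical names chosen for illustration. Because the filter and the $inc modifier are applied to one document in a single atomic step, another writer cannot slip in between the balance check and the deduction.

```javascript
// `accounts` is assumed to be a Collection from the official Node.js driver.
// The `balance` field and this helper are illustrative, not from the notes.
async function withdraw(accounts, accountId, amount) {
  // Filter and update are applied to one document atomically:
  // the deduction happens only if the balance is still sufficient.
  const result = await accounts.updateOne(
    { _id: accountId, balance: { $gte: amount } },
    { $inc: { balance: -amount } }
  );
  // false means no document matched, so nothing was changed.
  return result.modifiedCount === 1;
}
```

A caller can treat a false return value as "insufficient funds or unknown account" without needing a multi-document transaction.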
ii) Multi-Document Transactions
• ACID Transactions: MongoDB supports
multi-document ACID (Atomicity, Consistency,
Isolation, Durability) transactions. This allows
developers to perform multiple operations across
different documents or collections within a single
transaction, ensuring that all operations succeed
or fail together. This feature ensures strong
consistency across multiple documents or
collections.
• Use Cases: Multi-document transactions are
useful in scenarios where related data is spread
across multiple documents or collections, and
consistency across these elements is crucial, such
as in financial applications or complex workflows.
iii) Consistency in Distributed Systems
• Replica Sets: MongoDB achieves high availability and data
redundancy through replica sets. A replica set is a group of
MongoDB servers that maintain the same data set, with
one primary node and multiple secondary nodes.
– Primary Node: The primary node handles all write
operations and records them in its operation log (oplog),
which the secondary nodes replicate.
– Secondary Nodes: Secondary nodes replicate data from
the primary and can be used to serve read operations.
• Read Concern: MongoDB allows you to control the
consistency level of read operations through the
"read concern" setting. Different levels of read
concern provide different consistency guarantees:
– "local": Returns the most recent data available on
the node that receives the read operation. The data
may not yet be replicated to a majority, so it can be
rolled back, and on a secondary it may lag the primary.
– "majority": Returns only data that has been
acknowledged by a majority of the nodes in the
replica set, so the data that is read cannot be
rolled back.
– "linearizable": Returns data that reflects all
majority-acknowledged writes that completed before
the read began. It is available only on the primary
node and offers the highest consistency level.
• Write Concern: MongoDB also allows control over the
durability of write operations through "write concern"
settings:
– w: 1 (acknowledged): The write is acknowledged once the
primary node has applied it.
– w: "majority": The write is acknowledged once a majority
of the voting nodes in the replica set have applied it, so
the write survives the failure of the primary.
– j: true (journaled): The write is acknowledged only after
it has been written to the on-disk journal, so it survives
a restart of the node. It can be combined with either w
setting.
• Read Preference: MongoDB allows you to specify a read
preference that determines which node (primary or
secondary) serves a read:
– primary: Reads from the primary node, ensuring
the most up-to-date data. This is the default.
– primaryPreferred: Reads from the primary if
available, but can fall back to secondaries.
– secondary: Reads from secondary nodes, which
may not have the most recent data.
– secondaryPreferred: Reads from secondaries if
available, but can fall back to the primary.
– nearest: Reads from the node with the lowest
network latency, regardless of whether it is
primary or secondary.
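One common way to apply these three settings together is through connection-string options. The sketch below assembles such a string; replicaSet, readConcernLevel, w, journal, and readPreference are standard MongoDB connection-string options, while the host names, the replica set name rs0, and the buildMongoUri helper are placeholders made up for this example.

```javascript
// Builds a MongoDB connection string that pins read concern, write concern,
// and read preference. Host names and the replica set name are placeholders.
function buildMongoUri({ hosts, replicaSet, readConcernLevel, w, journal, readPreference }) {
  const params = new URLSearchParams({
    replicaSet,
    readConcernLevel,          // "local", "majority", "linearizable", ...
    w: String(w),              // 1, a node count, or "majority"
    journal: String(journal),  // true: wait for the on-disk journal
    readPreference,            // primary, primaryPreferred, secondary, ...
  });
  return `mongodb://${hosts.join(",")}/?${params.toString()}`;
}

const uri = buildMongoUri({
  hosts: ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"],
  replicaSet: "rs0",
  readConcernLevel: "majority",
  w: "majority",
  journal: true,
  readPreference: "primary",
});
console.log(uri);
```

Settings given this way act as defaults for the whole client; the same options can usually be overridden per database, collection, or operation when only some reads need the stronger guarantee.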
iv) Sharding and Consistency
• Sharding: MongoDB supports sharding, where
data is distributed across multiple shards to
handle large datasets and high throughput. In a
sharded cluster, the MongoDB router (mongos)
directs queries to the appropriate shards.
• Consistency Across Shards: MongoDB ensures
that operations within a shard maintain strong
consistency. However, in a distributed
environment, consistency across shards is
managed by the combination of the sharding key,
routing, and read/write concerns.
v) Eventual Consistency and Staleness
• Eventual Consistency: In scenarios where MongoDB is deployed
across multiple data centers or with a write concern that does not
require a majority acknowledgment, there may be a delay before
changes made on the primary node are reflected on secondary
nodes. This can result in eventual consistency, where secondary
nodes eventually converge to the same state as the primary, but
there may be temporary discrepancies.
• Stale Reads: Reading from a secondary node with a "secondary"
or "nearest" read preference might return stale data that hasn’t
yet been updated with the latest changes from the primary node.
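The staleness window described above can be illustrated with a toy model. This is not MongoDB code: the Primary and Secondary classes are invented for this sketch and only mimic the idea that a secondary applies the primary's log some time after the primary has acknowledged the write.

```javascript
// Toy model of asynchronous replication (not MongoDB code): the primary
// appends writes to a log, and a secondary applies them only when it syncs.
class Primary {
  constructor() {
    this.log = [];
    this.data = {};
  }
  write(key, value) {
    this.log.push({ key, value });
    this.data[key] = value;
  }
}

class Secondary {
  constructor(primary) {
    this.primary = primary;
    this.applied = 0; // number of log entries already applied
    this.data = {};
  }
  // Apply every log entry not yet seen; until this runs, reads are stale.
  sync() {
    for (const { key, value } of this.primary.log.slice(this.applied)) {
      this.data[key] = value;
    }
    this.applied = this.primary.log.length;
  }
  read(key) {
    return this.data[key];
  }
}

const primary = new Primary();
const secondary = new Secondary(primary);

primary.write("stock", 10);
secondary.sync();
primary.write("stock", 9);                 // acknowledged by the primary only

const staleRead = secondary.read("stock"); // still the old value
secondary.sync();
const freshRead = secondary.read("stock"); // replicas have converged
```

Between the second write and the second sync, the secondary answers with the old value; after the sync both nodes agree, which is the "eventual" part of eventual consistency.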
vi) Consistency Trade-offs in CAP Theorem
• CAP Theorem: MongoDB, like other distributed systems, must
balance the trade-offs between Consistency, Availability, and
Partition Tolerance (CAP theorem). Depending on the
configuration (e.g., read/write concern, sharding), MongoDB can
be tuned towards stronger consistency or higher availability, but it
cannot guarantee both simultaneously in the presence of network
partitions.
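A small worked example of the trade-off, assuming a replica set in which every node has one vote. A side of a network partition can elect or keep a primary only if it can reach a majority of the voting nodes, so the minority side gives up write availability in order to stay consistent. The function names below are hypothetical.

```javascript
// Number of votes needed for a majority of `votingNodes`.
function majority(votingNodes) {
  return Math.floor(votingNodes / 2) + 1;
}

// A partition side can elect (or keep) a primary only if it can reach a
// majority of the voting nodes; otherwise it stops accepting writes.
function canHavePrimary(votingNodes, reachableNodes) {
  return reachableNodes >= majority(votingNodes);
}

// A 5-node replica set split 3 / 2 by a network partition:
const largerSide = canHavePrimary(5, 3);  // this side stays writable
const smallerSide = canHavePrimary(5, 2); // this side cannot accept writes
```

The same arithmetic explains why replica sets usually have an odd number of voting nodes: a 4-node set needs 3 votes, so it tolerates no more failures than a 3-node set.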
2. Transactions
i. ACID Properties:
• Atomicity: Ensures that the operations within a transaction are
applied as a single unit: either all of them take effect or none
do. If any operation fails, the entire transaction is rolled back,
leaving the database in its previous state.
• Consistency: Guarantees that the database remains in a
consistent state before and after the transaction. Any rules or
constraints defined by the database schema or application
logic are enforced throughout the transaction.
• Isolation: Ensures that transactions are executed
independently of each other, preventing concurrent
transactions from interfering with each other. This means the
results of a transaction are not visible to other transactions
until the transaction is completed.
• Durability: Ensures that once a transaction is committed, the
changes are permanent, even in the event of a system crash.
ii. Single-Document Transactions
• Atomic Operations: In many document databases,
operations on a single document are inherently
atomic. This means that when you modify a
document (e.g., update, insert, delete), the
operation is completed fully or not at all.
• Single Document vs. Relational Transactions: In a
traditional relational database, even simple
updates may require multiple operations across
different tables. In a document database, similar
changes can often be done within a single
document, reducing the need for complex
transactions.
iii. Multi-Document Transactions
• ACID Transactions Across Multiple Documents: Some
document databases, such as MongoDB (starting from
version 4.0), provide support for ACID transactions
that span multiple documents and collections. This
allows for complex operations that require multiple
documents to be updated together, ensuring that all
documents remain consistent.
• Use Cases: Multi-document transactions are essential
in scenarios where multiple related documents must
be kept in sync. For example, in an e-commerce
application, a transaction might involve updating the
order document, the inventory document, and the
customer’s account balance document as a single unit.
iv. Isolation Levels
• Read Uncommitted: Allows a transaction to read data that has
been modified by other transactions but not yet committed. This
can lead to dirty reads, where the data might be rolled back later.
• Read Committed: Ensures that a transaction only reads data that
has been committed by other transactions, avoiding dirty reads
but still allowing non-repeatable reads (where the same query
might return different results if run multiple times during a
transaction).
• Repeatable Read: Ensures that if a transaction reads a document,
subsequent reads of that document within the same transaction
will return the same data. This level prevents non-repeatable
reads but can still allow phantom reads (where new documents
that match the query criteria are inserted by another transaction).
• Serializable: The strictest isolation level, ensuring complete
isolation from other transactions. It prevents dirty reads,
non-repeatable reads, and phantom reads by making the outcome
equivalent to running the transactions one after another, even
if the database executes them concurrently.
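The difference between a non-repeatable read and a repeatable one can be shown with a toy versioned store. This is an illustrative model rather than how any particular database is implemented; the VersionedStore class and the price key are invented for the sketch.

```javascript
// Toy versioned store (not a real database) that keeps every committed
// version of a key, so two read strategies can be compared.
class VersionedStore {
  constructor() {
    this.versions = [];
    this.clock = 0;
  }
  commit(key, value) {
    this.clock += 1;
    this.versions.push({ key, value, ts: this.clock });
  }
  // Latest committed value of `key` with timestamp <= asOf.
  read(key, asOf = this.clock) {
    let found;
    for (const version of this.versions) {
      if (version.key === key && version.ts <= asOf) found = version.value;
    }
    return found;
  }
}

const store = new VersionedStore();
store.commit("price", 100);

const snapshotTs = store.clock;      // transaction T1 starts here
const first = store.read("price");   // T1 reads the price

store.commit("price", 120);          // T2 commits a concurrent update

// Read Committed: T1 sees the newest committed value, which has changed.
const readCommitted = store.read("price");
// Repeatable Read: T1 keeps reading as of its starting point.
const repeatable = store.read("price", snapshotTs);
```

Under the first strategy the same query inside T1 returns two different prices; under the second it returns the same price both times, at the cost of not seeing the newer commit.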
v. Implementation in Popular Document Databases
• MongoDB: MongoDB supports ACID transactions across
multiple documents and collections starting from version
4.0 for replica sets and 4.2 for sharded clusters. In the
drivers, a transaction runs inside a session created with
startSession, using either the withTransaction helper or
explicit startTransaction and commitTransaction calls.
• Couchbase: Couchbase also supports multi-document ACID
transactions, enabling complex, consistent operations
across multiple documents.
• Amazon DocumentDB: Amazon DocumentDB
(compatible with MongoDB) supports multi-document
transactions in its MongoDB 4.0-compatible and later
engine versions, but with its own limits and behavioral
differences, so it should not be assumed to match
native MongoDB in every case.
• CouchDB: Apache CouchDB traditionally does not support
multi-document transactions but ensures atomicity and
consistency at the single-document level.
vi. Handling Transaction Failures
• Rollback: If any operation within a transaction fails, the transaction
can be rolled back to undo all operations, restoring the database to
its previous state.
• Retries: In distributed systems, network issues or temporary
failures might require retrying a transaction. Many document
databases offer built-in support or best practices for handling
retries to ensure transactions are eventually completed.
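A minimal sketch of the retry idea, assuming the MongoDB Node.js driver, where errors that are safe to retry carry the "TransientTransactionError" label and expose a hasErrorLabel method. The runWithRetry helper, the attempt limit, and the backoff values are assumptions made for this example. The driver's own withTransaction helper already includes retry logic, so a wrapper like this is mainly relevant when transactions are started and committed explicitly.

```javascript
// Retries `txnFn` when the thrown error looks transient. The attempt limit
// and backoff values are arbitrary choices for this sketch.
async function runWithRetry(txnFn, { maxAttempts = 3, baseDelayMs = 50 } = {}) {
  for (let attempt = 1; ; attempt += 1) {
    try {
      return await txnFn();
    } catch (error) {
      const transient =
        typeof error?.hasErrorLabel === "function" &&
        error.hasErrorLabel("TransientTransactionError");
      // Give up on non-transient errors or when attempts are exhausted.
      if (!transient || attempt >= maxAttempts) throw error;
      // Exponential backoff before the next attempt.
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

The function passed in should start, run, and commit or abort the whole transaction, so that every retry begins from a clean state.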
vii. Performance Considerations
• Transaction Overhead: While transactions ensure data
consistency, they also introduce overhead, as the database
must maintain logs and manage potential rollbacks. This can
affect performance, particularly in high-throughput scenarios.
• Batching Operations: To minimize the performance impact,
developers often batch multiple operations into a single
transaction where possible, reducing the number of
round-trips to the database.
viii. Use Cases for Transactions
• Financial Applications: Ensuring that multiple
related operations, such as debiting one account
and crediting another, are completed together.
• Inventory Management: Ensuring that stock
levels, order documents, and customer data are
updated consistently when processing orders.
• User Management: Ensuring that changes to user
profiles, roles, and permissions are consistent
across different collections.
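Example: Transaction in MongoDB
• Here’s a sketch of a multi-document transaction using the Node.js driver. It assumes that client is a connected
MongoClient and that the code runs inside an async function. The database name, field names, and values are
illustrative placeholders.
    const session = client.startSession();
    const db = client.db('shop');
    session.startTransaction();
    try {
      // 1. Place the order.
      await db.collection('orders').insertOne(
        { userId: 1, item: 'widget', qty: 2, total: 40 },
        { session }
      );

      // 2. Reserve stock; the filter matches only if enough is available.
      const stock = await db.collection('inventory').updateOne(
        { item: 'widget', qty: { $gte: 2 } },
        { $inc: { qty: -2 } },
        { session }
      );
      if (stock.modifiedCount !== 1) throw new Error('insufficient inventory');

      // 3. Deduct the user's balance.
      await db.collection('accounts').updateOne(
        { userId: 1 },
        { $inc: { balance: -40 } },
        { session }
      );

      await session.commitTransaction();
    } catch (error) {
      // Undo every operation performed in this transaction.
      await session.abortTransaction();
      throw error;
    } finally {
      await session.endSession();
    }
In this example:
• The transaction ensures that either all operations (placing an order, updating inventory, and deducting the user’s balance)
succeed or none of them do.
• If any operation fails, or the inventory update matches no document because stock is insufficient, the thrown error
aborts the transaction and every change is rolled back, ensuring data consistency.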
3. Availability
In document databases, the availability of data
refers to the ability of the system to ensure
that data is accessible to users even in the
presence of failures, such as network issues,
hardware failures, or data center outages.
Document databases are designed with
features and architectures that enhance
availability, making them suitable for use cases
where uptime is critical.
• Factors Affecting Data Availability in Document
Databases:
• Replication:
– Description: Replication involves maintaining multiple
copies of data across different nodes or data centers. If one
node fails, another can serve the data, ensuring continued
availability.
– Types:
• Master-Slave Replication: One node (master) handles all write
operations, and other nodes (slaves) replicate the data and
handle read operations. This setup can lead to some downtime if
the master fails.
• Master-Master Replication: Multiple nodes can handle both read
and write operations. This setup increases availability because
any node can fail without disrupting the entire system.
– Examples: MongoDB supports replica sets, Couchbase
provides cross-datacenter replication (XDCR), and CouchDB
has multi-master replication.
• Sharding:
– Description: Sharding involves distributing data
across multiple servers, or shards, to improve
scalability and availability. Each shard contains a
portion of the data, and they work together to
handle queries.
– Impact on Availability: Sharding allows the system
to handle large datasets and high throughput. If a
shard fails, the other shards can keep serving their
own data, so the outage is partial; deploying each
shard as a replica set keeps its data available too.
– Examples: MongoDB uses sharding to distribute
data across clusters; ArangoDB also supports
sharding for distributing documents.
• Fault Tolerance:
– Description: Fault tolerance is the system’s ability
to continue operating despite failures. Document
databases achieve fault tolerance through data
redundancy, automated failover, and self-healing
mechanisms.
– Examples: In MongoDB, when a primary node
in a replica set fails, an automatic election
occurs to promote a secondary node to
primary. Couchbase provides automatic
failover to maintain service availability.
• Consistency Models:
– Eventual Consistency: Some document databases opt
for eventual consistency, where all replicas eventually
reflect the most recent write. This model improves
availability since reads can be served from any replica.
– Strong Consistency: In systems requiring strict data
accuracy, strong consistency ensures that a read
operation always returns the latest write. This model
might reduce availability during network partitions,
but some document databases offer configurable
consistency levels.
– Examples: MongoDB allows configuring the
consistency level with read concerns, write concerns,
and read preferences. Couchbase and Amazon
DocumentDB also offer adjustable consistency settings.
• Distributed Architecture:
– Description: Document databases often use a distributed
architecture, where data is spread across multiple nodes
and locations. This distribution enhances availability by
eliminating single points of failure.
– Examples: MongoDB and Couchbase both employ
distributed architectures that support high availability
through node distribution.
• Automatic Failover:
– Description: Automatic failover ensures that when a node
or a database instance fails, another node takes over
automatically, maintaining the availability of the data.
– Examples: MongoDB’s replica sets automatically elect a
new primary node in case of failure, Couchbase has
automatic failover for its nodes, and Amazon DocumentDB
provides managed failover as part of its AWS service.
• Backup and Restore:
– Description: Regular backups ensure that data can be
recovered in case of catastrophic failures. Many document
databases offer automated backup solutions that can be
restored to ensure data availability.
– Examples: Amazon DocumentDB provides automated
backups and point-in-time recovery, while MongoDB Atlas
offers continuous backups.
Considerations for High Availability:
• Network Latency: While replication across geographically distant
locations can improve availability, it can also introduce latency. The
architecture needs to balance these factors.
• Read/Write Preferences: Configuring how and where data reads
and writes occur can impact both availability and performance.
For instance, directing reads to replicas can improve availability
but may result in stale data.
• Data Center Redundancy: Deploying databases across multiple
data centers or availability zones ensures that even if one data
center fails, the others can continue to serve data.
• Examples of Availability in Popular Document
Databases:
• MongoDB: Uses replica sets for high availability,
with automated failover and support for
multi-region deployments. Sharding is available
for horizontal scaling.
• Couchbase: Offers cross-datacenter replication
(XDCR) for disaster recovery and high availability.
It also provides automatic failover and load
balancing.
• CouchDB: Focuses on distributed, multi-master
replication, making it resilient to network
partitions and node failures.
4. Query features
When querying in document databases, there are several
features that allow for efficient and complex data
retrieval.
i. Rich Query Language:
• Field-Based Queries: You can query documents based
on specific fields within the documents. This includes
querying for exact matches, ranges, and pattern
matches.
• Nested Document Queries: You can query nested
fields within documents, allowing for deep retrieval of
information stored in sub-documents.
• Array Queries: Support for querying elements within
arrays, including checking for the presence of certain
values or the size of the array.
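The three kinds of query above map onto MongoDB filter documents such as the following. The operators shown ($gte, $lt, $regex, $size) and dot notation are standard MongoDB query syntax; every field name and value is hypothetical.

```javascript
// Filter documents for collection.find(); all field names are hypothetical.
const filters = {
  // Field-based queries
  exactMatch: { status: "active" },
  range: { age: { $gte: 18, $lt: 65 } },
  pattern: { email: { $regex: "@example\\.com$" } },
  // Nested document query: dot notation reaches into a sub-document.
  nested: { "address.city": "Pune" },
  // Array queries
  arrayContains: { tags: "nosql" },   // the array contains this value
  arraySize: { tags: { $size: 3 } },  // the array has exactly three elements
};

// Example use with a Node.js driver Collection named `users`:
//   const adults = await users.find(filters.range).toArray();
```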
ii. Aggregation Framework:
• Aggregation Pipelines: Allows for data processing through
stages such as filtering, grouping, sorting, and transforming
data. This is useful for generating reports and insights from the
data.
• Map-Reduce: Some document databases support map-reduce
operations, enabling large-scale data processing tasks. In
MongoDB, map-reduce is deprecated as of version 5.0 in favor
of aggregation pipelines.
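A short sketch of an aggregation pipeline with the stages named above (filtering, grouping, sorting, and transforming). The stage operators are standard MongoDB; the orders collection and its status, customerId, and amount fields are assumptions made for the example.

```javascript
// Pipeline for orders.aggregate(); collection and field names are hypothetical.
const revenueByCustomer = [
  // Filter: keep only paid orders.
  { $match: { status: "paid" } },
  // Group: one result per customer with a total and an order count.
  { $group: { _id: "$customerId", total: { $sum: "$amount" }, orders: { $sum: 1 } } },
  // Sort: highest revenue first.
  { $sort: { total: -1 } },
  // Transform: rename _id to customerId in the output.
  { $project: { _id: 0, customerId: "$_id", total: 1, orders: 1 } },
];

// Example use with a Node.js driver Collection named `orders`:
//   const report = await orders.aggregate(revenueByCustomer).toArray();
```

Placing $match first keeps the later stages working on as few documents as possible.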
iii. Indexing:
• Single Field Indexes: Indexing on specific fields to optimize
query performance.
• Compound Indexes: Multi-field indexes that allow for faster
queries on multiple criteria.
• Text Indexes: Full-text search capabilities for querying text
fields within documents.
• Geospatial Indexes: Specialized indexes for querying
geospatial data, such as finding documents within a certain
geographic radius.
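The four index types can be declared as follows, assuming the MongoDB Node.js driver, whose createIndex method takes a key specification and resolves to the index name. The field names and the ensureIndexes helper are hypothetical.

```javascript
// Index key specifications; field names are hypothetical.
const indexSpecs = [
  { email: 1 },                      // single field, ascending
  { customerId: 1, createdAt: -1 },  // compound
  { description: "text" },           // text index for full-text search
  { location: "2dsphere" },          // geospatial index over GeoJSON data
];

// `collection` is assumed to be a Node.js driver Collection.
async function ensureIndexes(collection) {
  const names = [];
  for (const keys of indexSpecs) {
    // createIndex resolves to the name of the index.
    names.push(await collection.createIndex(keys));
  }
  return names;
}
```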
iv. Flexible Query Options:
• Query Operators: Support for a variety of operators,
including comparison operators ($eq, $gt, $lt), logical
operators ($and, $or, $not), and element operators ($exists,
$type).
• Regular Expressions: Allows for pattern matching within
string fields, useful for more complex search criteria.
• Projection: You can specify which fields to include or exclude
in the query results, reducing the amount of data returned.
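The sketch below combines a comparison operator, a logical operator, an element operator, a regular expression, and a projection in one query. It assumes the MongoDB Node.js driver, where find accepts a projection option; the users collection, its fields, and the function name are hypothetical.

```javascript
// `users` is assumed to be a Node.js driver Collection; fields are hypothetical.
function findAdultGmailUsersWithPhone(users) {
  const filter = {
    $and: [                                    // logical operator
      { age: { $gt: 18 } },                    // comparison operator
      { phone: { $exists: true } },            // element operator
      { email: { $regex: /@gmail\.com$/i } },  // regular expression
    ],
  };
  // Projection: return only name and email, and drop _id.
  const options = { projection: { name: 1, email: 1, _id: 0 } };
  return users.find(filter, options).toArray();
}
```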
v. Real-Time Queries:
• Change Streams: Some document databases offer the ability
to listen to real-time changes in the data, allowing for
reactive applications that update based on data changes.
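• TTL Indexes: Time-to-Live indexes automatically remove
documents a set time after the value of a date field, so that
collections such as sessions or logs hold only current data.
Removal runs as a background task, so expired documents can
remain visible for a short while.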
vi. Joins and Lookup:
• Embedded Joins: Document databases often embed related data in one
document to avoid the need for joins, but some support a lookup
operation (in MongoDB, the $lookup aggregation stage) to join data
from different collections.
• Cross-Collection Queries: In some document databases, you can
perform queries that combine data from multiple collections,
similar to SQL joins.
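In MongoDB, such a join is written as a $lookup stage in an aggregation pipeline. A minimal sketch follows; the orders and customers collections and the customerId field are hypothetical.

```javascript
// Joins each order with its customer document. Collection and field names
// are hypothetical.
const ordersWithCustomer = [
  {
    $lookup: {
      from: "customers",         // collection to join with
      localField: "customerId",  // field in the orders documents
      foreignField: "_id",       // field in the customers documents
      as: "customer",            // name of the output array field
    },
  },
  // Each order has one customer, so flatten the one-element array.
  { $unwind: "$customer" },
];

// Example use with a Node.js driver Collection named `orders`:
//   const rows = await orders.aggregate(ordersWithCustomer).toArray();
```

This behaves like a left outer join up to the $lookup stage; the $unwind stage then drops orders that have no matching customer.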
vii. Pagination and Sorting:
• Limit and Skip: Query results can be paginated using limit and skip
parameters, which is useful for handling large datasets in chunks.
• Sorting: Results can be sorted by one or more fields, in ascending or
descending order.
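Page numbers translate into skip and limit values with simple arithmetic. The sketch below assumes the MongoDB Node.js driver's find cursor, which offers sort, skip, and limit methods; the helper names and the createdAt field are hypothetical.

```javascript
// Converts a 1-based page number into skip/limit values.
function pageWindow(page, pageSize) {
  if (!Number.isInteger(page) || page < 1) throw new RangeError("page must be >= 1");
  if (!Number.isInteger(pageSize) || pageSize < 1) throw new RangeError("pageSize must be >= 1");
  return { skip: (page - 1) * pageSize, limit: pageSize };
}

// `collection` is assumed to be a Node.js driver Collection.
function findPage(collection, filter, page, pageSize) {
  const { skip, limit } = pageWindow(page, pageSize);
  return collection
    .find(filter)
    .sort({ createdAt: -1, _id: 1 }) // fixed order so pages do not overlap
    .skip(skip)
    .limit(limit)
    .toArray();
}
```

Sorting before paging matters: without a fixed order, the same document could appear on two pages or on none.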
viii. Faceted Search:
• Facets: Some document databases support faceted search, which
allows for filtering and categorizing results based on various criteria,
commonly used in e-commerce and search applications.
ix. Text Search:
• Full-Text Search: Document databases may offer
full-text search capabilities with options for relevance
scoring, stemming, and tokenization.
• Wildcard and Fuzzy Search: Enables more flexible text
queries, such as searching for words with similar
spelling or partial matches.
x. Security and Access Control:
• Role-Based Access: Query permissions can be
controlled based on user roles, ensuring that only
authorized users can execute certain queries.
• Field-Level Encryption: Certain fields can be encrypted
and queried securely, protecting sensitive data.
5. Scaling
Scaling a MongoDB database involves addressing both horizontal and
vertical scaling strategies to ensure your database can handle
increased load and data volume effectively.
1. Vertical Scaling
• Vertical scaling involves increasing the resources (CPU, RAM,
storage) of a single MongoDB server.
• This approach can be effective for smaller-scale deployments or
for instances where the data volume and load are manageable.
Pros:
• Simpler to implement compared to horizontal scaling.
• Requires less reconfiguration of the database setup.
Cons:
• There's a physical limit to how much you can scale a single
server.
• Can become expensive and may not address issues related to
high availability.
2. Horizontal Scaling
• Horizontal scaling involves adding more servers to distribute
the load and data. MongoDB supports horizontal scaling
through sharding.
Sharding
• Sharding is the process of distributing data across multiple
servers (shards). This allows MongoDB to handle larger
datasets and more operations by spreading the load.
Key Components:
• Shards: Individual MongoDB servers or replica sets that hold
a portion of the data.
• Config Servers: Manage metadata and configuration settings
for the cluster.
• Query Routers (mongos): Direct queries to the appropriate
shards based on the data distribution.
Steps to Implement Sharding:
1. Choose a Shard Key:
• Select a field or set of fields that will be used to
distribute data across shards.
• The choice of shard key is crucial as it affects the
performance and balance of data distribution.
2. Set Up Config Servers:
• Deploy and configure the config servers, which store
metadata about the sharded cluster.
3. Deploy Shards:
• Set up and configure the MongoDB instances that will
act as shards.
4. Configure Mongos Routers:
• Set up mongos instances to route client requests to
the appropriate shards.
5. Enable Sharding on Databases and Collections:
• Use the shardCollection command to enable
sharding on specific collections within the
database.
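The final step above can be sketched as the command documents sent to the cluster through a mongos router. enableSharding and shardCollection are MongoDB administrative commands; the shop database, orders collection, customerId shard key, and the shardingCommands helper are hypothetical.

```javascript
// Command documents for db.adminCommand(), sent through a mongos router.
// The database, collection, and shard-key field names are hypothetical.
function shardingCommands(database, collection, shardKeyField) {
  return [
    { enableSharding: database },
    {
      shardCollection: `${database}.${collection}`,
      // "hashed" spreads keys evenly; use 1 for range-based sharding.
      key: { [shardKeyField]: "hashed" },
    },
  ];
}

const commands = shardingCommands("shop", "orders", "customerId");

// Example use with a Node.js driver MongoClient connected to mongos:
//   for (const command of commands) await client.db("admin").command(command);
```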
Pros:
• Scales out by adding more servers, which can be
cost-effective and flexible.
• Improves availability and fault tolerance.
Cons:
• More complex to set up and manage.
• Requires careful planning of shard key to avoid
issues like hotspotting (where a single shard
handles too much traffic).
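Hotspotting can be illustrated with a small routing simulation. This is a toy model, not MongoDB's implementation: the chunk boundaries are made up, and the FNV-1a hash is only a stand-in for a hashed shard key. It shows why a monotonically increasing key, such as a counter or timestamp, sends every new insert to one shard under range-based routing but not under hash-based routing.

```javascript
// FNV-1a string hash: a stand-in for a hashed shard key, chosen only for
// illustration (MongoDB uses its own hash function).
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Range-based routing: shard i owns keys below upperBounds[i]; the last
// shard owns everything above the final bound.
function routeByRange(key, upperBounds) {
  const index = upperBounds.findIndex((bound) => key < bound);
  return index === -1 ? upperBounds.length : index;
}

// Hash-based routing: the hash of the key picks the shard.
function routeByHash(key, shardCount) {
  return fnv1a(String(key)) % shardCount;
}

// 100 new documents with a monotonically increasing key (e.g. a counter).
const newKeys = Array.from({ length: 100 }, (_, i) => 1000 + i);

const rangeShards = new Set(newKeys.map((k) => routeByRange(k, [300, 600])));
const hashShards = new Set(newKeys.map((k) => routeByHash(k, 3)));

console.log([...rangeShards]); // every insert lands on the last shard
console.log(hashShards.size);  // inserts are spread over more than one shard
```

The trade-off is that hashing scatters neighboring keys, so range queries on the shard key have to visit several shards.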
3. Replication
• In addition to scaling, ensure that you have proper replication in place for
high availability and fault tolerance. MongoDB uses replica sets to provide
redundancy and data availability.
Key Components:
• Primary: The node that receives all write operations.
• Secondaries: Nodes that replicate data from the primary and can serve
read requests (depending on the read preference settings).
Pros:
• Provides automatic failover in case the primary node fails.
• Improves read availability by allowing read operations to be served by
secondary nodes.
Cons:
• Adds complexity to the setup and requires regular maintenance.
• Write operations are only performed on the primary node, which can be
a bottleneck if not properly scaled.
4. Monitoring and Maintenance
• Regardless of your scaling strategy, effective monitoring and maintenance are crucial.
Use tools like MongoDB Atlas (for managed deployments) or open-source monitoring
solutions to track performance, resource usage, and potential issues.
• Key Metrics to Monitor:
• Query performance and latency
• Resource utilization (CPU, memory, disk I/O)