MongoDB Updated
Introduction to MongoDB
What is MongoDB?
Features of MongoDB
Types of Databases (SQL vs NoSQL)
Why does MongoDB use BSON?
BSON Advantages
Alternatives to MongoDB (Cassandra, Redis, DynamoDB, HBase, OrientDB)
Collections
Insert vs Save
Update vs UpdateOne vs UpdateMany
Delete Operations (DeleteOne, DeleteMany)
Basic Query Operations (find, findOne)
Cursors
Admin Database
How to List Collections
How to Modify a Collection Name(db.collection.renameCollection())
4. Indexing in MongoDB
What is Indexing?
Single Field Index
Compound Index
Multi-Key Index
Geospatial Index
Text Index
Hashed index
Covered Queries
How to Create Indexes (db.collection.createIndex())
Indexing Best Practices
Clustered Index vs Non-Clustered Index
Clustered Collections
5. Aggregation Framework
6. Data Modeling
9. Replication
What is Replication?
Primary and Secondary Replica Set
How Many Nodes in a Replica Set?
Voting in Replication
Difference Between GridFS and Sharding
10. Sharding
What is Sharding?
Components of Sharding
Query Routing in Sharding
Advantages and Disadvantages of Sharding
Sharding vs Replication
Sharding Best Practices
CAP Theorem
Capped Collections
How to Create a Capped Collection
11. GridFS
What is GridFS?
Difference Between GridFS and Sharding
GridFS vs Traditional File Storage
Transactions in MongoDB
ACID Compliance
Batch Sizing
Upsert Operations
Use Cases for Transactions
CAP Theorem
TTL (Time to Live)
Data Redundancy
Clustered Collections
Materialized Views
View collections
Decrement Operations
What is MongoDB?
MongoDB is a modern, open-source NoSQL database designed to handle large volumes of unstructured and semi-structured data. Instead of using tables like traditional databases, it stores data in flexible, JSON-like documents called BSON. This means you can change the structure of your data without costly schema migrations. MongoDB is great for applications that need to process large amounts of data quickly, like real-time analytics and big data projects. It’s also easy to use and works well with modern development tools, making it a popular choice for developers.
Features of MongoDB
Key features include document-oriented storage, a flexible schema, rich ad-hoc queries, secondary indexes, replication for high availability, sharding for horizontal scaling, and a built-in aggregation framework.
Types of Databases (SQL vs NoSQL)
SQL Databases
Structured Data: SQL databases store data in tables with rows and columns, similar to a
spreadsheet.
Fixed Schema: You need to define the structure of your data (schema) before you can store it.
Relational: Data is organized in a way that allows relationships between different tables.
ACID Compliance: Ensures reliable transactions with properties like Atomicity, Consistency,
Isolation, and Durability.
Vertical Scalability: Typically scaled by increasing the power of a single server (e.g., adding more
CPU, RAM).
Examples: MySQL, PostgreSQL, Oracle, SQL Server.
NoSQL Databases
Flexible Data: NoSQL databases store data in various formats like documents, key-value pairs,
graphs, or wide-columns.
Schema-less: You don’t need to define the structure of your data in advance, allowing for more
flexibility.
Non-Relational: Data is often stored without strict relationships, making it easier to handle
unstructured data.
High Scalability: Designed to scale out horizontally by adding more servers.
Eventual Consistency: Some NoSQL databases prioritize availability and partition tolerance over
immediate consistency.
Key Differences
Structure: SQL uses structured tables, while NoSQL uses flexible formats.
Schema: SQL requires a predefined schema; NoSQL does not.
Scalability: SQL scales vertically; NoSQL scales horizontally.
Use Cases: SQL is great for complex queries and transactions; NoSQL is ideal for large volumes of
unstructured data and real-time applications.
Why does MongoDB use BSON?
“MongoDB uses BSON (Binary JSON) because it is a binary format that is more efficient for storage and retrieval, supports a wider range of data types, and allows for faster parsing and flexibility in representing complex data structures.”
1. BSON is a binary format, which means it can store data more compactly than plain
text JSON. This helps in saving storage space.
2. BSON is designed to be fast to encode and decode. This makes data retrieval and
storage operations quicker.
4. BSON supports more data types than JSON, such as dates and binary data. This allows MongoDB to handle a wider variety of data efficiently.
4. BSON is designed to be traversable, meaning MongoDB can easily navigate
through the data to perform operations like queries and indexing.
5. BSON maintains the order of keys in documents, which can be important for certain applications.
BSON Advantages
Advantages of BSON
1. Efficiency: BSON is a binary format, which makes it faster to read and write compared to text-based formats like JSON.
2. Compactness: It generally results in smaller file sizes, saving storage space and improving
transmission speeds
3. Rich Data Types: BSON supports a wider range of data types, including dates and binary data,
which JSON does not
4. Speed: The binary encoding allows for quicker parsing and efficient data traversal
5. Flexibility: It supports nested documents and arrays, making it easier to represent complex data
structures
Disadvantages of BSON
1. Space Efficiency: While BSON is compact, it can sometimes be less space-efficient than JSON due
to additional metadata.
2. Human Readability: BSON is not human-readable, which can make debugging and manual data
inspection more challenging.
3. Complexity: The binary format can be more complex to work with compared to the simpler, text-based JSON.
MongoDB supports a variety of data types to handle different kinds of information. Here are some of the key data types: String, Integer, Double, Boolean, Date, ObjectId, Array, Embedded Document, Null, and Binary Data.
These data types allow MongoDB to handle a wide range of data and provide flexibility in how you store and manage your information.
An ObjectId in MongoDB is a unique identifier for documents. It is 12 bytes in size and consists of the following components:
4-byte Timestamp: Represents the creation time of the ObjectId, measured in seconds since the Unix epoch.
5-byte Random Value: Generated once per process, unique to the machine and process.
3-byte Counter: An incrementing counter, initialized to a random value.
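The layout above can be sketched in a few lines of Python (an illustrative stand-in for what MongoDB drivers do internally, not driver code):

```python
import os
import struct
import time

def make_objectid_like() -> str:
    """Build a 12-byte identifier with the same layout as a MongoDB ObjectId:
    4-byte big-endian timestamp + 5 random bytes + 3-byte counter."""
    timestamp = struct.pack(">I", int(time.time()))   # 4 bytes: seconds since the Unix epoch
    random_value = os.urandom(5)                      # 5 bytes: per-process random value
    counter = os.urandom(3)                           # 3 bytes (real drivers increment a counter)
    return (timestamp + random_value + counter).hex() # rendered as 24 hex characters

oid = make_objectid_like()
print(oid)  # e.g. a 24-character hex string

# The creation time can be recovered from the first 4 bytes:
created = struct.unpack(">I", bytes.fromhex(oid[:8]))[0]
```

Because the timestamp leads, ObjectIds generated later sort roughly after earlier ones, which is why sorting by _id approximates insertion order.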
Embedded Documents
Embedded documents in MongoDB are documents stored within other documents, creating a
nested structure. This approach is useful for storing related data together, making it easier to access
and manage.
Example:
Imagine you have a user document that includes the user’s address. Instead of storing the address
in a separate collection, you can embed it directly within the user document:
JSON
{
  "_id": 111111,
  "email": "[email protected]",
  "name": {
    "given": "Jane",
    "family": "Han"
  },
  "address": {
    "street": "111 Elm Street",
    "city": "Springfield",
    "state": "Ohio",
    "country": "US",
    "zip": "00000"
  }
}
Benefits:
Related data is stored and retrieved together in a single query, with no joins or $lookup needed, which improves read performance and keeps the data model simple.
When to Use References Instead:
When the embedded data grows too large, making the document unwieldy.
When the data has complex relationships that are better managed with references.
3. Basic MongoDB Operations
Collections
Insert vs Save
o Insert: Adds a new document to a collection. If the document already exists, it will
not be added again.
o Save: If the document has an _id field and it matches an existing
document, save will update that document. If there’s no match, it will insert the
document as a new one.
Comparison Operators
1. $eq: Matches values that are equal to a specified value.
2. $ne: Matches values that are not equal to a specified value.
3. $gt: Matches values that are greater than a specified value.
4. $gte: Matches values that are greater than or equal to a specified value.
5. $lt: Matches values that are less than a specified value.
6. $lte: Matches values that are less than or equal to a specified value.
7. $in: Matches any of the values specified in an array.
8. $nin: Matches none of the values specified in an array.
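The semantics of these operators can be illustrated with a small Python matcher (a hypothetical helper written for these notes, not part of any MongoDB driver):

```python
def matches(value, condition: dict) -> bool:
    """Evaluate a MongoDB-style comparison condition against a single value.
    Supports a subset of operators: $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin."""
    ops = {
        "$eq":  lambda v, x: v == x,
        "$ne":  lambda v, x: v != x,
        "$gt":  lambda v, x: v > x,
        "$gte": lambda v, x: v >= x,
        "$lt":  lambda v, x: v < x,
        "$lte": lambda v, x: v <= x,
        "$in":  lambda v, x: v in x,
        "$nin": lambda v, x: v not in x,
    }
    # All operators in the condition must hold, mirroring MongoDB's implicit AND
    return all(ops[op](value, operand) for op, operand in condition.items())

print(matches(25, {"$gte": 18, "$lt": 65}))            # True: 18 <= 25 < 65
print(matches("Apple", {"$in": ["Apple", "Banana"]}))  # True
```

Note how several operators on one field combine with an implicit AND, exactly as in a query like { age: { $gte: 18, $lt: 65 } }.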
Logical Operators
1. $and: Joins query clauses with a logical AND, returning all documents that
match the conditions of both clauses.
2. $or: Joins query clauses with a logical OR, returning all documents that match
the conditions of either clause.
3. $not: Inverts the effect of a query expression and returns documents that do
not match the query expression.
4. $nor: Joins query clauses with a logical NOR, returning all documents that fail
to match both clauses.
Element Operators
1. $exists: Matches documents that have the specified field.
2. $type: Matches documents that have a field of the specified type.
Evaluation Operators
1. $regex: Matches documents where the value of a field matches a specified regular
expression.
2. $expr: Allows the use of aggregation expressions within the query language.
3. $jsonSchema: Validates documents against the given JSON Schema.
Array Operators
1. $all: Matches arrays that contain all elements specified in the query.
2. $elemMatch: Matches documents that contain an array field with at least one
element that matches all the specified query criteria.
3. $size: Matches any array with the specified number of elements.
Geospatial Operators
1. $geoWithin: Selects documents with geospatial data that exist entirely within a
specified shape.
2. $geoIntersects: Selects documents with geospatial data that intersect with a specified
shape.
3. $near: Returns documents in order of proximity to a specified point.
Cursors
A cursor is an object that allows you to iterate over the results of a query. When you
use find, it returns a cursor, which you can use to access each document one by one.
Admin Database
The admin database is a special database used for administrative commands and server-wide user management.
How to List Collections
To list all collections in a database, you can use the listCollections command or the show collections command in the MongoDB shell.
How to Modify a Collection Name
db.oldCollectionName.renameCollection("newCollectionName")
4. Indexing in MongoDB
What is Indexing?
Indexes are special data structures that store a small portion of the collection’s data in
an easy-to-traverse form. They are similar to the index in a book, which helps you quickly
find the information you need without having to read through the entire book.
Purpose: They make it faster to retrieve documents from a collection by reducing the
amount of data MongoDB needs to scan.
Types: MongoDB supports various types of indexes, including single field, compound,
multi-key, text, and geospatial indexes.
Creation: You can create an index on a collection using the createIndex method.
Usage: When you query a collection, MongoDB uses the index to quickly locate the
required documents.
For example, if you have a collection of books and you frequently search by the author’s
name, you can create an index on the author field to speed up these queries.
Default _id Index:
Every collection automatically has a default index on the _id field. This index is created when the collection is created and ensures that each document in the collection has a unique identifier.
Single Field Index
o A single field index is an index on one field of a document. For example:
db.collection.createIndex({ author: 1 })
Compound Index
o A compound index is an index on multiple fields. This is useful for queries that filter on multiple fields. For example:
db.collection.createIndex({ author: 1, title: 1 })
Multi-Key Index
o A multi-key index is used for indexing fields that hold arrays. MongoDB creates
an index entry for each element of the array. For example:
db.collection.createIndex({ tags: 1 })
Geospatial Index
o A geospatial index is used for querying geospatial data. MongoDB supports 2D
and 2DSphere indexes for different types of geospatial queries. For example:
db.collection.createIndex({ location: "2dsphere" })
Text Index
o A text index is used for text search queries. It indexes the content of string fields for efficient text search. For example:
db.collection.createIndex({ description: "text" })
Covered Queries
A covered query is a query where all the fields in the query are part of an index. This means
MongoDB can satisfy the query using only the index, without scanning any documents.
This can significantly improve performance.
Indexing Best Practices
o Analyze Query Patterns: Create indexes based on the fields that are frequently queried.
o Limit the Number of Indexes: Each index consumes disk space and affects write
performance.
o Use Compound Indexes Wisely: Ensure the order of fields in compound indexes
matches the query patterns.
o Monitor Index Usage: Use tools like MongoDB Atlas Performance Advisor to
monitor and optimize index usage.
Clustered Index vs Non-Clustered Index
o Clustered Index: MongoDB does not support clustered indexes in the traditional sense. However, the _id field in MongoDB is automatically indexed and can be considered similar to a clustered index.
o Non-Clustered Index: All other indexes in MongoDB are non-clustered. They store
a reference to the actual data rather than the data itself.
5. Aggregation Framework
1. Aggregation Pipeline:
o The aggregation framework works like a pipeline where data passes
through various stages, with each stage performing an operation on the
data.
o The output of one stage becomes the input for the next stage.
Basic Aggregation Pipeline Stages:
1. $match:
The $match stage filters documents, passing along only those that satisfy the given condition (similar to a find query).
Example: Keep only users from the USA.
db.user.aggregate([
{
$match: { "country": "USA" } // Filters documents where 'country' is 'USA'
}
])
2. $group:
The $group stage groups documents by a specified field or fields and performs
aggregation operations such as $sum, $avg, $max, etc., on each group.
Example: Group users by favoriteFruit and calculate the total count of users
in each group.
db.user.aggregate([
{
$group: {
_id: "$favoriteFruit", // Group by 'favoriteFruit'
count: { $sum: 1 } // Count the number of users in each group
}
}
])
3. $sort:
The $sort stage orders the documents by the given field(s): 1 for ascending, -1 for descending.
db.user.aggregate([
{
$sort: { "age": -1 } // Sort documents by 'age' in descending order
}
])
4. $project:
The $project stage reshapes each document by including, excluding, or computing fields.
Example: Return only the name and age fields.
db.user.aggregate([
{
$project: {
name: 1, // Include the 'name' field
age: 1 // Include the 'age' field
}
}
])
You can also compute new fields:
db.user.aggregate([
{
$project: {
name: 1, // Include 'name'
ageInFiveYears: { $add: ["$age", 5] } // Add 5 to 'age' and expose it as 'ageInFiveYears'
}
}
])
5. $limit:
The $limit stage restricts the number of documents passed to the next stage of
the pipeline. It is useful when you need to return a specific number of
documents, such as in pagination.
Example: Limit the result to the first 5 documents.
db.user.aggregate([
{
$limit: 5 // Return only the first 5 documents
}
])
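The stages above can be simulated over plain Python dictionaries to show how each stage's output feeds the next (a sketch with invented sample data, not MongoDB driver code):

```python
from collections import Counter

users = [
    {"name": "Ann", "country": "USA", "favoriteFruit": "Apple", "age": 34},
    {"name": "Bob", "country": "USA", "favoriteFruit": "Banana", "age": 29},
    {"name": "Eve", "country": "UK",  "favoriteFruit": "Apple", "age": 41},
]

# $match: keep only documents where country == "USA"
matched = [u for u in users if u["country"] == "USA"]

# $group: group by favoriteFruit and count users in each group
counts = Counter(u["favoriteFruit"] for u in matched)
grouped = [{"_id": fruit, "count": n} for fruit, n in counts.items()]

# $sort: order the groups by count, descending
grouped.sort(key=lambda g: g["count"], reverse=True)

# $limit: pass at most 5 groups to the next stage (here, the final result)
result = grouped[:5]
print(result)
```

Each intermediate variable plays the role of the document stream flowing between pipeline stages.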
$or: Matches documents where at least one of the conditions in the array is true.
o Example: Find users who are either from the USA or are older than 30.
db.user.find({
$or: [
{ country: "USA" },
{ age: { $gt: 30 } }
]
})
$in: Matches any documents where the field’s value is in the specified array.
o Example: Find users whose favorite fruit is either "Apple" or "Banana."
db.user.find({
favoriteFruit: { $in: ["Apple", "Banana"] }
})
$exists: Checks if a field exists in a document.
o Example: Find users that have an email field.
db.user.find({
email: { $exists: true }
})
2. $facet
$facet: Allows running multiple aggregation pipelines within a single query and outputs a document containing the results of all pipelines.
o Example: Run two facets: one to group by favoriteFruit and count
users, and another to get the average age of users.
db.user.aggregate([
{
$facet: {
fruitCounts: [
{ $group: { _id: "$favoriteFruit", count: { $sum: 1 } } }
],
averageAge: [
{ $group: { _id: null, avgAge: { $avg: "$age" } } }
]
}
}
])
3. $lookup
$lookup: Performs a left outer join between two collections. Useful for joining
data from different collections in MongoDB.
o Example: Join orders collection with users collection to include user
details in each order.
db.orders.aggregate([
{
$lookup: {
from: "users", // The collection to join
localField: "userId", // Field from 'orders'
foreignField: "_id", // Field from 'users'
as: "userDetails" // Name for the resulting joined field
}
}
])
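The left-outer-join behavior of $lookup can be mimicked in Python (sample data and field names are invented for illustration):

```python
users = [
    {"_id": 1, "name": "Ann"},
    {"_id": 2, "name": "Bob"},
]
orders = [
    {"orderId": "A100", "userId": 1, "total": 50},
    {"orderId": "A101", "userId": 3, "total": 20},  # no matching user
]

# Left outer join: every order is kept; userDetails is an empty list
# when no user matches, just as $lookup produces an empty array.
joined = [
    {**order, "userDetails": [u for u in users if u["_id"] == order["userId"]]}
    for order in orders
]

print(joined[0]["userDetails"])  # the matching user document
print(joined[1]["userDetails"])  # [] — order kept even without a match
```

The key point is that unmatched orders are not dropped; that is what makes it a *left outer* join rather than an inner join.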
4. $merge
$merge: Writes the results of the aggregation pipeline into the specified collection.
db.orders.aggregate([
// ... Your aggregation pipeline ...
{ $merge: "aggregatedResults" } // Merge the output into 'aggregatedResults'
])
5. $unwind
$unwind: Deconstructs an array field, outputting one document per array element.
o Example: Produce one document per hobby in each user’s hobbies array.
db.user.aggregate([
{ $unwind: "$hobbies" }
])
$addToSet: Adds a value to an array, only if the value does not already exist in
the array (like a set).
o Example: Add a hobby to a user’s hobbies array, only if it doesn’t
already exist.
db.user.updateOne(
{ _id: userId },
{ $addToSet: { hobbies: "reading" } })
$push: Appends a value to an array, even if it already exists.
o Example: Add "gaming" to a user’s hobbies array.
db.user.updateOne(
{ _id: userId },
{ $push: { hobbies: "gaming" } }
)
$pull: Removes all occurrences of a value from an array.
o Example: Remove "gaming" from a user’s hobbies array.
db.user.updateOne(
{ _id: userId },
{ $pull: { hobbies: "gaming" } }
)
$all: Matches documents where the array field contains all the specified
elements.
o Example: Find users whose hobbies include both "reading" and
"traveling."
db.user.find({
hobbies: { $all: ["reading", "traveling"] }
})
$nin: Matches documents where the field’s value is not in the specified array.
o Example: Find users whose favorite fruit is neither "Apple" nor
"Banana."
db.user.find({
favoriteFruit: { $nin: ["Apple", "Banana"] }
})
$ne: Matches documents where the field’s value is not equal to the specified
value.
o Example: Find users who do not live in the USA.
db.user.find({
country: { $ne: "USA" }
})
8. $cond, $expr
$cond: A ternary (if/then/else) operator for aggregation expressions.
o Example: Label each user "Adult" or "Minor" based on age.
db.user.aggregate([
{
$project: {
name: 1,
status: {
$cond: { if: { $gte: ["$age", 18] }, then: "Adult", else: "Minor" }
}
}
}
])
$expr: Lets you use aggregation expressions inside a find query.
o Example: Find users whose age is greater than their yearsOfExperience.
db.user.find({
$expr: { $gt: ["$age", "$yearsOfExperience"] }
})
Map-Reduce
How it works: Map-Reduce involves two functions: map and reduce. The map function processes
each document and emits key-value pairs. The reduce function then processes these pairs to
aggregate the results.
Flexibility: It allows for complex operations using JavaScript, making it highly flexible.
Performance: Generally slower and less efficient compared to the Aggregation Framework,
especially for large datasets.
Use Cases: Suitable for complex data processing tasks that require custom JavaScript functions.
db.sales.mapReduce(
function () { emit(this.category, this.amount); }, // map: emit (category, amount) pairs
function (key, values) { return Array.sum(values); }, // reduce: sum the amounts per category
{ out: "total_sales_by_category" }
);
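The two phases can be sketched in Python over sample data (a toy model of map-reduce, not MongoDB's implementation):

```python
from collections import defaultdict

sales = [
    {"category": "books", "amount": 10},
    {"category": "toys",  "amount": 25},
    {"category": "books", "amount": 5},
]

# Map phase: each document emits a (key, value) pair;
# pairs with the same key are collected together.
emitted = defaultdict(list)
for doc in sales:
    emitted[doc["category"]].append(doc["amount"])  # emit(category, amount)

# Reduce phase: combine all values emitted under the same key.
totals = {key: sum(values) for key, values in emitted.items()}
print(totals)  # {'books': 15, 'toys': 25}
```

The same result is produced by the $group pipeline shown below, which is why the Aggregation Framework is usually preferred for tasks like this.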
Aggregation Framework
How it works: Uses a pipeline of stages to process data. Each stage transforms the documents as
they pass through the pipeline.
Built-in Operators: Includes a variety of built-in operators for filtering, grouping, sorting, and
transforming data.
Performance: More efficient and faster than Map-Reduce, especially for large datasets.
Use Cases: Ideal for most aggregation tasks due to its performance and ease of use.
db.sales.aggregate([
{ $group: { _id: "$category", totalSales: { $sum: "$amount" } } }
]);
When to Use Map-Reduce vs. Aggregation Framework
Conclusion
Both Map-Reduce and the Aggregation Framework are powerful tools for data
aggregation in MongoDB. The choice between them depends on your specific use case.
For most standard data processing tasks, the Aggregation Framework is the better
option due to its performance and ease of use. However, for more complex or highly
customized data transformations, Map-Reduce may still be the appropriate choice.
Covered Query
A Covered Query in MongoDB is a type of query where MongoDB can get all the
information it needs from the index itself, without having to look at the actual documents in
the collection. This makes the query much faster because MongoDB doesn't need to read any
extra data from the disk.
1. All the fields used in the query must be part of the index.
2. The query only asks for fields that are in the index (no extra fields).
3. The index is used for filtering, sorting, and retrieving the results.
Example:
Let's say you have a collection called users, and each document looks like this:
{
  "name": "Alice",
  "age": 25,
  "email": "[email protected]"
}
Create an index on name and age, then query using only those fields:
db.users.createIndex({ name: 1, age: 1 })
db.users.find({ name: "Alice" }, { name: 1, age: 1, _id: 0 })
This query:
Filters by name.
Projects (returns) only the name and age fields, and excludes _id, which is not part of the index.
Since both name and age are part of the index, MongoDB can get the results directly from the index without reading the full document. This makes the query a covered query.
Why is it good?
Faster queries: Since MongoDB doesn’t need to fetch the actual documents, it saves time.
Less data to process: MongoDB only works with the index, so it's quicker and uses fewer
resources.
6. Data Modeling
Relational vs Embedded Data Models
Relational: Data is stored in separate tables, and relationships are defined using
foreign keys. Think of it like a spreadsheet where each sheet is a table, and you link
them using unique IDs.
Embedded: Data is stored within a single document. It’s like having all related
information in one place, like a nested list or a JSON object.
Normalization vs Denormalization
Normalization: Splitting data into multiple tables to reduce redundancy. It’s like
organizing your files into different folders to avoid duplicates.
Denormalization: Combining related data into a single table to improve read
performance. It’s like putting all your important documents in one folder for quick
access.
When to Use Embedded Documents
Use embedded documents when related data is always read and written together, the relationship is one-to-one or one-to-few, and the embedded data does not grow without bound.
CRUD Operations
CRUD stands for Create, Read, Update, and Delete—the four basic operations of persistent storage in a database. In MongoDB, these operations are performed on documents within collections.
BulkWrite Operations
BulkWrite operations allow you to perform multiple write operations (insert, update,
delete) in a single request. This can improve performance when dealing with large
numbers of documents.
db.collection.bulkWrite([
{ insertOne: { document: { name: "Bob", age: 30 } } },
{ updateOne: { filter: { name: "Alice" }, update: { $set: { age: 26 } } } },
{ deleteOne: { filter: { name: "John" } } }
]);
Upsert Operation
db.collection.updateOne(
{ name: "Charlie" },
{ $set: { age: 28 } },
{ upsert: true }
);
In MongoDB, save() is a convenient way to perform both insert and update operations,
but it has been deprecated in favor of using insertOne() and updateOne() for clarity.
Aggregation in CRUD
db.collection.aggregate([
{ $match: { age: { $gt: 20 } } },
{ $group: { _id: "$age", totalUsers: { $sum: 1 } } },
{ $sort: { totalUsers: -1 } }
]);
In this example:
$match filters for users older than 20.
$group groups them by age and counts the users at each age.
$sort orders the groups by their count, descending.
This is a powerful way to perform operations like filtering, grouping, and sorting in a single query.
The $regex operator allows you to search for strings that match a particular pattern, defined
using a regular expression (regex). It’s particularly useful for partial string matches or more
complex text searches.
Example: Find users whose name starts with "A", case-insensitively:
db.users.find({ name: { $regex: "^A", $options: "i" } })
^A: Matches any string that starts with the letter "A".
i: Case-insensitive matching.
The $expr operator allows you to use aggregation expressions within the find query. It’s
useful when you need to compare fields within a document or perform calculations.
Example: Find documents where age is greater than score:
db.users.find({ $expr: { $gt: ["$age", "$score"] } })
$gt: Checks if age is greater than score within the same document.
You can use any aggregation expression with $expr, including $add, $subtract, $and, etc.
The $elemMatch operator is used to match documents that contain an array field, where at
least one element in the array matches the specified condition(s).
Example:
Find all users who have a scores array with at least one score greater than 80:
db.users.find({ scores: { $elemMatch: { $gt: 80 } } })
In this example, MongoDB will return documents where the scores array has at least one element greater than 80.
To require a single element to satisfy several conditions at once:
db.users.find({ scores: { $elemMatch: { $gt: 80, $lt: 90 } } })
This will return documents where the scores array has at least one element that is greater than 80 but less than 90.
The $exists operator checks whether a particular field exists in a document. It’s useful for
finding documents that either have or lack a specific field.
Example: Find users that do not have a phone field:
db.users.find({ phone: { $exists: false } })
$exists: false: The field phone must be absent from the document.
9. Replication
What is Replication?
Replication is achieved through Replica Sets, which are groups of MongoDB servers that
maintain the same data set. Replica Sets provide automatic failover, data redundancy, and
recovery options.
Primary: The primary node in a replica set is the main server that receives all write
operations. It accepts updates, inserts, and deletes, and replicates these changes to the
secondary nodes. Applications connect to the primary node for all write operations.
Secondary: Secondary nodes replicate data from the primary node. They hold read-only copies of the data and can be used for read operations, improving query performance. Secondary nodes help distribute the read load and act as backups in case the primary node fails.
When the primary node fails, one of the secondary nodes is automatically elected as the new
primary.
Replication in MongoDB is a process that ensures data is copied and maintained across multiple
servers. This helps in achieving high availability and data redundancy, meaning your data is safe
even if one server fails.
Simple Explanation:
Replication: The process of copying data from one MongoDB server (primary) to other servers
(secondaries).
Replica Set: A group of MongoDB servers that maintain the same data set. It includes one
primary node and multiple secondary nodes.
How It Works:
1. All writes go to the primary node and are recorded in its oplog.
2. Secondary nodes continuously copy and apply the oplog entries to stay in sync.
3. If the primary fails, the remaining nodes hold an election to choose a new primary.
Example:
Let’s say you have a users collection in your MongoDB database. You set up a replica set with
one primary and two secondary nodes.
Real-Life Example:
Imagine a popular e-commerce website. To ensure that the website remains available even
during server failures, the company uses MongoDB replication. They set up a replica set with
servers located in different geographical regions. This way, if one server fails due to a hardware
issue or a natural disaster, another server in a different location can take over, ensuring that
customers can still access the website and make purchases.
How Many Nodes in a Replica Set?
A minimal replica set has three nodes: one primary and two secondaries. However, a replica set can have up to 50 nodes, with a maximum of 7 voting members. The number of nodes can vary based on the need for redundancy, availability, and load balancing. For most production environments, a 3-node replica set is the most common setup.
4. Voting in Replication
In MongoDB replication, voting is part of the replica set's election process. When the primary
node becomes unavailable, an election is held to choose a new primary. Only the voting
members of the replica set participate in the election process.
Voting helps MongoDB ensure that there's a consistent primary node and that the replica set
remains operational.
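The arithmetic behind elections is simple: a candidate needs a strict majority of the voting members. A small Python sketch of that rule:

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect a primary: a strict majority of voting members."""
    return voting_members // 2 + 1

def failures_tolerated(voting_members: int) -> int:
    """How many voting members can be lost while an election is still possible."""
    return voting_members - majority(voting_members)

for n in (3, 5, 7):
    print(f"{n} voting members: majority = {majority(n)}, "
          f"tolerates {failures_tolerated(n)} failure(s)")
```

This is why odd-sized replica sets are recommended: going from 3 to 4 voting members raises the majority from 2 to 3 without tolerating any additional failures.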
Both GridFS and Sharding are MongoDB features used for handling large data, but they serve
different purposes:
GridFS: GridFS is a specification for storing and retrieving large files, such as images
or videos, in MongoDB. When a file exceeds the BSON document size limit (16MB),
GridFS splits the file into smaller chunks and stores each chunk as a separate document
in a fs.chunks collection, with metadata stored in a fs.files collection. GridFS is
ideal for storing large files and handling media storage within MongoDB.
Example Use Case: Storing and retrieving large media files such as videos or images.
Sharding: Sharding is a method for distributing data across multiple machines. It
allows MongoDB to scale horizontally by partitioning large datasets across multiple
servers (shards). Each shard holds a subset of the data, and MongoDB distributes
queries across all shards to balance the load.
Example Use Case: Distributing a large user database across multiple servers to handle
high volumes of read and write operations.
In summary, GridFS is used for storing large files, while Sharding is used for distributing
large datasets across multiple servers for scalability.
10. Sharding
What is Sharding?
Sharding is a method of distributing data across multiple servers to handle large
datasets and high throughput operations. It allows MongoDB to scale horizontally by
splitting data into smaller, more manageable pieces called shards.
Components of Sharding
1. Shards: Each shard holds a subset of the data. Shards are typically deployed as replica
sets for high availability.
2. mongos: Acts as a query router, directing client requests to the appropriate shard.
3. Config Servers: Store metadata and configuration settings for the cluster.
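Query routing by a hashed shard key can be sketched as follows (a toy model; real mongos routing uses chunk ranges stored on the config servers, and the shard names here are invented):

```python
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]

def route(shard_key_value: str) -> str:
    """Pick a shard from a hash of the shard key value
    (a toy stand-in for mongos routing with a hashed shard key)."""
    digest = hashlib.md5(shard_key_value.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always routes to the same shard, so a query on the
# shard key can be sent to exactly one shard instead of all of them.
print(route("user-42") == route("user-42"))  # True
```

Queries that do not include the shard key cannot be routed this way and must be broadcast to every shard, which is one reason shard key choice matters so much.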
Advantages:
1. Horizontal scalability: add servers to handle more data and traffic.
2. Higher throughput: reads and writes are spread across shards.
3. Larger datasets: no single server has to hold all the data.
Disadvantages:
1. Operational complexity: more servers and components to deploy and monitor.
2. The shard key is critical and difficult to change later.
3. Cross-shard queries and transactions are slower and more complex.
Sharding vs Replication
Sharding: Distributes data across multiple servers to handle large datasets and high
throughput. It focuses on horizontal scaling.
Replication: Duplicates data across multiple servers to ensure high availability and fault
tolerance. It focuses on data redundancy.
Sharding Best Practices
1. Choose the Right Shard Key: Select a shard key that evenly distributes data.
2. Monitor Performance: Regularly monitor the performance of your sharded cluster.
3. Plan for Growth: Design your sharding strategy with future growth in mind.
4. Use Indexes: Ensure proper indexing to optimize query performance.
CAP Theorem
The CAP Theorem states that a distributed database can only guarantee two out of three
properties at the same time: Consistency, Availability, and Partition
Tolerance. MongoDB is generally classified as a CP system: it favors consistency and partition tolerance, because all writes go through a single primary.
Capped Collections
Capped Collections are fixed-size collections that automatically overwrite the oldest data when they reach their size limit. They are useful for logging and caching scenarios.
How to Create a Capped Collection
db.createCollection("log", { capped: true, size: 100000, max: 5000 })
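The overwrite behavior can be modeled with a fixed-size buffer in Python:

```python
from collections import deque

# A capped collection behaves like a fixed-size ring buffer:
# once full, inserting a new document evicts the oldest one,
# and documents are kept in insertion order.
capped = deque(maxlen=3)
for i in range(5):
    capped.append({"logLine": i})

print(list(capped))  # entries 0 and 1 were overwritten by 3 and 4
```

As with real capped collections, there is no explicit delete step; eviction happens automatically on insert.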
11. GridFS
What is GridFS?
GridFS is a specification in MongoDB for storing and retrieving large files, such as images, videos,
and documents, that exceed the BSON document size limit of 16 MB. Instead of storing a file in a
single document, GridFS divides the file into smaller chunks and stores each chunk as a separate
document. This allows for efficient storage and retrieval of large files.
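The chunking idea can be sketched in Python (255 KB is the GridFS default chunk size; in MongoDB each chunk becomes an fs.chunks document with a sequence number n):

```python
CHUNK_SIZE = 255 * 1024  # GridFS default chunk size: 255 KB

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a file into numbered chunks, like fs.chunks documents."""
    return [
        {"n": i, "data": data[offset:offset + chunk_size]}
        for i, offset in enumerate(range(0, len(data), chunk_size))
    ]

payload = b"x" * (600 * 1024)  # a 600 KB "file"
chunks = split_into_chunks(payload)
print(len(chunks))  # 3 chunks: 255 KB + 255 KB + 90 KB

# Reading the file back means fetching chunks in order and concatenating:
reassembled = b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))
print(reassembled == payload)  # True
```

Because each chunk is addressable by its sequence number, a range of a large file can be read without loading the whole file into memory.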
GridFS: Used for storing large files by breaking them into smaller chunks. It is ideal for files that
exceed the 16 MB limit and allows for partial file retrieval without loading the entire file into
memory.
Sharding: Distributes data across multiple servers to handle large datasets and high throughput. It
improves performance and scalability by dividing the data into smaller, more manageable pieces.
Transactions in MongoDB
Transactions in MongoDB allow you to group multiple read and write operations into a
single, atomic operation. This means that either all operations in the transaction succeed,
or none do. Transactions ensure data consistency and are useful for complex operations
that span multiple documents or collections.
ACID Compliance
ACID stands for Atomicity, Consistency, Isolation, and Durability:
Atomicity: All operations in a transaction succeed, or none do.
Consistency: A transaction moves the database from one valid state to another.
Isolation: Concurrent transactions do not interfere with each other.
Durability: Once committed, changes survive server failures.
Batch Sizing
Batch Sizing in MongoDB controls the number of documents returned in each batch of a
query response. Adjusting the batch size can optimize performance:
Large Batch Size: Reduces the number of network round trips but uses more memory.
Small Batch Size: Uses less memory but increases the number of network round trips.
Upsert Operations
An Upsert operation in MongoDB is a combination of update and insert. If a document
matching the query criteria exists, it updates the document. If no matching document is
found, it inserts a new document. This is useful for ensuring that data is always up-to-date
without needing separate insert and update logic.
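Upsert logic can be sketched over an in-memory list of documents (a toy model of updateOne with upsert: true, not driver code):

```python
def upsert(collection: list, filter_: dict, update: dict) -> None:
    """Update the first document matching filter_, or insert a new one
    built from filter_ plus the update (like upsert: true)."""
    for doc in collection:
        if all(doc.get(k) == v for k, v in filter_.items()):
            doc.update(update)  # matched: apply the $set-style update
            return
    collection.append({**filter_, **update})  # no match: insert a new document

people = [{"name": "Alice", "age": 25}]
upsert(people, {"name": "Alice"}, {"age": 26})    # match found: Alice is updated
upsert(people, {"name": "Charlie"}, {"age": 28})  # no match: Charlie is inserted
print(people)
```

Note that on insert, MongoDB likewise folds the equality conditions from the filter into the new document, which is why Charlie ends up with a name field.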
Backup and Restore
Backup: Use the mongodump command to create a backup of your MongoDB database.
o mongodump --db mydatabase --out /backup/directory
Restore: Use the mongorestore command to restore a MongoDB database from a backup.
o mongorestore --db mydatabase /backup/directory/mydatabase
Backup Best Practices
1. Follow the 3-2-1 Rule: Keep three copies of your data, two on different storage devices, and one off-site.
2. Automate Backups: Schedule regular backups to avoid forgetting.
3. Test Your Backups: Regularly test your backups to ensure they can be restored successfully.
4. Use Encryption: Encrypt your backups to protect sensitive data.
5. Monitor Backup Processes: Continuously monitor your backup processes to detect and resolve
issues promptly.
Restore Best Practices
1. Document Your Restore Procedures: Have clear, documented procedures for restoring data.
2. Test Restores Regularly: Regularly test your restore process to ensure it works as expected.
3. Verify Data Integrity: After restoring, verify the integrity and consistency of the data.
4. Minimize Downtime: Plan your restore process to minimize downtime and impact on users.
5. Keep Backup Logs: Maintain logs of backup and restore operations for auditing and
troubleshooting.
Role-Based Access Control (RBAC)
RBAC: Assigns roles to users, and each role has specific permissions. Roles can be built-in (like readWrite, dbAdmin) or custom-defined.
Roles: Control access to database resources and operations. Users can have multiple roles, and roles can inherit permissions from other roles.
Encryption
1. Encrypt Data at Rest: Use MongoDB’s built-in encryption for data stored on disk. This requires MongoDB Enterprise or MongoDB Atlas.
2. Encrypt Data in Transit: Enable TLS/SSL to encrypt data as it travels over the network.
3. Client-Side Field Level Encryption: Encrypt sensitive fields on the client side before sending them to the server.
4. Key Management: Use a secure Key Management System (KMS) to store and manage encryption keys.
Query Optimization
Query Optimization involves refining queries to reduce execution time and resource
consumption. This can be achieved by:
Using indexes: Ensure queries use indexes to avoid full collection scans.
Avoiding unnecessary data retrieval: Only fetch the fields you need.
Optimizing joins and aggregations: Simplify complex queries and use efficient join
operations.
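The first two points can be combined in the shell; a sketch assuming a hypothetical users collection:

```javascript
// Index supporting the query below (create once)
db.users.createIndex({ city: 1, age: 1 });

// The filter is index-backed, and the projection fetches only
// the fields needed instead of whole documents.
db.users.find(
  { city: "Austin", age: { $gte: 21 } },
  { _id: 0, name: 1, age: 1 }
);
```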
Caching Strategies
Caching stores frequently accessed data in a temporary storage area to reduce access time.
Common caching strategies include:
Cache-Aside: The application checks the cache first before querying the database.
Read-Through: The cache automatically loads data from the database on a cache miss.
Write-Through: Data is written to the cache and the database simultaneously.
Write-Back: Data is written to the cache first and then asynchronously to the database.
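The cache-aside pattern can be sketched in a few lines of JavaScript; the in-memory Map standing in for the database and the user data are illustrative:

```javascript
// Cache-aside sketch: the application checks the cache before the "database".
// fakeDb stands in for a real datastore; names and data are illustrative.
const fakeDb = new Map([[1, { name: "Ada" }], [2, { name: "Grace" }]]);
const cache = new Map();
let dbReads = 0; // counts trips to the backing store

function getUser(id) {
  if (cache.has(id)) return cache.get(id);     // cache hit: no DB access
  const user = fakeDb.get(id);                 // cache miss: query the DB
  dbReads += 1;
  if (user !== undefined) cache.set(id, user); // populate cache for next time
  return user;
}

getUser(1); // miss -> reads the DB once
getUser(1); // hit  -> served from the cache
```

After the two calls above, only one database read has occurred; repeated lookups for the same key are served from the cache.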
Load Balancing
Load Balancing distributes incoming network traffic across multiple servers so that no single server becomes overwhelmed. This improves application performance and reliability by preventing bottlenecks and by providing redundancy if a server fails.
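A simple round-robin balancer, one common distribution strategy, can be sketched in JavaScript; the server names are illustrative:

```javascript
// Round-robin load balancing: rotate through the servers in order
// so requests are spread evenly across them.
const servers = ["db-node-1", "db-node-2", "db-node-3"];
let next = 0;

function pickServer() {
  const server = servers[next];
  next = (next + 1) % servers.length; // wrap around after the last server
  return server;
}
```

Real load balancers add health checks and weighting, but the core idea is this rotation.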
Performance Best Practices
1. Keep Statistics Up to Date: Ensure database statistics are current to generate optimal execution plans.
2. Avoid Leading Wildcards: Leading wildcards in pattern queries force full collection scans, which are slow.
3. Use Constraints: Constraints help the database optimizer create better execution plans.
4. Avoid SELECT *: Only retrieve the fields you need to reduce data transfer and processing time; in MongoDB, use projections instead of fetching whole documents.
5. Monitor and Analyze: Regularly monitor performance metrics and analyze slow queries to identify bottlenecks.
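For analyzing slow queries, MongoDB's explain() reports how a query actually executed; a sketch against a hypothetical orders collection:

```javascript
// "executionStats" shows the winning plan, documents examined
// vs. returned, and execution time.
db.orders.find({ status: "pending" }).explain("executionStats");

// A COLLSCAN stage with a high docsExamined count usually
// indicates a missing index, e.g.:
db.orders.createIndex({ status: 1 });
```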
16. MongoDB Atlas
Atlas Clustering
Atlas Clustering involves creating clusters that can be either replica sets or sharded
clusters:
Replica Sets: Provide high availability and redundancy by replicating data across multiple
nodes.
Sharded Clusters: Distribute data across multiple shards to handle large datasets and high
throughput.
Atlas Security Features
1. Encryption in Transit: Uses TLS/SSL to encrypt data as it travels over the network.
2. Encryption at Rest: Encrypts data stored on disk to protect it from unauthorized access.
3. IP Access List: Restricts database access to specified IP addresses.
4. User Authentication and Authorization: Uses Role-Based Access Control (RBAC) to
manage permissions.
5. Network Isolation: Supports Virtual Private Cloud (VPC) peering and private endpoints for
secure network configurations.
6. Auditing: Tracks and logs database events for monitoring and compliance.
Schema Validation
Schema validation lets you enforce rules on the structure of documents in a collection. For example:
db.createCollection("students", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["name", "age", "gpa"],
      properties: {
        name: {
          bsonType: "string",
          description: "must be a string and is required"
        },
        age: {
          bsonType: "int",
          minimum: 0,
          description: "must be an integer greater than or equal to 0"
        },
        gpa: {
          bsonType: "double",
          minimum: 0,
          maximum: 4,
          description: "must be a double between 0 and 4"
        }
      }
    }
  }
});
This rule ensures that every document in the students collection has
a name (string), age (integer), and gpa (double between 0 and 4).
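With that rule in place, inserts that violate the schema are rejected by the server; a sketch with illustrative values:

```javascript
// Passes validation: all required fields present with the right types
db.students.insertOne({ name: "Asha", age: NumberInt(20), gpa: 3.6 });

// Fails validation: gpa exceeds the allowed maximum of 4, so the
// server rejects it with a "Document failed validation" error
db.students.insertOne({ name: "Ravi", age: NumberInt(22), gpa: 4.5 });
```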
Validation Best Practices
1. Start Simple: Begin with basic validation rules and gradually add complexity as needed.
2. Use Descriptive Messages: Include descriptions in your validation rules to provide clear error
messages.
3. Test Regularly: Regularly test your validation rules to ensure they work as expected.
4. Combine with Application-Level Validation: While MongoDB’s validation provides a safety net,
also validate data at the application level for more control.
5. Monitor and Adjust: Continuously monitor the effectiveness of your validation rules and adjust
them based on your application’s needs.
CAP Theorem
The CAP Theorem states that in a distributed database system, you can only achieve two
out of the following three guarantees at the same time:
Consistency: Every read receives the most recent write.
Availability: Every request receives a response, even if it’s not the most recent.
Partition Tolerance: The system continues to operate despite network partitions.
Data Redundancy
Data Redundancy refers to the practice of storing the same piece of data in multiple
places. This can be intentional for backup and recovery purposes or accidental due to
inefficient data management. While redundancy can improve data availability and fault
tolerance, it can also lead to data inconsistency and increased storage costs if not managed
properly.
Clustered Collections
Clustered Collections in MongoDB store documents ordered by a clustered index
key. This means that the documents are physically stored in the order of the index key,
which can improve query performance for range queries and equality comparisons on the
clustered index key.
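Clustered collections are declared at creation time; a sketch with an illustrative collection name (requires MongoDB 5.3 or later):

```javascript
// Documents are physically stored in _id order, so range scans
// and equality lookups on _id avoid a separate index traversal.
db.createCollection("orders", {
  clusteredIndex: { key: { _id: 1 }, unique: true, name: "orders_clustered" }
});
```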
Materialized Views
A Materialized View is a database object that contains the results of a query. Unlike
regular views, which are virtual and recomputed each time they are accessed, materialized
views store the query results physically. This can significantly improve query performance,
especially for complex queries that are frequently executed.
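In MongoDB, on-demand materialized views are typically built with the $merge aggregation stage; a sketch assuming a hypothetical sales collection:

```javascript
// Recompute per-region totals and write them into a separate
// collection that acts as the materialized view; rerun this
// pipeline whenever the view should be refreshed.
db.sales.aggregate([
  { $group: { _id: "$region", total: { $sum: "$amount" } } },
  { $merge: { into: "salesByRegion", whenMatched: "replace", whenNotMatched: "insert" } }
]);
```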
Decrement Operations
Decrement Operations in MongoDB are used to decrease the value of a field. This can be
done using the $inc operator with a negative value. For example:
db.collection.updateOne(
  { _id: 1 },
  { $inc: { count: -1 } }
);
This command decreases the count field by 1.
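A common refinement is to guard the decrement so a counter never drops below zero by putting the condition in the filter; field names here are illustrative:

```javascript
// Matches only while count is still positive, so the decrement
// can never push the counter below zero.
db.collection.updateOne(
  { _id: 1, count: { $gt: 0 } },
  { $inc: { count: -1 } }
);
```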
Alternatives to MongoDB
Cassandra
Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large
amounts of data across many commodity servers without a single point of failure. It is known for
its high availability, fault tolerance, and linear scalability.
Redis
Redis (Remote Dictionary Server) is an in-memory data structure store used as a database, cache,
and message broker. It supports various data structures such as strings, hashes, lists, sets, and
more. Redis is known for its high performance and low latency.
DynamoDB
Amazon DynamoDB is a fully managed NoSQL database service provided by AWS. It offers fast
and predictable performance with seamless scalability. DynamoDB is designed for applications that
require consistent, single-digit millisecond latency at any scale.
HBase
Apache HBase is an open-source, distributed, scalable, and NoSQL database modeled after
Google’s Bigtable. It is designed to handle large amounts of sparse data and is built on top of the
Hadoop Distributed File System (HDFS). HBase is known for its strong consistency and random,
real-time read/write access.
OrientDB
OrientDB is a multi-model NoSQL database that supports graph, document, key-value, and object
models. It is designed to be highly scalable and efficient, combining the flexibility of document
databases with the power of graph databases.
Scaling in MongoDB
Scaling in MongoDB is essential for handling increasing data volumes, user traffic, and processing demands. There are two main methods for scaling MongoDB: vertical scaling and horizontal scaling.
Vertical Scaling (Scaling Up)
Definition: Increasing the capacity of a single server by adding more resources (CPU, RAM, storage).
Use Case: Suitable for applications with moderate growth where a single server can handle the
increased load.
Example: Upgrading your server from 16GB RAM to 32GB RAM to handle more queries and data.
Horizontal Scaling (Scaling Out)
Definition: Adding more servers to distribute the load and data across multiple machines.
Use Case: Ideal for applications with significant growth, requiring more resources than a single
server can provide.
Techniques:
o Replication: Creating copies of the database on multiple servers to ensure high availability and
fault tolerance.
o Sharding: Distributing data across multiple servers (shards) to balance the load and improve
performance.
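Horizontal scaling via sharding is enabled per database and per collection; a sketch with illustrative names:

```javascript
// Run through mongos. Enable sharding for the database, then shard
// the collection on a hashed key to spread writes evenly across shards.
sh.enableSharding("mydatabase");
sh.shardCollection("mydatabase.users", { userId: "hashed" });
```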