Group Data with the Subset Pattern
MongoDB keeps frequently accessed data, referred to as the working set, in RAM. When the working set of data and indexes grows beyond the physical RAM allotted, performance degrades because MongoDB must read data from disk instead of RAM.
To solve this problem, you can shard your collection. However, sharding can create additional costs and complexities that your application may not be ready for. Rather than sharding your collection, you can reduce the size of your working set by using the subset pattern.
The subset pattern is a data modeling technique for scenarios where a document contains a large array of items, but you frequently need to access only a small subset of those items. In this case, the size of the documents can cause the working set to exceed the available RAM. The subset pattern helps optimize performance by reducing the amount of data that needs to be read from the database for common queries.
About this Task
Consider an e-commerce site that has a list of reviews for a product, stored in a collection called products. The e-commerce site inserts documents with the following schema into the products collection:
db.collection('products').insertOne(
  {
    _id: ObjectId("507f1f77bcf86cd99338452"),
    name: "Super Widget",
    description: "This is the most useful item in your toolbox.",
    price: { value: NumberDecimal("119.99"), currency: "USD" },
    reviews: [
      {
        review_id: 786,
        review_author: "Kristina",
        review_text: "This is indeed an amazing widget.",
        published_date: ISODate("2019-02-18")
      },
      {
        review_id: 785,
        review_author: "Trina",
        review_text: "Very nice product, slow shipping.",
        published_date: ISODate("2019-02-17")
      },
      [...],
      {
        review_id: 1,
        review_author: "Hans",
        review_text: "Meh, it's ok.",
        published_date: ISODate("2017-12-06")
      }
    ]
  }
)
When accessing a product’s data, you likely only need the most recent reviews. The following procedure demonstrates how to apply the subset pattern to the above schema.
Steps
Separate the subset into different collections.
Instead of storing all the reviews with the product, split your collection into two collections: one for your most accessed data, and one for your least accessed data. This allows for quick access to the most relevant data without having to load the entire array.
The first collection, the products collection, contains the most frequently used data, such as current reviews:
db.collection('products').insertOne(
  {
    _id: ObjectId("507f1f77bcf86cd99338452"),
    name: "Super Widget",
    description: "This is the most useful item in your toolbox.",
    price: { value: NumberDecimal("119.99"), currency: "USD" },
    reviews: [
      {
        review_id: 786,
        review_author: "Kristina",
        review_text: "This is indeed an amazing widget.",
        published_date: ISODate("2019-02-18")
      },
      [...],
      {
        review_id: 776,
        review_author: "Pablo",
        review_text: "Amazing!",
        published_date: ISODate("2019-02-15")
      }
    ]
  }
)
The products collection contains only the ten most recent reviews. This reduces the working set by only loading in a portion, or a subset, of the overall data.
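One way to keep the embedded array limited to ten entries as new reviews arrive is a $push update that combines the $each, $sort, and $slice modifiers. The following is a minimal sketch rather than part of the procedure itself: the helper name buildAddReviewOps and its maxEmbedded parameter are hypothetical, and it assumes every review carries a published_date field as in the schema above.

```javascript
// Hypothetical helper: builds the two writes needed when a new review arrives.
// It only constructs plain objects, so it can be inspected without a database.
function buildAddReviewOps(productId, review, maxEmbedded = 10) {
  return {
    // Full copy of the review, destined for the reviews collection.
    reviewDoc: { ...review, product_id: productId },
    // Filter and update document for the products collection.
    productFilter: { _id: productId },
    productUpdate: {
      $push: {
        reviews: {
          $each: [review],               // append the new review
          $sort: { published_date: -1 }, // order the array newest first
          $slice: maxEmbedded            // keep only the first maxEmbedded entries
        }
      }
    }
  };
}

// Usage with a connected driver handle (assumed here to be named db):
//   const ops = buildAddReviewOps(productId, newReview);
//   await db.collection('reviews').insertOne(ops.reviewDoc);
//   await db.collection('products').updateOne(ops.productFilter, ops.productUpdate);
```

Writing to the reviews collection first means that an interrupted operation leaves the full history intact, although the two writes are still not atomic unless you wrap them in a transaction.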
The second collection, the reviews collection, contains less frequently used data, such as old reviews:
db.collection('reviews').insertMany(
  [
    {
      review_id: 786,
      review_author: "Kristina",
      review_text: "This is indeed an amazing widget.",
      product_id: ObjectId("507f1f77bcf86cd99338452"),
      published_date: ISODate("2019-02-18")
    },
    {
      review_id: 785,
      review_author: "Trina",
      review_text: "Very nice product, slow shipping.",
      product_id: ObjectId("507f1f77bcf86cd99338452"),
      published_date: ISODate("2019-02-17")
    },
    [...],
    {
      review_id: 1,
      review_author: "Hans",
      review_text: "Meh, it's ok.",
      product_id: ObjectId("507f1f77bcf86cd99338452"),
      published_date: ISODate("2017-12-06")
    }
  ]
)
You can access the reviews collection whenever you need to see additional reviews. When considering where to split your data, store the most used fields of your documents in your main collection and the less frequently used data in a new collection.
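If you are starting from documents that already embed every review, the split described above can be sketched as a pure function. This is an illustrative assumption, not a prescribed migration: the name splitProduct and the subsetSize parameter are hypothetical, and the code assumes published_date holds JavaScript Date values, which is what ISODate produces.

```javascript
// Hypothetical helper: splits one full product document into
//   productDoc: the product with only its newest reviews embedded
//   reviewDocs: one standalone document per review, for the reviews collection
function splitProduct(product, subsetSize = 10) {
  const { reviews = [], ...rest } = product;
  // Copy before sorting so the input document is not mutated.
  const newestFirst = [...reviews].sort(
    (a, b) => b.published_date - a.published_date
  );
  return {
    productDoc: { ...rest, reviews: newestFirst.slice(0, subsetSize) },
    // Every review is kept in full, tagged with the product it belongs to.
    reviewDocs: newestFirst.map((r) => ({ ...r, product_id: product._id }))
  };
}
```

As in the example documents, the newest reviews appear in both outputs, so the reviews collection holds the complete history while the products collection holds only the subset.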
Results
By using smaller documents with more frequently accessed data, you reduce the overall size of the working set. Because more of that data fits in RAM, your application needs fewer disk reads to retrieve the information it uses most often.
Note
The subset pattern requires you to manage two collections rather than one, and to query both collections when you need comprehensive information on a document rather than only the subset.
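To illustrate that trade-off, the following sketch gathers a product together with its complete review history. It assumes the MongoDB Node.js driver, where db is a connected Db handle. The function name getProductWithAllReviews is hypothetical.

```javascript
// Hypothetical helper: returns the product plus its complete review history.
// Assumes db is a connected Db handle from the MongoDB Node.js driver.
async function getProductWithAllReviews(db, productId) {
  // First query: the product document with its embedded subset of reviews.
  const product = await db.collection('products').findOne({ _id: productId });
  if (product === null) return null;

  // Second query: every review for this product, newest first.
  const allReviews = await db
    .collection('reviews')
    .find({ product_id: productId })
    .sort({ published_date: -1 })
    .toArray();

  // Replace the embedded subset with the full list.
  return { ...product, reviews: allReviews };
}
```

The common read path needs only the first query. The second runs only when the application asks for more than the embedded subset.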