Unit V: DBMS
Data on External Storage – RAID – File Organizations – Indexing and Hashing – Trees – B+ Tree
and B-Tree index files. Hashing: Static – Dynamic. Query Processing and Query Optimization –
Introduction to NoSQL & MongoDB: Advantages, Architecture, Data Models, MongoDB
Datatypes and CRUD Operations.
The disk space manager is responsible for keeping track of available disk space.
The file manager, which provides the abstraction of a file of records to higher levels of
DBMS code, issues requests to the disk space manager to obtain and relinquish space on disk.
The records in a database are stored in file formats. Physically, the data is stored in
electromagnetic format on a device. The electromagnetic devices used in database systems
for data storage are classified as follows:
Primary Memory: The primary memory of a server is the type of data storage that is
directly accessible by the central processing unit, meaning that it does not require any
other device in order to be read. To function reliably, primary memory depends on a stable
electric power supply, the hardware backup system, the supporting devices, the coolant
that moderates the system temperature, and so on. These devices are considerably smaller
in capacity and they are volatile. In terms of performance and speed, primary memory
devices are the fastest devices, and this speed comes at the cost of capacity. Primary
memory devices are usually more expensive due to their increased speed and performance.
Secondary Memory: Secondary storage devices, as the name suggests, are devices used to
store data that will be needed at a later point in time for various purposes or database
actions. Therefore, these types of storage systems are sometimes called backup units as
well. Devices that are plugged in or connected externally fall under this memory category,
unlike primary memory, which is part of the CPU. This group of devices is noticeably larger
in capacity than the primary devices and smaller than the tertiary devices. Data can be held
on them as long as it is needed and deleted once the user is done with it. Compared to
primary storage devices as well as tertiary devices, secondary storage devices are slower
to access.
Secondary storage usually has a higher capacity than primary storage systems, but this
keeps changing as technology expands every day. Secondary storage nowadays consists of
magnetic disks, and in earlier times of optical discs such as DVDs or CDs. The steady
development of technology has brought about modern devices that make it easier for users
to handle multiple devices. In addition to portable hard disks and reusable flash drives,
peripheral devices are equipped with USB ports so that they can be used as secondary
storage through plug-and-play.
RAID is a network of redundant storage devices in which one device compensates for the
failure of another device in the chain. It uses techniques such as arranging data across an
array (striping), mirroring, error-correcting codes, and splitting data across multiple disks
to ensure that data flows smoothly. RAID levels range from RAID 0, RAID 1, RAID 2, and so
on; these levels are defined by the degree of redundancy in the stored data.
Tertiary Memory: Tertiary memory refers to devices that can hold a large amount of data
without being constantly connected to the server or the peripherals. A device of this type
is connected externally to a server or to a device where the database is stored. Because
tertiary storage provides more space than other types of device memory but performs the
slowest, it costs less than primary and secondary storage. This type of storage is commonly
used for making backup copies of servers and databases. As with secondary devices, the
contents of tertiary devices can be deleted and the devices reused.
Memory Hierarchy:
A computer system has a hierarchy of memory. The CPU has direct access to its main memory
and inbuilt registers. Accessing main memory takes more time than a CPU cycle, so cache
memory is introduced to bridge this speed gap. Data that is most frequently accessed by the
CPU resides in cache memory, which provides the fastest access time to data. The
fastest-accessing memory is also the most expensive. Although large storage devices are
slower and less expensive than CPU registers and cache memory, they can store a greater
amount of data.
Magnetic Disks: Present-day computer systems use hard disk drives as secondary storage
devices. Magnetic disks store information using the concept of magnetism. Hard disks are
metal disks coated with magnetizable material, held on a spindle. As the read/write head
moves over a disk, it magnetizes or de-magnetizes the spots under it; a magnetized spot
represents either 0 (zero) or 1 (one). A formatted hard disk stores data efficiently in a
defined order: each platter is divided into many concentric circles, called tracks, and each
track contains a number of sectors. Data on a hard disk is typically stored in sectors of
512 bytes.
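As a quick worked example of this geometry (all figures are assumed for illustration, not taken from a real drive), the raw capacity of a disk is simply the product of its dimensions:

# Raw disk capacity from its geometry (figures assumed, for illustration only).
surfaces = 4              # recording surfaces (2 per platter)
tracks_per_surface = 50_000
sectors_per_track = 500
bytes_per_sector = 512    # the typical sector size mentioned above

capacity_bytes = surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector
print(f"Raw capacity: {capacity_bytes / 10**9:.1f} GB")   # 51.2 GB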
Primary Memory:
Memory is an essential part of a computer system, used to store information either for
instant use or permanently. Based on how it works, computer memory is divided into two
types:
Volatile Memory (RAM)
Non-volatile Memory (ROM)
Before understanding ROM, we will first look at what volatile and non-volatile memory
mean. Non-volatile memory is a type of computer memory that retains stored information
even when power is removed. It is less expensive than volatile memory and has a large
storage capacity. ROM (read-only memory) and flash memory are examples of non-volatile
memory. Volatile memory, in contrast, is temporary: the data is retained only as long as
the system is powered, and once the power is turned off the contents of volatile memory
are lost automatically. RAM is an example of volatile memory.
ROM stands for Read-Only Memory. It is a non-volatile memory used to store important
information needed to operate the system. As its name suggests, we can only read the
programs and data stored on it. It is also a primary memory unit of the computer system.
It contains electronic fuses that can be programmed for a specific piece of information.
The information in ROM is stored in binary format. It is also known as permanent memory.
Difference between PROM and EPROM:
The data stored in PROM is permanent and cannot be changed or erased, whereas an EPROM
can be erased, reprogrammed and reused multiple times.
PROM is less expensive than EPROM; EPROM is more expensive than PROM.
EPROM is more flexible than PROM, since it can be reprogrammed.
Memory
In order to save data and instructions, memory is required. Memory is divided into cells
stored in the storage space of the computer, and every cell has its own unique
location/address. Memory is essential for a computer, as it is what makes the machine
somewhat similar to a human brain.
In human brains, there are different ways of keeping a memory, like short-term memory,
long-term memory, implicit memory, etc. Likewise, in computers, there are different types of
memories or different ways of saving memories. They are Cache memory, Primary memory/
Main memory, Secondary memory.
Types of Memory
There are three types of memory. Cache memory is a high-speed memory that helps speed up
the CPU; it has the lowest access time but is very expensive. The next type is Main memory
or Primary memory, which is used to hold the data currently being worked on; it consists of
RAM and ROM, where RAM is volatile and ROM is non-volatile in nature. The third type is
Secondary memory, which is non-volatile in nature and is used to store data permanently
in a computer.
RAM (Random Access Memory)
It is one part of Main memory, also widely known as Read/Write Memory. Random Access
Memory is present on the motherboard, and the computer's data is temporarily stored in
RAM. As the name says, RAM supports both reading and writing. RAM is a volatile memory,
which means its contents exist only as long as the computer is in the ON state; as soon as
the computer turns OFF, the memory is erased.
To better understand RAM, imagine the blackboard of a classroom: the students can both
read and write on it, and the data written can be erased after the class is over so that
new data can be entered.
Features of RAM:
RAM is volatile in nature, which means the data is lost when the device is switched off.
RAM is known as the Primary memory of the computer.
RAM is expensive, since its memory locations can be accessed directly.
RAM is the fastest memory, and therefore serves as the internal memory of the computer.
The speed of a computer depends on its RAM: if the computer has less RAM, it takes more
time to load and the computer slows down.
Types of RAM
RAM is further divided into two types: SRAM (Static Random Access Memory) and DRAM
(Dynamic Random Access Memory). Let's learn about both of these types in more detail.
SRAM (Static Random Access Memory)
SRAM is used for Cache memory; it can hold the data as long as power is available. It is
made with CMOS technology, contains 4 to 6 transistors, and also uses clocks. Because of
these transistors it does not require a periodic refresh cycle to retain its contents.
Although SRAM is faster, it requires more power and is more expensive in nature. Since
SRAM draws more power, more heat is dissipated as well. Another drawback of SRAM is that
it stores fewer bits per chip: for the same amount of memory, SRAM requires more chips
than DRAM.
Function of SRAM: The function of SRAM is that it provides a direct interface with the Central
Processing Unit at higher speeds.
Secondary Memory
In a computer, memory refers to the physical devices that are used to store programs or
data on a temporary or permanent basis. It is a group of registers. Memory is of two types:
(i) primary memory and (ii) secondary memory. Primary memory is made up of semiconductors
and is divided into two types, Read-Only Memory (ROM) and Random Access Memory (RAM).
Secondary memory is a physical device for the permanent storage of programs and data (hard
disk, compact disc, flash drive, etc.).
Primary Memory
Primary memory is made up of semiconductors and it is the main memory of the computer
system. It is generally used to store the data or information on which the computer is
currently working, so we can say that it is used to store data temporarily. Data or
information is lost when the system is switched off. It is divided into two types:
(i) Read-Only Memory (ROM)
(ii) Random Access Memory (RAM)
1. Random Access Memory: Primary memory is also called internal memory. This is the main
area in a computer where data, instructions, and information are stored. Any storage location
in this memory can be directly accessed by the Central Processing Unit. As the CPU can
randomly access any storage location in this memory, it is also called Random Access Memory
or RAM. The CPU can access data from RAM as long as the computer is switched on. As soon
as the power to the computer is switched off, the stored data and instructions disappear from
RAM. Such type of memory is known as volatile memory. RAM is also called read/write
memory.
2. Read-Only Memory: Read-Only Memory (ROM) is a type of primary memory from which
information can only be read, hence the name. ROM can be directly accessed by the Central
Processing Unit. But the data and instructions stored in ROM are retained even when the
computer is switched off; in other words, it holds the data after being switched off. Such
a type of memory is known as non-volatile memory.
Secondary Memory
We have seen so far that primary memory is volatile and has limited capacity. So, it is
important to have another form of memory that has a larger storage capacity and from which
data and programs are not lost when the computer is turned off. Such a type of memory is
called secondary memory, and programs and data are stored in it. It is also called
auxiliary memory. It differs from primary memory in that it is not directly accessible by
the CPU and is non-volatile. Secondary or external storage devices have a much larger
storage capacity, and the cost of secondary memory is lower compared to primary memory.
Secondary memory is used for different purposes but the main purposes of using secondary
memory are:
Permanent storage: As we know, primary memory stores data only while the power supply
is on and loses it when the power is off. So we need secondary memory to store data
permanently, even when the power supply is off.
Large Storage: Secondary memory provides large storage space, so we can permanently store
large data such as videos, images, audio, files, etc.
Portable: Some secondary devices are removable, so we can easily store or transfer data
from one computer or device to another.
4. Blu-ray Disc: A Blu-ray disc looks just like a CD or a DVD, but it can store up to
25 GB of data. To use a Blu-ray disc, you need a Blu-ray reader. The name Blu-ray is
derived from the technology used to read the disc: 'Blu' from the blue-violet laser and
'ray' from an optical ray.
5. Hard Disk: A hard disk is part of a unit called a hard disk drive. It is used to store
a large amount of data. Hard disks or hard disk drives come in different storage capacities
(such as 256 GB, 500 GB, 1 TB, and 2 TB). A drive is built from a collection of discs known
as platters, placed one below the other and coated with magnetic material. Each platter
consists of a number of invisible concentric circles with the same centre, called tracks.
Hard disks are of two types: (i) internal hard disks and (ii) external hard disks.
6. Flash Drive: A flash drive or pen drive comes in various storage capacities, such as
1 GB, 2 GB, 4 GB, 8 GB, 16 GB, 32 GB, 64 GB, up to 1 TB. A flash drive is used to transfer
and store data. To use a flash drive, we plug it into a USB port on a computer. As a flash
drive is easy to use and compact in size, it is very popular nowadays.
7. Solid-State Drive: Also known as an SSD, it is a non-volatile storage device used to
store and access data. It is faster, operates silently (because it contains no moving
parts, unlike a hard disk), consumes less power, etc. It is a great replacement for
standard hard drives in computers and laptops, though it costs more, and it is also
suitable for tablets, notebooks, etc., because they do not require large storage.
8. SD Card: Known as a Secure Digital Card, it is generally used in portable devices like
mobile phones and cameras to store data. It is available in different sizes such as 1 GB,
2 GB, 4 GB, 8 GB, 16 GB, 32 GB, and 64 GB. To view the data stored in an SD card, you can
remove it from the device and insert it into a computer with the help of a card reader.
The data in an SD card is stored in memory chips (present in the SD card), and it does not
contain any moving parts like a hard disk.
1. RAID-0 (Striping)
Blocks are "striped" across disks.
Evaluation
Reliability: 0
There is no duplication of data. Hence, a block once lost cannot be recovered.
Capacity: N*B
The entire space is being used to store data. Since there is no duplication, N disks each
having B blocks are fully utilized.
Advantages
1. It is easy to implement.
2. It utilizes the storage capacity in a better way.
Disadvantages
1. A single drive loss can result in the complete failure of the system.
2. Not a good choice for a critical system.
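The block-to-disk mapping behind striping can be sketched in a few lines of Python (a simplified model for illustration, not a real controller):

# RAID-0: logical block i is "striped" across N disks round-robin.
def raid0_locate(logical_block: int, num_disks: int) -> tuple[int, int]:
    disk = logical_block % num_disks        # which disk holds the block
    offset = logical_block // num_disks     # block position on that disk
    return disk, offset

# Logical blocks 0..7 on 4 disks land on disks 0,1,2,3,0,1,2,3.
for b in range(8):
    print(b, raid0_locate(b, 4))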
2. RAID-1 (Mirroring)
More than one copy of each block is stored in a separate disk. Thus, every block has two (or
more) copies, lying on different disks.
4. RAID-3 (Byte-Level Striping with Dedicated Parity)
Here Disk 3 contains the parity bits for Disk 0, Disk 1, and Disk 2. If data loss occurs,
we can reconstruct it using Disk 3.
Advantages
1. Data can be transferred in bulk.
2. Data can be accessed in parallel.
Disadvantages
1. It requires an additional drive for parity.
2. In the case of small-size files, it performs slowly.
5. RAID-4 (Block-Level Striping with Dedicated Parity)
Instead of duplicating data, this adopts a parity-based approach.
Assume that in the above figure, C3 is lost due to some disk failure. Then, we can
recompute the data bit stored in C3 by looking at the values of all the other columns and
the parity bit. This allows us to recover lost data.
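The parity trick described here is plain XOR. A minimal Python sketch (block contents assumed for illustration) of storing parity on a dedicated disk and recovering a failed disk's block:

# Dedicated-parity recovery (RAID-3/RAID-4 style), sketched with XOR on byte blocks.
def xor_bytes(*blocks: bytes) -> bytes:
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

d0, d1, d2 = b"data-0", b"data-1", b"data-2"
parity = xor_bytes(d0, d1, d2)          # stored on the dedicated parity disk

# If disk 1 fails, its block is the XOR of the surviving blocks and the parity.
recovered = xor_bytes(d0, d2, parity)
assert recovered == d1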
Evaluation
Reliability: 1
RAID-4 allows recovery of at most 1 disk failure (because of the way parity works). If more
than one disk fails, there is no way to recover the data.
Capacity: (N-1)*B
One disk in the system is reserved for storing the parity. Hence, (N-1) disks are made
available for data storage, each disk having B blocks.
Advantages
1. It helps in reconstructing the data if at most one disk fails.
Disadvantages
1. It cannot help in reconstructing data when more than one disk fails.
6. RAID-5 (Block-Level Striping with Distributed Parity)
This is a slight modification of the RAID-4 system where the only difference is that the parity
rotates among the drives.
7. RAID-6 (Block-Level Striping with Two Parity Bits)
RAID-6 extends RAID-5 by keeping two parity blocks instead of one, so it can survive the
failure of two disks.
Advantages
1. Very high data Accessibility.
2. Fast read data transactions.
Disadvantages
1. Due to double parity, it has slow write data transactions.
2. Extra space is required.
Advantages of RAID
1. Increased data reliability: RAID provides redundancy, which means that if one disk fails, the
data can be recovered from the remaining disks in the array. This makes RAID a reliable
storage solution for critical data.
2. Improved performance: RAID can improve performance by spreading data across multiple
disks. This allows multiple read/write operations to co-occur, which can speed up data
access.
3. Scalability: RAID can be scaled by adding more disks to the array. This means that storage
capacity can be increased without having to replace the entire storage system.
4. Cost-effective: Some RAID configurations, such as RAID 0, can be implemented with low-
cost hardware. This makes RAID a cost-effective solution for small businesses or home
users.
Disadvantages of RAID
1. Cost: Some RAID configurations, such as RAID 5 or RAID 6, can be expensive to implement.
This is because they require additional hardware or software to provide redundancy.
2. Performance limitations: Some RAID configurations, such as RAID 1 or RAID 5, can have
performance limitations. For example, RAID 1 can only read data as fast as a single drive,
while RAID 5 can have slower write speeds due to the parity calculations required.
3. Complexity: RAID can be complex to set up and maintain. This is especially true for more
advanced configurations, such as RAID 5 or RAID 6.
4. Increased risk of data loss: While RAID provides redundancy, it is not a substitute for proper
backups. If multiple drives fail simultaneously, data loss can still occur.
File Organization
o A file is a collection of records. Using the primary key, we can access the records. The
type and frequency of access are determined by the type of file organization used for a
given set of records.
o File organization is a logical relationship among various records. This method defines
how file records are mapped onto disk blocks.
o File organization describes the way in which the records are stored in terms of blocks,
and how the blocks are placed on the storage medium.
o The first approach to mapping the database to files is to use several files and store
only fixed-length records in any given file. An alternative approach is to structure our
files so that they can contain records of multiple lengths.
o Files of fixed length records are easier to implement than the files of variable length
records.
File organization contains various methods. These particular methods have pros and cons on
the basis of access or selection. In the file organization, the programmer decides the best-suited
file organization method according to his requirement.
Sequential File Organization
This method is the easiest method of file organization. In this method, files are stored
sequentially. It can be implemented in two ways (both are sketched below):
1. Pile file method: Suppose we have records R1, R3 and so on up to R9 and R8 in a
sequence. Here, a record is nothing but a row in the table. If we want to insert a new
record R2 in the sequence, it is simply placed at the end of the file.
2. Sorted file method: Suppose there is a pre-existing sorted sequence of records R1, R3
and so on up to R6 and R7. If a new record R2 has to be inserted in the sequence, it is
inserted at the end of the file, and then the sequence is sorted again.
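A small Python sketch of the two insertion approaches (record keys simplified to plain values for illustration):

import bisect

# Pile file method: a new record is simply appended at the end.
pile = ["R1", "R3", "R9", "R8"]
pile.append("R2")

# Sorted file method: the record is placed so the sequence stays ordered
# (bisect.insort inserts directly at the sorted position).
sorted_file = [1, 3, 6, 7]
bisect.insort(sorted_file, 2)    # -> [1, 2, 3, 6, 7]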
Pros of sequential file organization
o It is a fast and efficient method for handling huge amounts of data.
o In this method, files can be easily stored on cheaper storage mechanisms like magnetic
tapes.
o It is simple in design and requires little effort to store the data.
o This method is used when most of the records have to be accessed, as in grade calculation
for students, generating salary slips, etc.
o This method is used for report generation or statistical calculations.
Heap File Organization
Suppose we have five records R1, R3, R6, R4 and R5 in a heap, and suppose we want to insert
a new record R2 into the heap. If data block 3 is full, then R2 will be inserted into any
data block selected by the DBMS, let's say data block 1.
If we want to search, update or delete data in heap file organization, we need to traverse
the data from the start of the file until we get the requested record.
If the database is very large, then searching, updating or deleting a record will be
time-consuming, because there is no sorting or ordering of the records. In heap file
organization, we need to check all the data until we get the requested record.
Hash File Organization
Hash file organization uses the computation of a hash function on some fields of the
records. The hash function's output determines the location of the disk block where the
record is to be placed.
When a record has to be retrieved using the hash key columns, the address is generated,
and the whole record is retrieved using that address. In the same way, when a new record
has to be inserted, the address is generated using the hash key and the record is inserted
directly at that address. The same process applies to delete and update operations.
In this method, there is no effort spent on searching and sorting the entire file; each
record is stored at a computed location in memory.
Indexed sequential access method (ISAM)
ISAM method is an advanced sequential file organization. In this method, records are stored in
the file using the primary key. An index value is generated for each primary key and mapped
with the record. This index contains the address of the record in the file.
If any record has to be retrieved based on its index value, then the address of the data block is
fetched and the record is retrieved from the memory.
Pros of ISAM:
o Since each record has the address of its data block in this method, searching a record
in a huge database is quick and easy.
o This method supports range retrieval and partial retrieval of records. Since the index
is based on the primary key values, we can retrieve the data for a given range of values.
In the same way, partial values can also be easily searched, e.g., student names starting
with 'JA' can be easily found.
Cons of ISAM
o This method requires extra space in the disk to store the index value.
o When the new records are inserted, then these files have to be reconstructed to
maintain the sequence.
o When the record is deleted, then the space used by it needs to be released. Otherwise,
the performance of the database will slow down.
Redundant Array of Independent Disks
RAID, or Redundant Array of Independent Disks, is a technology used to connect multiple
secondary storage devices and use them as a single storage medium.
RAID consists of an array of disks in which multiple disks are connected together to
achieve different goals. RAID levels define the use of disk arrays.
Indexing in DBMS
Index structure:
o The first column of the index is the search key, which contains a copy of the primary
key or candidate key of the table. The values of the primary key are stored in sorted
order so that the corresponding data can be accessed easily.
o The second column of the index is the data reference. It contains a set of pointers
holding the address of the disk block where the value of the particular key can be found.
Indexing Methods
Ordered indices
The indices are usually sorted to make searching faster. The indices which are sorted are known
as ordered indices.
Example: Suppose we have an employee table with thousands of records, each of which is
10 bytes long. If the IDs start from 1, 2, 3 ... and so on, and we have to search for the
employee with ID 543:
o In the case of a database with no index, we have to search the disk blocks from the
start until we reach 543. The DBMS will reach the record after reading 543*10 = 5430 bytes.
o In the case of an index, we search using the index, and the DBMS reads the record after
reading 542*2 = 1084 bytes (assuming a 2-byte index entry), which is far less than in the
previous case.
Primary Index
o If the index is created on the basis of the primary key of the table, it is known as
primary indexing. These primary keys are unique to each record and have a 1:1 relation
with the records.
o As primary keys are stored in sorted order, the performance of the searching operation
is quite efficient.
o The primary index can be classified into two types: Dense index and Sparse index.
Dense index
o The dense index contains an index record for every search key value in the data file. It
makes searching faster.
o In this, the number of records in the index table is the same as the number of records
in the main table.
o It needs more space to store the index records themselves. The index records contain the
search-key value and a pointer to the actual record on the disk.
Sparse index
o In the data file, an index record appears only for a few items; each index entry points
to a block.
o Instead of pointing to each record in the main table, the index points to the records in
the main table at intervals (gaps).
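A minimal sketch of a sparse-index lookup (index layout assumed for illustration): because the index holds only one entry per block, a search finds the last entry whose key is less than or equal to the search key and then scans only that block:

import bisect

# Sparse index: one (first_key, block_no) entry per data block (layout assumed).
index_keys   = [100, 200, 300, 400]   # first search-key value in each block
index_blocks = [0, 1, 2, 3]

def locate_block(key: int) -> int:
    # Find the rightmost index entry whose key is <= the search key.
    pos = bisect.bisect_right(index_keys, key) - 1
    return index_blocks[pos]

print(locate_block(243))   # block 1: holds keys 200..299, scanned sequentially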
Clustering Index
o A clustered index can be defined as an ordered data file. Sometimes the index is created
on non-primary key columns which may not be unique for each record.
o In this case, to identify the record faster, we will group two or more columns to get the
unique value and create index out of them. This method is called a clustering index.
o The records which have similar characteristics are grouped, and indexes are created for
these group.
Example: suppose a company contains several employees in each department. Suppose we use
a clustering index, where all employees which belong to the same Dept_ID are considered
within a single cluster, and index pointers point to the cluster as a whole. Here Dept_Id is a non-
unique key.
The previous scheme is a little confusing because one disk block is shared by records that
belong to different clusters. Using a separate disk block for each cluster is considered
the better technique.
Secondary Index
In the sparse indexing, as the size of the table grows, the size of mapping also grows. These
mappings are usually kept in the primary memory so that address fetch should be faster. Then
the secondary memory searches the actual data based on the address got from mapping. If the
mapping size grows then fetching the address itself becomes slower. In this case, the sparse
index will not be efficient. To overcome this problem, secondary indexing is introduced.
In secondary indexing, to reduce the size of mapping, another level of indexing is introduced. In
this method, the huge range for the columns is selected initially so that the mapping size of the
first level becomes small. Then each range is further divided into smaller ranges. The mapping
of the first level is stored in the primary memory, so that address fetch is faster. The mapping of
the second level and actual data are stored in the secondary memory (hard disk).
For example:
o If you want to find the record of roll 111 in the diagram, then it will search the highest
entry which is smaller than or equal to 111 in the first level index. It will get 100 at this
level.
o Then at the second index level, it again searches for the largest entry <= 111 and gets
110. Using the address 110, it goes to the data block and searches each record sequentially
until it finds 111.
o This is how a search is performed in this method. Inserting, updating or deleting is also
done in the same manner.
B+ Tree
o The B+ tree is a balanced search tree (not a binary tree: each node can have many children). It follows a multi-level index format.
o In the B+ tree, leaf nodes denote actual data pointers. B+ tree ensures that all leaf
nodes remain at the same height.
o In the B+ tree, the leaf nodes are linked together in a linked list. Therefore, a B+
tree can support sequential access as well as random access.
Structure of B+ Tree
o In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is of
the order n where n is fixed for every B+ tree.
o It contains an internal node and leaf node.
Internal node
o An internal node of the B+ tree can contain at least n/2 record pointers except the root
node.
o At most, an internal node of the tree contains n pointers.
Leaf node
o The leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key values.
o At most, a leaf node contains n record pointers and n key values.
o Every leaf node of the B+ tree contains one block pointer P to point to next leaf node.
Suppose we have to search for 55 in the B+ tree structure below. First, we look in the
intermediary node for the branch that leads to the leaf node that may contain the record
for 55.
So, in the intermediary node, we find the branch between the 50 and 75 keys. At the end,
we are redirected to the third leaf node, where the DBMS performs a sequential search to
find 55.
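That search can be sketched in Python under an assumed in-memory node representation (internal nodes hold keys plus child pointers; leaves hold sorted keys):

import bisect

# Assumed node shapes: internal = (keys, children); leaf = sorted list of keys.
def bplus_search(node, key):
    while isinstance(node, tuple):             # descend through internal nodes
        keys, children = node
        node = children[bisect.bisect_right(keys, key)]
    return key in node                         # sequential search in the leaf

leaf1, leaf2, leaf3 = [10, 25], [50, 55, 60], [75, 80, 95]
root = ([50, 75], [leaf1, leaf2, leaf3])
print(bplus_search(root, 55))   # True: found via the branch between 50 and 75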
B+ Tree Insertion
Suppose we want to insert a record 60 in the below structure. It will go to the 3rd leaf node
after 55. It is a balanced tree, and a leaf node of this tree is already full, so we cannot insert 60
there.
In this case, we have to split the leaf node, so that it can be inserted into tree without affecting
the fill factor, balance and order.
The 3rd leaf node has the values (50, 55, 60, 65, 70) and its current root node is 50. We will split
the leaf node of the tree in the middle so that its balance is not altered. So we can group (50,
55) and (60, 65, 70) into 2 leaf nodes.
If these two have to be leaf nodes, the intermediate node cannot branch from 50. It should
have 60 added to it, and then we can have pointers to the new leaf node.
This is how we can insert an entry when there is overflow. In a normal scenario, it is very easy
to find the node where it fits and then place it in that leaf node.
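The split described above, sketched in Python (node layout assumed for illustration):

# Splitting an overfull leaf: keep the left half, move the right half to a new
# leaf, and push the first key of the new leaf up into the parent node.
def split_leaf(keys):
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]    # right[0] is the separator for the parent

left, right, separator = split_leaf([50, 55, 60, 65, 70])
print(left, right, separator)       # [50, 55] [60, 65, 70] 60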
B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have to remove 60
from the intermediate node as well as from the 4th leaf node. If we remove it only from
the intermediate node, the tree will no longer satisfy the rules of a B+ tree, so we need
to modify it to keep the tree balanced.
After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as follows:
Hashing
In a huge database structure, it is very inefficient to search through all the index values
to reach the desired data. The hashing technique is used to calculate the direct location
of a data record on the disk without using an index structure.
In this technique, data is stored at the data blocks whose address is generated by using the
hashing function. The memory location where these records are stored is known as data bucket
or data blocks.
Here, the hash function can choose any column value to generate the address. Most of the
time, the hash function uses the primary key to generate the address of the data block. The
hash function can be any simple or complex mathematical function. We can even use the
primary key itself as the address of the data block, meaning each row is stored in the
data block whose address equals its primary key value.
The diagram above shows data block addresses that are the same as the primary key values.
The hash function can also be a simple mathematical function like mod, exponential, cos,
sin, etc. Suppose we use a mod(5) hash function to determine the address of the data block.
In this case, applying mod(5) to the primary keys generates 3, 3, 1, 4 and 2 respectively,
and the records are stored at those data block addresses.
Types of Hashing:
Static Hashing
In static hashing, the resultant data bucket address is always the same. That means that
if we generate an address for EMP_ID = 103 using the hash function mod(5), the result is
always the same bucket address, 3. The bucket address never changes.
Hence, in static hashing, the number of data buckets in memory remains constant
throughout. In this example, we will have five data buckets in memory to store the data,
as sketched below.
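A Python sketch of this static mod(5) scheme (the employee IDs are assumed so that the generated addresses match the 3, 3, 1, 4, 2 sequence above):

# Static hashing with mod(5): the bucket address is fixed by the hash function.
NUM_BUCKETS = 5
buckets = {i: [] for i in range(NUM_BUCKETS)}

def bucket_address(emp_id: int) -> int:
    return emp_id % NUM_BUCKETS

for emp_id in (103, 108, 106, 104, 102):    # addresses 3, 3, 1, 4, 2
    buckets[bucket_address(emp_id)].append(emp_id)

print(bucket_address(103))   # always 3: the address never changes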
Operations of Static Hashing
o Searching a record
When a record needs to be searched, then the same hash function retrieves the address of the
bucket where the data is stored.
o Insert a Record
When a new record is inserted into the table, then we will generate an address for a new
record based on the hash key and record is stored in that location.
o Delete a Record
To delete a record, we will first fetch the record which is supposed to be deleted. Then we will
delete the records for that address in memory.
o Update a Record
To update a record, we will first search it using a hash function, and then the data record is
updated.
If we want to insert a new record into the file but the data bucket address generated by
the hash function is not empty (data already exists at that address), the situation is
known as bucket overflow in static hashing. This is a critical situation for this method.
To overcome this situation, there are various methods. Some commonly used methods are as
follows:
1. Open Hashing
When a hash function generates an address at which data is already stored, the next
available bucket is allocated to the new record. This mechanism is called Linear Probing.
For example: Suppose R3 is a new record which needs to be inserted, and the hash function
generates address 112 for R3. But the generated address is already full, so the system
searches for the next available data bucket, 113, and assigns R3 to it.
2. Close Hashing
When buckets are full, then a new data bucket is allocated for the same hash result and is
linked after the previous one. This mechanism is known as Overflow chaining.
For example: Suppose R3 is a new record which needs to be inserted into the table, and the
hash function generates address 110 for it. But this bucket is already full. In this case,
a new bucket is allocated at the end of bucket 110's chain and is linked to it. Both
strategies are sketched below.
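Both overflow-handling strategies can be sketched in Python (a bucket capacity of one record is assumed for the probing case):

# Open hashing (linear probing): on collision, use the next free bucket.
def insert_linear_probing(table, addr, record):
    while table[addr] is not None:           # bucket full -> probe the next one
        addr = (addr + 1) % len(table)
    table[addr] = record

# Close hashing (overflow chaining): link overflow records after the full bucket.
def insert_chaining(table, addr, record):
    table[addr].append(record)               # each slot is a chain (list) of records

probing_table = [None] * 200
probing_table[112] = "R1"                            # address 112 is already occupied
insert_linear_probing(probing_table, 112, "R3")      # R3 ends up in bucket 113

chain_table = [[] for _ in range(200)]
chain_table[110] = ["R1"]                            # bucket 110 is already full
insert_chaining(chain_table, 110, "R3")              # R3 is linked after it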
Dynamic Hashing
o The dynamic hashing method is used to overcome the problems of static hashing like
bucket overflow.
o In this method, data buckets grow or shrink as the number of records increases or
decreases. This method is also known as the extendable hashing method.
o This method makes hashing dynamic, i.e., it allows insertion or deletion without
resulting in poor performance.
Consider the following grouping of keys into buckets, depending on the prefix of their hash
address:
The last two bits of 2 and 4 are 00, so they will go into bucket B0. The last two bits of
5 and 6 are 01, so they will go into bucket B1. The last two bits of 1 and 3 are 10, so
they will go into bucket B2. The last two bits of 7 are 11, so it will go into bucket B3.
Insert key 9 with hash address 10001 into the above structure:
o Since key 9 has hash address 10001, it must go into bucket B1. But bucket B1 is full, so
it will get split.
o The splitting separates 5 and 9 from 6: the last three bits of 5 and 9 are 001, so they
go into bucket B1, while the last three bits of 6 are 101, so it goes into bucket B5.
o Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100
directory entries, because the last two bits of both entries are 00.
o Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110
directory entries, because the last two bits of both entries are 10.
o Key 7 is still in B3. The records in B3 are pointed to by the 111 and 011 directory
entries, because the last two bits of both entries are 11.
Query Processing
Query processing includes certain activities for data retrieval. Initially, the user's
query is written in a high-level database language such as SQL. It then gets translated
into expressions that can be used at the physical level of the file system, after which the
actual evaluation of the query and a variety of query-optimizing transformations take
place.

Before processing a query, the system needs to translate it from its human-readable form
into an internal representation. SQL, or Structured Query Language, is the best-suited
choice for humans, but it is not suitable as the internal representation of the query
inside the system; relational algebra is well suited for that internal representation.

The translation step in query processing is similar to the work of a parser. When a user
executes a query, the parser in the system checks the syntax of the query, verifies the
names of the relations in the database, the tuples, and finally the required attribute
values, in order to generate the internal form of the query. The parser creates a tree of
the query, known as the 'parse tree', and then translates it into relational algebra form.
In doing so, it also replaces all uses of views appearing in the query.
Thus, the working of query processing can be summarized as: the query is parsed and
translated into relational algebra, optimized into an evaluation plan, and then executed
to produce the query output.
Suppose a user executes a query. As we have learned, there are various methods of
extracting the data from the database. Say the user wants to fetch the records of the
employees whose salary is greater than or equal to 10000. In SQL, the following query
would be written (table and column names assumed for illustration):

SELECT emp_name FROM Employee WHERE salary >= 10000;

To make the system understand the user query, it needs to be translated into relational
algebra. One possible relational algebra form of this query is:

σ salary>=10000 (Employee)

i.e., selecting the Employee tuples whose salary is at least 10000. After translating the
given query, we can execute each relational algebra operation using different algorithms.
So, in this way, query processing begins its work.
Evaluation
In addition to translating the query into relational algebra, the translated expression
must be annotated with instructions specifying how each operation is to be evaluated.
After translating the user query in this way, the system executes a query evaluation plan.
Optimization
o The cost of query evaluation can vary for different types of queries. The system is
responsible for constructing the evaluation plan, so the user need not write the query
efficiently.
o Usually, a database system generates an efficient query evaluation plan that minimizes
its cost. This type of task, performed by the database system, is known as Query
Optimization.
o For optimizing a query, the query optimizer should have an estimated cost analysis of
each operation. It is because the overall operation cost depends on the memory
allocations to several operations, execution costs, and so on.
Finally, after selecting an evaluation plan, the system evaluates the query and produces the
output of the query.
NoSQL
NoSQL Database is a non-relational Data Management System that does not require a fixed
schema. It avoids joins and is easy to scale. The major purpose of using a NoSQL database
is for distributed data stores with humongous data storage needs. NoSQL is used for Big
Data and real-time web apps. For example, companies like Twitter, Facebook and Google
collect terabytes of user data every single day.
NoSQL stands for "Not Only SQL" or "Not SQL". Though a better term would be "NoREL",
NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.
Traditional RDBMS uses SQL syntax to store and retrieve data for further insights. A NoSQL
database system, in contrast, encompasses a wide range of database technologies that can
store structured, semi-structured, unstructured and polymorphic data.
Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like Google, Facebook,
Amazon, etc. who deal with huge volumes of data. The system response time becomes slow
when you use RDBMS for massive volumes of data.
To resolve this problem, we could “scale up” our systems by upgrading our existing hardware.
This process is expensive.
The alternative for this issue is to distribute database load on multiple hosts whenever the load
increases. This method is known as “scaling out.”
NoSQL database is non-relational, so it scales out better than relational databases as they are
designed with web applications in mind.
Brief History of NoSQL Databases
1998- Carlo Strozzi uses the term NoSQL for his lightweight, open-source relational
database
2000- Graph database Neo4j is launched
2004- Google BigTable is launched
2005- CouchDB is launched
2007- The research paper on Amazon Dynamo is released
2008- Facebook open-sources the Cassandra project
2009- The term NoSQL is reintroduced
Features of NoSQL
Non-relational
NoSQL databases never follow the relational model
Never provide tables with flat fixed-column records
Work with self-contained aggregates or BLOBs
Doesn’t require object-relational mapping and data normalization
No complex features like query languages, query planners, referential integrity, joins,
or ACID
Schema-free
NoSQL databases are either schema-free or have relaxed schemas
Do not require any sort of definition of the schema of the data
Offers heterogeneous structures of data in the same domain
Simple API
Offers easy-to-use interfaces for storage and data querying
APIs allow low-level data manipulation and selection methods
Text-based protocols, mostly HTTP REST with JSON
Mostly no standards-based NoSQL query language
Web-enabled databases running as internet-facing services
Distributed
Multiple NoSQL databases can be executed in a distributed fashion
Offers auto-scaling and fail-over capabilities
Often ACID concept can be sacrificed for scalability and throughput
Mostly no synchronous replication between distributed nodes; instead asynchronous
multi-master replication, peer-to-peer, HDFS replication
Provides only eventual consistency
Shared Nothing Architecture. This enables less coordination and higher distribution.
Types of NoSQL Databases
Key-value Pair Based
This is one of the most basic NoSQL database types. This kind of NoSQL database is used
for collections, dictionaries, associative arrays, etc. Key-value stores help the
developer to store schema-less data. They work best for shopping cart contents.
Redis, Dynamo and Riak are some NoSQL examples of key-value store databases. They are all
based on Amazon's Dynamo paper.
Column-based
Column-oriented databases work on columns and are based on the BigTable paper by Google.
Every column is treated separately, and the values of a single column are stored
contiguously.
They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN, etc., as
the data is readily available in a column.
Column-based NoSQL databases are widely used to manage data warehouses, business
intelligence, CRM, library card catalogs, etc.
HBase, Cassandra and Hypertable are NoSQL examples of column-based databases.
Document-Oriented
Document-oriented NoSQL databases store and retrieve data as a key-value pair, but the
value part is stored as a document. The document is stored in JSON or XML format. The
value is understood by the DB and can be queried.
Graph-Based
Compared to a relational database, where tables are loosely connected, a graph database
is multi-relational in nature. Traversing relationships is fast, as they are already
captured in the DB and there is no need to calculate them.
Graph databases are mostly used for social networks, logistics and spatial data.
Neo4J, Infinite Graph, OrientDB and FlockDB are some popular graph-based databases.
Query Mechanism tools for NoSQL
The most common data retrieval mechanism is the REST-based retrieval of a value based on
its key/ID with a GET resource.
A document store database offers more complex queries, as it understands the value part of
a key-value pair. For example, CouchDB allows defining views with MapReduce.
What is the CAP Theorem?
The CAP theorem is also called Brewer's theorem. It states that it is impossible for a
distributed data store to simultaneously offer more than two of the following three
guarantees:
1. Consistency
2. Availability
3. Partition Tolerance
Consistency:
The data should remain consistent even after the execution of an operation. This means once
data is written, any future read request should contain that data. For example, after updating
the order status, all the clients should be able to see the same data.
Availability:
The database should always be available and responsive. It should not have any downtime.
Partition Tolerance:
Partition Tolerance means that the system should continue to function even if the
communication among the servers is not stable. For example, the servers can be partitioned
into multiple groups which may not communicate with each other. Here, if part of the database
is unavailable, other parts are always unaffected.
Eventual Consistency
The term "eventual consistency" means having copies of data on multiple machines to get
high availability and scalability. Thus, changes made to any data item on one machine have
to be propagated to the other replicas.
Data replication may not be instantaneous: some copies are updated immediately, while
others are updated in due course of time. These copies may be mutually inconsistent for a
while, but in due course of time they become consistent. Hence the name eventual
consistency.
BASE: Basically Available, Soft state, Eventual consistency
Basically Available means the DB is available all the time, as per the CAP theorem
Soft state means the system state may change over time, even without input
Eventual consistency means that the system will become consistent over time
Advantages of NoSQL
Can be used as Primary or Analytic Data Source
Big Data Capability
No Single Point of Failure
Easy Replication
No Need for Separate Caching Layer
It provides fast performance and horizontal scalability.
Can handle structured, semi-structured, and unstructured data with equal effect
Object-oriented programming which is easy to use and flexible
NoSQL databases don’t need a dedicated high-performance server
Support Key Developer Languages and Platforms
Simpler to implement than RDBMS
It can serve as the primary data source for online applications.
Handles big data which manages data velocity, variety, volume, and complexity
Excels at distributed database and multi-data center operations
Eliminates the need for a specific caching layer to store data
Offers a flexible schema design which can easily be altered without downtime or service
disruption
Disadvantages of NoSQL
No standardization rules
Limited query capabilities
RDBMS databases and tools are comparatively mature
It does not offer any traditional database capabilities, like consistency when multiple
transactions are performed simultaneously.
When the volume of data increases, it becomes difficult to maintain unique values as
keys
Doesn't work as well with relational data
The learning curve is steep for new developers
Open-source options are not yet popular enough for enterprises
Summary
NoSQL is a non-relational DMS, that does not require a fixed schema, avoids joins, and is
easy to scale
The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc. who deal with huge volumes of data
In 1998, Carlo Strozzi used the term NoSQL for his lightweight, open-source
relational database
NoSQL databases never follow the relational model; they are either schema-free or
have relaxed schemas
The four types of NoSQL database are 1) Key-value Pair Based, 2) Column-oriented,
3) Graph-based and 4) Document-oriented
NOSQL can handle structured, semi-structured, and unstructured data with equal effect
CAP theorem consists of three words Consistency, Availability, and Partition Tolerance
BASE stands for Basically Available, Soft state, Eventual consistency
The term “eventual consistency” means to have copies of data on multiple machines to
get high availability and scalability
NoSQL offers limited query capabilities
MongoDB
MongoDB is a document-oriented NoSQL database used for high-volume data storage. Instead
of using tables and rows as in traditional relational databases, MongoDB makes use of
collections and documents. Documents consist of key-value pairs, which are the basic unit
of data in MongoDB. Collections contain sets of documents and function as the equivalent
of relational database tables. MongoDB is a database which came into light around the
mid-2000s.
MongoDB Features
1. Each database contains collections which in turn contains documents. Each document
can be different with a varying number of fields. The size and content of each document
can be different from each other.
2. The document structure is more in line with how developers construct their classes and
objects in their respective programming languages. Developers will often say that their
classes are not rows and columns but have a clear structure with key-value pairs.
3. The rows (or documents, as they are called in MongoDB) don't need to have a schema
defined beforehand. Instead, the fields can be created on the fly.
4. The data model available within MongoDB allows you to represent hierarchical
relationships, to store arrays, and other more complex structures more easily.
5. Scalability – MongoDB environments are very scalable. Companies across the world have
defined clusters, some of them running 100+ nodes with millions of documents in the
database.
MongoDB Example
The below example shows how a document can be modeled in MongoDB.
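The original example figure is not reproduced here; a representative document (field names and values assumed for illustration) would look like the following, written as a Python dictionary ready to be stored with a MongoDB driver:

# A representative customer-order document (field names and values assumed).
order_document = {
    "_id": "5a1b2c3d4e5f6a7b8c9d0e1f",   # added by MongoDB to identify the document
    "CustomerName": "Alice",
    "OrderData": {                        # embedded document, not a separate table
        "OrderID": 111,
        "Product": "Laptop",
        "Quantity": 2,
    },
}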
1. The _id field is added by MongoDB to uniquely identify the document in the collection.
2. Note that the Order Data (OrderID, Product, and Quantity), which in an RDBMS would
normally be stored in a separate table, is in MongoDB actually stored as an embedded
document in the collection itself. This is one of the key differences in how data is
modeled in MongoDB.
JSON – This is known as JavaScript Object Notation. It is a human-readable, plain-text
format for expressing structured data. JSON is currently supported in many programming
languages.
Just a quick note on the key difference between the _id field and a normal collection
field: the _id field is used to uniquely identify the documents in a collection and is
automatically added by MongoDB when a document is inserted.
Why Use MongoDB?
Below are a few of the reasons why one should start using MongoDB:
1. Document-oriented – Since MongoDB is a NoSQL-type database, instead of having data in
a relational format, it stores the data in documents. This makes MongoDB very flexible
and adaptable to real business-world situations and requirements.
2. Ad hoc queries – MongoDB supports searching by field, range queries, and regular
expression searches. Queries can be made to return specific fields within documents.
3. Indexing – Indexes can be created to improve the performance of searches within
MongoDB. Any field in a MongoDB document can be indexed.
4. Replication – MongoDB can provide high availability with replica sets. A replica set
consists of two or more MongoDB instances. Each replica set member may act in the role
of the primary or secondary replica at any time. The primary replica is the main server
which interacts with the client and performs all the read/write operations. The
secondary replicas maintain a copy of the data of the primary using built-in replication.
When a primary replica fails, the replica set automatically switches over to a
secondary, which then becomes the primary server.
5. Load balancing – MongoDB uses the concept of sharding to scale horizontally by splitting
data across multiple MongoDB instances. MongoDB can run over multiple servers,
balancing the load and/or duplicating data to keep the system up and running in case of
hardware failure.
Data Modelling in MongoDB
As we have seen from the Introduction section, the data in MongoDB has a flexible schema.
Unlike in SQL databases, where you must have a table’s schema declared before inserting data,
MongoDB’s collections do not enforce document structure. This sort of flexibility is what makes
MongoDB so powerful.
When modeling data in Mongo, keep the following things in mind
1. What are the needs of the application – Look at the business needs of the application
and see what data and the type of data needed for the application. Based on this,
ensure that the structure of the document is decided accordingly.
2. What are the data retrieval patterns – If you foresee heavy query usage, consider the
use of indexes in your data model to improve the efficiency of queries (see the sketch
after this list).
3. Are frequent inserts, updates and removals happening in the database? Reconsider the
use of indexes, or incorporate sharding if required in your data modeling design, to
improve the efficiency of your overall MongoDB environment.
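For point 2, creating such an index is a one-liner. A hedged sketch with the pymongo driver (connection string, database and field names are assumptions, not part of the original text):

from pymongo import MongoClient, ASCENDING

# Sketch only: connection string, database and field names are assumed.
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Support a heavy query pattern with an index on the queried field.
orders.create_index([("CustomerName", ASCENDING)])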
Difference between MongoDB & RDBMS
Below are some of the key term differences between MongoDB and RDBMS
RDBMS vs. MongoDB terms:
o Table -> Collection: In RDBMS, the table contains the columns and rows which are used to
store the data, whereas in MongoDB this same structure is known as a collection. The
collection contains documents, which in turn contain fields, which are key-value pairs.
o Row -> Document: In RDBMS, the row represents a single, implicitly structured data item
in a table. In MongoDB, the data is stored in documents.
o Column -> Field: In RDBMS, the column denotes a set of data values. These in MongoDB are
known as fields.
o Joins -> Embedded documents: In RDBMS, data is sometimes spread across various tables,
and in order to show a complete view of all data, a join is sometimes formed across tables
to get the data. In MongoDB, the data is normally stored in a single collection, but
separated by using embedded documents. So there is no concept of joins in MongoDB.
Apart from the terms differences, a few other differences are shown below
1. Relational databases are known for enforcing data integrity. This is not an explicit
requirement in MongoDB.
2. RDBMS requires that data be normalized first so that it can prevent orphan records and
duplicates. Normalizing data then requires more tables, which results in more table joins,
thus requiring more keys and indexes. As databases start to grow, performance can become
an issue. Again, this is not an explicit requirement in MongoDB: MongoDB is flexible and
does not need the data to be normalized first.
What is MongoDB?
MongoDB is an open-source NoSQL database management program. NoSQL (Not only SQL) is used
as an alternative to traditional relational databases. NoSQL databases are quite useful
for working with large sets of distributed data. MongoDB is a tool that can manage
document-oriented information, and store or retrieve that information.
MongoDB is used for high-volume data storage, helping organizations store large amounts of
data while still performing rapidly. Organizations also use MongoDB for its ad-hoc queries,
indexing, load balancing, aggregation, server-side JavaScript execution and other features.
Structured Query Language (SQL) is a standardized programming language that is used to
manage relational databases. SQL normalizes data as schemas and tables, and every table has a
fixed structure.
Instead of using tables and rows as in relational databases, as a NoSQL database, the MongoDB
architecture is made up of collections and documents. Documents are made up of key-value
pairs, MongoDB's basic unit of data. Collections, the equivalent of SQL tables, contain
document sets. MongoDB offers support for many programming languages, such as C, C++, C#,
Go, Java, Python, Ruby and Swift.
How does MongoDB work?
MongoDB environments provide users with a server to create databases with MongoDB.
MongoDB stores data as records that are made up of collections and documents.
Documents contain the data the user wants to store in the MongoDB database. Documents are
composed of field and value pairs. They are the basic unit of data in MongoDB. The documents
are similar to JavaScript Object Notation (JSON) but use a variant called Binary JSON (BSON).
The benefit of using BSON is that it accommodates more data types. The fields in these
documents are like the columns in a relational database. Values contained can be a variety of
data types, including other documents, arrays and arrays of documents, according to the
MongoDB user manual. Documents also incorporate a primary key as a unique identifier. A
document's structure is changed by adding new fields or deleting existing ones.
Sets of documents are called collections, which function as the equivalent of relational
database tables. Collections can contain any type of data, but the restriction is the data in a
collection cannot be spread across different databases. Users of MongoDB can create multiple
databases with multiple collections.
The mongo shell is a standard component of the open-source distributions of MongoDB. Once
MongoDB is installed, users connect the mongo shell to their running MongoDB instances. The
mongo shell acts as an interactive JavaScript interface to MongoDB, which allows users to query
or update data and conduct administrative operations.
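For example, a typical interactive shell session might look like the following; the database and
collection names (mydb, users) are assumptions for illustration:
// Switch to (or implicitly create) a database.
use mydb
// Query the (assumed) users collection.
db.users.find({ city: "Hyderabad" })
// Update a matching document interactively.
db.users.updateOne({ name: "Radhika" }, { $set: { city: "Chennai" } })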
A binary representation of JSON-like documents is provided by the BSON document storage and
data interchange format. Automatic sharding is another key feature that enables data in a
MongoDB collection to be distributed across multiple systems for horizontal scalability, as
data volumes and throughput requirements increase.
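As a hedged illustration, on a sharded cluster sharding is enabled per database and per
collection from the shell; the database name, collection name, and shard key below are
assumptions:
// Enable sharding for an (assumed) database.
sh.enableSharding("mydb")
// Shard the (assumed) users collection on a hashed _id shard key.
sh.shardCollection("mydb.users", { _id: "hashed" })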
The NoSQL DBMS uses a single master architecture for data consistency, with secondary
databases that maintain copies of the primary database. Operations are automatically
replicated to those secondary databases for automatic failover.
MongoDB platforms
MongoDB is available in community and commercial versions through vendor MongoDB Inc.
MongoDB Community Edition is the open source release, while MongoDB Enterprise Server
brings added security features, an in-memory storage engine, administration and
authentication features, and monitoring capabilities through Ops Manager.
A graphical user interface (GUI) named MongoDB Compass gives users a way to work with
document structure, conduct queries, index data and more. The MongoDB Connector for BI lets
users connect the NoSQL database to their business intelligence tools to visualize data and
create reports using SQL queries.
Following in the footsteps of other NoSQL database providers, MongoDB Inc. launched a cloud
database as a service named MongoDB Atlas in 2016. Atlas runs on AWS, Microsoft Azure and
Google Cloud Platform. Later, MongoDB released a platform named Stitch for application
development on MongoDB Atlas, with plans to extend it to on-premises databases.
Architecture
When designing a modern application, chances are that you will need a database to store data.
There are many ways to architect software solutions that use a database, depending on how
your application will use this data. In this article, we will cover the different types of database
architecture and describe in greater detail a three-tier application architecture, which is
extensively used in modern web applications.
What is database architecture?
Database architecture describes how a database management system (DBMS) will be
integrated with your application. When designing a database architecture, you must make
decisions that will change how your applications are created.
First, decide on the type of database you would like to use. The database could be centralized
or decentralized. Centralized databases are typically used for regular web applications and will
be the focus of this article. Decentralized databases, such as blockchain databases, might
require a different architecture.
Once you’ve decided the type of database you want to use, you can determine the type of
architecture you want to use. Typically, these are categorized into single-tier or multi-tier
applications.
What are the types of database architecture?
When we talk about database architectures, we refer to the number of tiers an application has.
1-tier architecture
In 1-tier architecture, the database and any application interfacing with the database are kept
on a single server or device. Because there are no network delays involved, this is generally a
fast way to access data.
On a single-tier application, the application and database reside on the same device.
An example of a 1-tier architecture would be a mobile application that uses Realm, the open-
source mobile database by MongoDB, as a local database. In that case, both the application and
the database are running on the user’s mobile device.
2-tier architecture
2-tier architectures consist of multiple clients connecting directly to the database. This
architecture is also known as client-server architecture.
3-tier architecture
In a 3-tier architecture, the information between the database and the clients is relayed by a
back-end server.
An example of this type of architecture would be a React application that connects to a Node.js
back end. The Node.js back end processes the requests and fetches the necessary information
from a database such as MongoDB Atlas, using the native driver. This architecture is described
in greater detail in the next section.
What are the three levels of database architecture in MongoDB Atlas?
The most common DBMS architecture used in modern application development is the 3-tier
model. Since it’s so popular, let’s look at what this architecture looks like with MongoDB Atlas.
A three-tier application is composed of three layers: the data layer, the application layer, and
the presentation layer.
Data (database) layer
As the name suggests, the data layer is where the data resides. In the scenario above, the data
is stored in a MongoDB Atlas database hosted on any public cloud—or across multiple clouds, if
needed. The only responsibility of this layer is to keep the data accessible for the application
layer and run the queries efficiently.
Application (middle) layer
The application tier is in charge of communicating with the database. To ensure secure access
to the data, requests are initiated from this tier. In a modern web application, this would be
your API. A back-end application built with Node.js (or any other programming language with
a native driver) makes requests to the database and relays the information back to the clients.
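A minimal sketch of such an application-tier request, using the official MongoDB Node.js
driver; the connection string, database name (mydb), and collection name (users) are
placeholders:
// Application tier: fetch a document and relay it to the presentation tier.
const { MongoClient } = require("mongodb");

async function getUser(name) {
  // Placeholder Atlas connection string; substitute real credentials.
  const client = new MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net");
  try {
    await client.connect();
    // Query the (assumed) users collection in the (assumed) mydb database.
    return await client.db("mydb").collection("users").findOne({ name });
  } finally {
    await client.close();
  }
}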
Presentation (user) layer
The final layer is the presentation layer. This is usually the UI of the application with which the
users will interact. In the case of a MERN or MEAN stack application, this would be the
JavaScript front end built with React or Angular.
Summary
In this article, you’ve learned about the different types of database architecture. A 3-tier
architecture is your go-to solution for most modern web applications. However, there are other
topologies that you might want to explore. For example, the type of database you use could be
a dedicated or a serverless instance, depending on your predicted usage model. You could also
supplement your database with data lakes or even online archiving to make the best use of
your hardware resources. If you are ready to concretize your database architecture, why not
try MongoDB Atlas, the database-as-a-service solution from MongoDB? Using the realm-web
SDK, you can even host all three tiers of your web application on MongoDB Atlas.
Data Model Design
MongoDB provides two types of data models: the embedded data model and the normalized
data model. Based on the requirement, you can use either of the models while preparing your
document.
Embedded Data Model
In this model, you can have (embed) all the related data in a single document; it is also known
as the de-normalized data model.
For example, assume we are getting the details of employees in three different documents,
namely Personal_details, Contact, and Address. You can embed all three documents in a
single one as shown below −
{
   _id: <ObjectId>,
   Emp_ID: "10025AE336",
   Personal_details: {
      First_Name: "Radhika",
      Last_Name: "Sharma",
      Date_Of_Birth: "1995-09-26"
   },
   Contact: {
      "e-mail": "[email protected]",
      phone: "9848022338"
   },
   Address: {
      city: "Hyderabad",
      Area: "Madapur",
      State: "Telangana"
   }
}
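With this model, all related data comes back in a single read. As a small sketch, assuming the
document above is stored in a collection named employees (an assumed name), dot notation
reaches into the embedded sub-documents:
// Match on a field inside an embedded document.
db.employees.find({ "Address.city": "Hyderabad" })
// One read returns the employee together with all embedded details; no join is required.
db.employees.findOne({ Emp_ID: "10025AE336" })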
Normalized Data Model
In this model, you can refer to the sub-documents from the original document using
references. For example, you can rewrite the above document in the normalized model as:
Employee:
{
   _id: <ObjectId101>,
   Emp_ID: "10025AE336"
}
Personal_details:
{
   _id: <ObjectId102>,
   empDocID: "ObjectId101",
   First_Name: "Radhika",
   Last_Name: "Sharma",
   Date_Of_Birth: "1995-09-26"
}
Contact:
{
   _id: <ObjectId103>,
   empDocID: "ObjectId101",
   "e-mail": "[email protected]",
   phone: "9848022338"
}
Address:
{
   _id: <ObjectId104>,
   empDocID: "ObjectId101",
   city: "Hyderabad",
   Area: "Madapur",
   State: "Telangana"
}
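Reading related data in this model takes an extra lookup per reference. A minimal sketch,
assuming the documents above live in collections named Employee and Contact, and that
empDocID stores the referenced employee's _id:
// Fetch the employee, then follow the reference manually.
var emp = db.Employee.findOne({ Emp_ID: "10025AE336" });
db.Contact.findOne({ empDocID: emp._id });
// Alternatively, the $lookup aggregation stage joins referenced documents on read.
db.Employee.aggregate([
   { $match: { Emp_ID: "10025AE336" } },
   { $lookup: { from: "Contact", localField: "_id", foreignField: "empDocID", as: "contact" } }
])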
Considerations while designing Schema in MongoDB
Design your schema according to user requirements.
Combine objects into one document if you will use them together; otherwise, separate
them (but make sure there is no need for joins).
Duplicate data, within limits, because disk space is cheap compared to compute time.
Do joins on write, not on read.
Optimize your schema for the most frequent use cases.
Do complex aggregation in the schema (pre-aggregate where possible rather than
computing on every read).
Example
Suppose a client needs a database design for his blog/website, and we want to see the
differences between the RDBMS and MongoDB schema designs. The website has the following
requirements:
Every post has a unique title, description, and URL.
Every post can have one or more tags.
Every post has the name of its publisher and the total number of likes.
Every post has comments given by users, along with their name, message, date-time, and
likes.
On each post, there can be zero or more comments.
In an RDBMS schema, the design for the above requirements will have a minimum of three
tables.
While in the MongoDB schema, the design will have one collection, post, with the following
structure −
{
   _id: POST_ID,
   title: TITLE_OF_POST,
   description: POST_DESCRIPTION,
   by: POST_BY,
   url: URL_OF_POST,
   tags: [TAG1, TAG2, TAG3],
   likes: TOTAL_LIKES,
   comments: [
      {
         user: 'COMMENT_BY',
         message: TEXT,
         dateCreated: DATE_TIME,
         like: LIKES
      },
      {
         user: 'COMMENT_BY',
         message: TEXT,
         dateCreated: DATE_TIME,
         like: LIKES
      }
   ]
}
So, while showing the data, in RDBMS you need to join three tables, whereas in MongoDB the
data will be shown from one collection only.
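As a hedged, concrete version of the structure above (the posts collection name and the
sample values are assumptions):
// Insert one blog post with an embedded comment.
db.posts.insertOne({
   title: "MongoDB Overview",
   description: "A short introduction to MongoDB",
   by: "sample author",
   url: "http://www.example.com/mongodb-overview",
   tags: ["mongodb", "database", "NoSQL"],
   likes: 100,
   comments: [{ user: "user1", message: "My first comment", dateCreated: new Date(), like: 0 }]
})
// One query returns the post together with its comments; no join is needed.
db.posts.findOne({ title: "MongoDB Overview" })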
MongoDB Datatypes
MongoDB supports many datatypes, some of which are listed below.
String − This is the most commonly used datatype to store data. Strings in MongoDB
must be UTF-8 valid.
Integer − This type is used to store a numerical value. Integer can be 32 bit or 64 bit
depending upon your server.
Boolean − This type is used to store a boolean (true/ false) value.
Double − This type is used to store floating point values.
Min/Max keys − This type is used to compare a value against the lowest and highest
BSON elements.
Arrays − This type is used to store arrays, lists, or multiple values in one key.
Timestamp − This type is used to store a timestamp. This can be handy for recording
when a document has been modified or added.
Object − This datatype is used for embedded documents.
Null − This type is used to store a Null value.
Symbol − This datatype is used identically to a string; however, it's generally reserved
for languages that use a specific symbol type.
Date − This datatype is used to store the current date or time in UNIX time format. You
can specify your own date and time by creating an object of Date and passing the day,
month, and year into it.
Object ID − This datatype is used to store the document’s ID.
Binary data − This datatype is used to store binary data.
Code − This datatype is used to store JavaScript code into the document.
Regular expression − This datatype is used to store regular expressions.
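To make these concrete, here is a hedged sketch of a single document that exercises several of
these datatypes; the samples collection and its field names are assumptions:
// Insert a document exercising several BSON datatypes.
db.samples.insertOne({
   name: "example",             // String
   count: NumberInt(42),        // 32-bit Integer
   price: 99.5,                 // Double
   inStock: true,               // Boolean
   tags: ["a", "b"],            // Array
   details: { color: "red" },   // Object (embedded document)
   addedOn: new Date(),         // Date
   note: null                   // Null
})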
CRUD Operations
MongoDB CRUD operations are the fundamental operations used to manage data within a
MongoDB database. CRUD stands for Create, Read, Update, and Delete, and these operations
are essential for working with documents in a MongoDB collection. Let's dive deeper into each
of these CRUD operations:
Create Operations:
Create or Insert Operations: These operations add new documents to a collection. If the
collection does not exist, insert operations will create it. MongoDB provides the following
methods for creating or inserting documents:
db.collection.insertOne(): This method is used to insert a single document into the collection.
db.collection.insertMany(): This method is used to insert multiple documents into the
collection at once.
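For example, using the RecordsDB collection that also appears in the delete examples below
(the sample field values are illustrative):
// Insert a single document.
db.RecordsDB.insertOne({ name: "Maki", species: "Cat" })
// Insert several documents at once.
db.RecordsDB.insertMany([
   { name: "Rex", species: "Dog" },
   { name: "Blue", species: "Dog" }
])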
Read Operations:
Read Operations: These operations retrieve documents from a collection, effectively querying
the collection for documents. MongoDB provides the following methods for reading
documents:
db.collection.find(): This method retrieves all documents in the collection that match the
query criteria.
db.collection.findOne(): This method retrieves the first document that matches the query
criteria.
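For example, again using the (assumed) RecordsDB collection:
// Retrieve every document in the collection.
db.RecordsDB.find()
// Retrieve the first document matching a filter.
db.RecordsDB.findOne({ species: "Dog" })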
Update Operations:
Update Operations: These operations modify existing documents within a collection. MongoDB
provides the following methods for updating documents:
db.collection.updateOne(): This method is used to update a single document that matches the
provided filter criteria.
db.collection.updateMany(): It allows you to update multiple documents that match the filter
criteria.
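For example, using the $set update operator on the (assumed) RecordsDB collection; the
vaccinated field is illustrative:
// Modify the first matching document.
db.RecordsDB.updateOne({ name: "Maki" }, { $set: { species: "Cat" } })
// Modify every matching document.
db.RecordsDB.updateMany({ species: "Dog" }, { $set: { vaccinated: true } })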
Delete Operations:
These operations remove documents from a collection. MongoDB provides the following
methods for deleting documents:
db.collection.deleteOne(): This method is used to delete a single document that matches the
filter criteria.
db.collection.deleteMany(): This method is used to delete multiple documents that match the
filter criteria.
For example:
db.RecordsDB.deleteOne({ name: "Maki" });
db.RecordsDB.deleteMany({ species: "Dog" });
In summary, MongoDB CRUD operations are essential for managing data within a MongoDB
database. They allow you to create, read, update, and delete documents, providing the basic
functionality needed to interact with the database and manipulate data as needed for your
applications.