Unit - 4
Database cluster
Database clustering is the process of connecting more than one database instance or
server to your system. In most common database clusters, multiple database instances are
managed by a single database server called the master. In the systems design world,
implementing such a design may be necessary, especially in large systems (web or mobile
applications), because a single database server would not be capable of handling all of the
customers' requests. To address this, multiple database servers that work in parallel are
introduced into the system.
It goes without saying that using such a technique comes with numerous benefits to our system
such as handling more users and overcoming system failures. One of the main disadvantages
of such an implementation is the additional complexity introduced into the system. To handle
this additional complexity, the multiple database servers are managed by a higher-level server
that monitors the flow of data throughout the system.
In a typical cluster configuration, multiple database servers are connected together using
a SAN device. SAN, short for Storage Area Network, is a computer network that
provides access to consolidated, block-level data storage. SANs are primarily used to
access storage devices, such as disk arrays and tape libraries, from servers so that
the devices appear to the operating system as direct-attached storage. While you still
can build your own database cluster, many companies now provide third-party cloud
database storage as a service. Using such services, customers can save the costs of
maintaining and monitoring their own database servers or clusters.
Database Cluster Architecture
Shared-Nothing Architecture
To build a shared-nothing database architecture, each node must be independent of all
other nodes: each node has its own storage and its own database server to store and
access data. In this type of architecture, no single database server is the master,
meaning that there is no central database node that monitors and controls access to
data in the system. Note that a shared-nothing architecture offers great horizontal
scalability, as no resources are shared between either nodes or database servers.
Shared-Disk Architecture
On the other hand, we have the shared-disk architecture. In this architecture, all
nodes (CPUs) share access to all the database servers available, and subsequently have
access to all the system's data. Unlike the shared-nothing architecture, the
interconnection network layer sits between the CPUs and the database servers, allowing
every node to reach multiple database servers. It is worth noting that a shared-disk
cluster does not offer as much scalability as the shared-nothing architecture: since all
nodes share access to the same data, a controlling node is required to monitor the data
flow in the system. The issue is that after exceeding a certain number of slave nodes,
the master node becomes unable to monitor and control all the slave nodes efficiently.
Indexing in DBMS
○ An index is a type of data structure. It is used to locate and access the data in a
database table quickly.
Index structure:
Indexes can be created using some database columns.
○ The first column of the index is the search key. It contains a copy of the
primary key or candidate key of the table. The values of the primary key are
stored in sorted order so that the corresponding data can be accessed easily.
○ The second column of the index is the data reference. It contains a set of
pointers holding the address of the disk block where the value of the particular
key can be found. A minimal example of creating an index is shown below.
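In most SQL databases, an index on a single column can be created with a statement along these lines (the student table and roll_no column here are only illustrative):
-- create an index on the roll_no column of the student table
CREATE INDEX idx_student_roll ON student (roll_no);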
Indexing Methods
Ordered indices
The indices are usually sorted to make searching faster. The indices which are sorted
are known as ordered indices.
Example: Suppose we have an employee table with thousands of records, each of
which is 10 bytes long. If their IDs start with 1, 2, 3... and so on, and we have to search
for the employee with ID 543:
○ In the case of a database with no index, we have to scan the disk blocks from the
start until we reach 543. The DBMS will find the record after reading
543*10 = 5430 bytes.
○ In the case of an index, we will search using the index, and (assuming each index
entry occupies 2 bytes) the DBMS will find the record after reading 542*2 = 1084
bytes, which is far less than in the previous case.
Primary Index
○ If the index is created on the basis of the primary key of the table, then it is
known as primary indexing. These primary keys are unique to each record and
have a 1:1 relation with the records.
○ As primary keys are stored in sorted order, the performance of the searching
operation is quite efficient.
○ The primary index can be classified into two types: Dense index and Sparse
index.
Dense index
○ The dense index contains an index record for every search key value in the data
file. It makes searching faster.
○ In this, the number of records in the index table is the same as the number of records
in the main table.
○ It needs more space to store the index records themselves. The index records hold the
search key and a pointer to the actual record on the disk.
Sparse index
○ In the data file, an index record appears only for a few items. Each of these entries
points to a block.
○ In this, instead of pointing to each record in the main table, the index points to
records in the main table at intervals (leaving gaps).
Clustering Index
○ A clustered index can be defined as an ordered data file. Sometimes the index is
created on non-primary key columns, which may not be unique for each record.
○ In this case, to identify the records faster, we group two or more columns to
get a unique value and create an index out of them. This method is called a
clustering index.
○ The records which have similar characteristics are grouped together, and indexes are
created for these groups.
Secondary Index
In sparse indexing, as the size of the table grows, the size of the mapping also grows.
These mappings are usually kept in primary memory so that address fetches are fast;
the actual data is then searched in secondary memory based on the address obtained
from the mapping. If the mapping size grows, fetching the address itself becomes
slower, and the sparse index is no longer efficient. To overcome this problem,
secondary indexing (a multi-level index) is introduced.
○ If you want to find the record with roll number 111 in such a two-level index, the
search looks for the highest entry that is smaller than or equal to 111 in the
first-level index. It gets 100 at this level.
○ Then, in the second index level, it again finds the largest entry less than or equal
to 111 and gets 110. Using the address stored with 110, it goes to the data block
and searches each record sequentially until it reaches 111.
The following guidelines apply when selecting primary index columns in parallel, shared-nothing databases that distribute rows across AMPs (access module processors) by hashing the primary index:
○ Select columns that are most frequently used to access rows. Restrict selection to columns that are either unique or highly singular.
○ Select columns that are most frequently used in equality predicate conditions. Equality conditions permit the system to hash directly to the row having the conditional value; when the primary index is unique, the response is never more than one row. Inequality conditions require additional processing.
○ Select columns that distribute rows evenly across the AMPs. Distinct values distribute evenly across all AMPs in the configuration, which maximizes parallel processing. Rows having duplicate NUPI (non-unique primary index) values hash to the same AMP and are often stored in the same data block; this is good when rows are only moderately nonunique. Rows having NUPI columns that are highly nonunique distribute unevenly, use multiple data blocks, and incur multiple I/Os. Extremely nonunique primary index values can skew space usage so markedly that the system returns a message indicating that the database is full even when it is not. This occurs when an AMP exceeds the maximum bytes threshold for a user or database (calculated by dividing the PERMANENT = n BYTES specification by the number of AMPs in the configuration), causing the system to incorrectly perceive the database to be "full."
○ Select columns that are not volatile. Volatile columns force frequent row redistribution.
○ Select columns having very many more distinct values than the number of AMPs in the configuration. If this guideline is not followed, row distribution skews heavily, not only wasting disk space but also devastating system performance. This rule is particularly important for large tables.
○ Do not select columns defined with Period, ARRAY, VARRAY, Geospatial, JSON, XML, BLOB, CLOB, XML-based UDT, BLOB-based UDT, or CLOB-based UDT data types. You cannot specify columns with these data types in a primary index definition; if you attempt to do so, the CREATE request aborts. You can, however, specify Period data type columns in the partitioning expression of a partitioned table.
○ Do not select aggregated columns of a join index. When defining the primary index for a join index, you cannot specify any aggregated columns.
There are a few basic rules to keep in mind when choosing indexes for a
database. A good index should take the following factors into account:
Table Size
It is not recommended to create indexes on small tables, as it takes the SQL Server Engine
less time to scan the underlying table than to traverse the index when searching for
specific data. In this case, the index will not be used but will still affect data modification
performance, as it will always have to be adjusted when the underlying table's data is modified.
Table Columns
In addition to database workload characteristics, the characteristics of the table columns that
are used in the submitted queries should also be considered when designing an index. For
instance, columns with exact numeric data types, such as the INT and BIGINT data types,
that are UNIQUE and NOT NULL are considered optimal columns to participate in the index
key.
Denormalization
When we want to retrieve data from multiple tables, we need to perform some kind of join
operation on them. In that case, we can use the denormalization technique, which helps
database designers increase the efficiency of their database infrastructure. This method
allows us to add redundant data into a normalized database to alleviate issues with
database queries that merge data from several tables into a single table. The
denormalization concept thus builds on an already normalized design: the redundancy is
added deliberately and selectively.
For example, suppose we have two tables, student and branch, after performing normalization.
The student table has the attributes roll_no, stud-name, age, and branch_id.
Additionally, the branch table is related to the student table with branch_id as the
common column (foreign key).
A JOIN operation between these two tables is needed when we want to retrieve all
student names as well as the branch name. Performing this join is fine if the tables are
small; the issue here is that if the tables are big, the join becomes expensive and slows
the query down.
In this case, we'll update the database with denormalization, accepting redundancy and
extra update effort to maximize the efficiency benefits of fewer joins. Therefore, we can
add the branch name's data from the branch table to the student table and thereby
optimize the database.
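A minimal sketch of this change in SQL, assuming the attribute names above and assuming the branch table's name column is called branch_name −
-- add a redundant copy of the branch name to the student table
ALTER TABLE student ADD COLUMN branch_name VARCHAR(50);
-- populate it from the branch table using the shared branch_id
UPDATE student s
SET s.branch_name = (SELECT b.branch_name FROM branch b WHERE b.branch_id = s.branch_id);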
Advantages of Denormalization
○ Retrieving data is faster: in a normalized design the data is spread across several
tables, but we already know that the more joins, the slower the query. To overcome this,
we can add redundancy to a database by copying values between parent and child tables.
○ Precomputed values can be stored: deriving these values on-the-fly will take a longer
time, slowing down the execution of the query, so frequently needed derived values can
be stored instead.
○ Queries become simpler, since they need to read from fewer tables.
Suppose you need certain statistics very frequently. It requires a long time to create
them from live data and slows down the entire system. Suppose you want to monitor
client revenues over a certain year for any or all clients. Generating such reports from
live data will require "searching" throughout the entire database, significantly slowing it
down.
Disadvantages of Denormalization
○ Since data can be modified in several places, it can become inconsistent. Hence,
we'll need to update every piece of duplicate data; the same applies to computed
values and reports. We can do this by using triggers, transactions,
and/or procedures for all operations that must be performed together.
How is denormalization different from normalization?
○ Denormalization is used when joins are costly, and queries are run regularly on
the tables. Normalization, on the other hand, is typically used when a large
number of insert/update/delete operations are performed, and joins between
those tables are not expensive.
Database Tuning
Database Tuning in SQL is a set of activities performed to optimize a database and
maximize its performance. There are various techniques with which you can configure
the optimal performance of a particular database. Database tuning overlaps with query
tuning, so good indexing and avoiding improper queries help in increasing the database
efficiency. In addition, normalizing the data properly, creating proper indexes, and using
a more powerful CPU (if needed) are also some of the general techniques.
Database Normalization
We can normalize a database by breaking down larger tables into smaller related tables. This
increases the performance of the database as it requires less time to retrieve data from
the smaller tables.
Proper Indexes
In SQL, indexes are pointers (memory addresses) to the location of specific data in the
database. We use indexes in our database to reduce query time, as the database
engine can jump to the location of a specific record using its index instead of scanning
the whole table.
Avoid Improper Queries
Improperly written queries also slow a database down. For example, choosing to retrieve
an entire table when we only need the data in a single column will unnecessarily increase
query time. So, query the database wisely.
Let us discuss some of the common improper queries and how to rectify them to improve
database performance.
1. Retrieve Only the Required Columns
In large databases, we should always retrieve only the required columns from the
database instead of retrieving all the columns, even when they are not needed. We can
easily do this by specifying the column names in the SELECT statement instead of using
SELECT *.
Example
Assume we have created a table with the name CUSTOMERS in a MySQL database using
the CREATE TABLE statement.
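One possible definition of this table, consistent with the columns used in the examples that follow (the data types and sizes are assumptions), is −
CREATE TABLE CUSTOMERS (
  ID INT NOT NULL,
  NAME VARCHAR(20) NOT NULL,
  AGE INT NOT NULL,
  SALARY DECIMAL(18, 2),
  PRIMARY KEY (ID)
);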
The following query inserts values into this table using the INSERT statement −
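-- a sketch of the insert; the AGE values are assumptions chosen only to stay
-- consistent with the UNION example later in this section
INSERT INTO CUSTOMERS (ID, NAME, AGE, SALARY) VALUES
(1, 'Ramesh', 32, 2000.00),
(2, 'Khilan', 25, 1500.00),
(3, 'Kaushik', 23, 2000.00),
(4, 'Chaitali', 25, 6500.00),
(5, 'Hardik', 27, 8500.00),
(6, 'Komal', 22, 4500.00),
(7, 'Muffy', 24, 10000.00);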
Let us say we only want the data in ID, NAME and SALARY columns of the CUSTOMERS
table. So, we should only specify those three columns in our SELECT statement as
shown below −
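-- only the three required columns are listed, instead of SELECT *
SELECT ID, NAME, SALARY FROM CUSTOMERS;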
Output
The output obtained is as shown below −
ID NAME SALARY
1 Ramesh 2000.00
2 Khilan 1500.00
3 Kaushik 2000.00
4 Chaitali 6500.00
5 Hardik 8500.00
6 Komal 4500.00
7 Muffy 10000.00
2. Use Wildcards
Wildcards (%) are characters that we use to search for data based on patterns. These
wildcards, paired with indexes, improve performance because the database can use the
index to quickly locate the rows matching the pattern (provided the wildcard is not at the
very start of the pattern).
Example
If we want to retrieve the names of all the customers starting with K from the
CUSTOMERS table, then, the following query will provide the quickest result −
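-- a likely form of the query; ID is listed because it appears in the output below
SELECT ID, NAME FROM CUSTOMERS WHERE NAME LIKE 'K%';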
Output
ID NAME
2 Khilan
3 Kaushik
6 Komal
3. Use Explicit JOINs
SQL JOINs are used to combine two tables based on a common column. There are two
ways of creating a JOIN: the implicit join and the explicit join. The explicit join notation
uses the JOIN keyword with the ON clause to join two tables, while the implicit join
notation does not use the JOIN keyword and works with the WHERE clause.
Performance-wise, they are both on the same level. However, in more complicated
cases the implicit join notation might produce completely different results than the
explicit notation, so the explicit JOIN is preferred.
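As an illustration, assuming an ORDERS table with a CUSTOMER_ID column (this table is not defined elsewhere in this unit), the two notations look like this −
-- explicit join notation: JOIN ... ON
SELECT c.NAME, o.ID
FROM CUSTOMERS c
JOIN ORDERS o ON o.CUSTOMER_ID = c.ID;
-- implicit join notation: comma-separated tables filtered in WHERE
SELECT c.NAME, o.ID
FROM CUSTOMERS c, ORDERS o
WHERE o.CUSTOMER_ID = c.ID;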
4. Avoid Using DISTINCT
The DISTINCT operator in SQL is used to retrieve unique records from the database,
and on a properly designed database table with unique indexes we rarely need it.
But if we still have to use it on a table, using the GROUP BY clause instead of the
DISTINCT keyword shows better query performance (at least in some databases).
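For example, the following two queries on the CUSTOMERS table return the same set of salaries; in some databases the GROUP BY form performs better −
-- using DISTINCT
SELECT DISTINCT SALARY FROM CUSTOMERS;
-- equivalent query using GROUP BY
SELECT SALARY FROM CUSTOMERS GROUP BY SALARY;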
5. Avoid Multiple OR Conditions
Using several OR conditions in a single query degrades database performance, as the
entire table must be scanned multiple times to evaluate each condition. Instead, we can
split such a query into separate queries, which can be processed in parallel by the
database. Then, the results can be combined using the UNION operator.
Example
For example, let us say we have a requirement of getting the details of all the
customers whose age is greater than 25 or whose salary is greater than 2,000. The
following query retrieves their IDs and names by combining two SELECT statements
with UNION −
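-- a likely form of the query, based on the output shown below
SELECT ID, NAME FROM CUSTOMERS WHERE AGE > 25
UNION
SELECT ID, NAME FROM CUSTOMERS WHERE SALARY > 2000;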
Output
ID NAME
1 Ramesh
5 Hardik
4 Chaitali
6 Komal
7 Muffy
6. Use WHERE Instead of HAVING
The WHERE clause is more efficient than HAVING. With the WHERE clause, only the
records that match the condition are retrieved. But with the HAVING clause, all the
records are retrieved first and then filtered based on the condition. Therefore, the
WHERE clause is preferable wherever it can express the condition.
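For instance, both of the following queries on the CUSTOMERS table return customers with a salary above 2,000, but the first filters rows up front with WHERE −
-- filtering with WHERE: non-matching rows are discarded before any further processing
SELECT ID, NAME, SALARY FROM CUSTOMERS WHERE SALARY > 2000;
-- filtering with HAVING: rows are retrieved and grouped first, then filtered
SELECT ID, NAME, SALARY FROM CUSTOMERS GROUP BY ID, NAME, SALARY HAVING SALARY > 2000;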
We can use the following tools to analyse the performance of our queries (see the sketch after this list):
○ EXPLAIN − In SQL, the EXPLAIN command gives us the order in which a query is
executed along with the estimated cost of each step. We can use this to find the
most expensive steps of a query and optimize them.
○ tkprof − tkprof is a command that gives us various statistics, such as the CPU and
I/O usage of a query. By using these statistics, we can tune our queries to
reduce CPU and I/O utilization and increase the efficiency of our database.
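For example, in MySQL the plan for one of the earlier queries can be inspected like this (the exact output columns vary between database systems) −
EXPLAIN SELECT ID, NAME, SALARY FROM CUSTOMERS WHERE SALARY > 2000;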
Database Security
Database security is the practice of protecting sensitive data and information stored in a
database from unauthorized access, misuse, or loss. It covers the techniques and
strategies designed to ensure that only authorized individuals or entities can access or
modify the data, and that the data is protected against unauthorized disclosure,
malicious alterations or corruption, and theft through tangible means.
● Physical Security: Measures such as securing the server room, implementing access
controls for the data center, and employing security personnel protect the database
hardware from physical harm or theft, ensuring its integrity and availability.
● Network Security: Protects the database from threats arriving over the network, for
example by placing the database server behind firewalls and restricting which hosts
and applications are allowed to connect to it.
● Access Control: Authentication verifies the identity of users, while authorization
controls what actions each user can perform on the database. Together they ensure
that only authorized users can access the database and perform specific actions on
the data. Access control is commonly implemented through user accounts, roles, and
privileges.
● Data Encryption: Encrypting data at rest and in transit ensures that it remains unreadable
even if it falls into the wrong hands. Encryption also helps prevent data from being
tampered with while it is transmitted.
● Auditing and Logging: Auditing and logging serve as vital methods for overseeing and
tracing all system events. They are critical because they provide a record of all activities
performed on the database, which can be used to detect and prevent security breaches.
Auditing and logging also help meet regulatory and compliance requirements.
A breach of database security can harm an organization in several ways:
○ Damage to our brand's reputation: Customers or partners may not want to
purchase goods or services from us (or deal with our business) if they do not feel
they can trust our company to protect our data or their own.
○ The concept of business continuity (or lack of it): Some businesses cannot
continue to function until a breach has been resolved.
○ Penalties or fines for non-compliance: The cost of failing to comply with
international regulations such as the Sarbanes-Oxley Act (SOX) or the Payment Card
Industry Data Security Standard (PCI DSS), with industry-specific data privacy rules
such as HIPAA, or with regional privacy laws such as the European Union's General
Data Protection Regulation (GDPR) can be severe, with fines in the worst cases
exceeding several million dollars per violation.
○ Costs of repairing breaches and notifying consumers: Alongside notifying
customers of a breach, the breached company is required to cover the costs of
investigation and forensic services, crisis management, triage, repairs to the
affected systems, and much more.
Insider Dangers
An insider threat is a security threat from any of three kinds of actors with privileged
access to the database:
○ A malicious insider who intends to do harm.
○ A negligent insider who makes mistakes that leave the database vulnerable to attack.
○ An infiltrator, an outsider who has obtained the credentials of a legitimate user.
Insider dangers are among the most frequent sources of database security breaches.
They are often a consequence of allowing too many employees to hold privileged user
credentials.
Human Error
The unintentional mistakes, weak passwords or sharing passwords, and other negligent
or uninformed behaviours of users remain the root causes of almost half (49 percent) of
all data security breaches.
Exploitation of Database Software Vulnerabilities
Hackers earn their money by identifying and exploiting vulnerabilities in software such
as database management software. The major database software vendors and
open-source database management platforms release regular security patches to fix
these weaknesses. However, failing to apply the patches on time increases the risk of
being hacked.
SQL/NoSQL Injection Attacks
A threat specific to databases is the insertion of arbitrary SQL or non-SQL attack
strings into database queries served by web applications or HTTP headers.
Companies that do not follow secure coding practices for their web applications and do
not conduct regular vulnerability tests are susceptible to these attacks.
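One element of safe coding is never concatenating user input into the query text. As a sketch, using MySQL's prepared-statement syntax and the CUSTOMERS table from earlier −
-- the placeholder (?) keeps user-supplied input out of the SQL text itself
PREPARE stmt FROM 'SELECT ID, NAME FROM CUSTOMERS WHERE NAME = ?';
SET @customer_name = 'Khilan';
EXECUTE stmt USING @customer_name;
DEALLOCATE PREPARE stmt;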
Buffer Overflow Exploitations
A buffer overflow happens when a program tries to copy more data into a memory block
of a certain length than the block can accommodate. Attackers may use the extra data,
which spills into adjacent memory addresses, as a foundation from which to launch
attacks.
Denial-of-Service (DoS) Attacks
In a denial-of-service (DoS) attack, the attacker overwhelms the targeted server (in this
case, the database server) with such a large volume of requests that the server can no
longer fulfil legitimate requests from actual users. In many cases, the server becomes
unstable or fails altogether.
Malware
Malware is software written specifically to exploit vulnerabilities or otherwise cause
damage to the database; it can arrive via any endpoint device connecting to the
database's network.
Attacks on Backups
Companies that do not protect backup data using the same rigorous controls employed
to protect databases themselves are at risk of cyberattacks on backups.
○ Data volumes are growing: Data capture, storage, and processing continue to
increase exponentially in almost all organizations. Any data security tools or methods
must be highly scalable to meet current as well as future needs.
○ Optimization of data security and risk analysis: An application that provides
contextual insights by combining security data with advanced analytics allows users
to perform optimization, risk assessment, and reporting with ease. Choose a tool that
can keep and combine large amounts of recent and historical data about the security
and state of your databases, and that provides data exploration, auditing, and
reporting capabilities via an extensive but user-friendly self-service dashboard.
Businesses and organizations heavily rely on databases, which store sensitive and
critical information. Here are the data security best practices that will help you to secure
your database:
● Use Strong Passwords: Enforce strong, regularly updated passwords for every
database account, and revoke credentials as soon as an employee leaves the company.
This ensures the previous employee cannot access the data.
● Limit Access: Grant each user only the access they need; not all employees or users
require access to all the data. For example, if a user only needs access to customer
data, they should only have access to that section of the database.
● Apply Updates and Patches: Keeping the database software up to date ensures that
any known security vulnerabilities are fixed and that the database is protected against
potential threats. Failure to apply updates and patches can leave the database exposed
to known attacks.
DCL Commands
DCL is an abbreviation for Data Control Language in SQL. It is used to provide different
users access to the stored data. It enables the data administrator to grant or revoke the
required access for users acting on the database. When DCL commands are implemented,
they control which users can read or modify the stored data.
○ DCL, DDL, DML, DQL, and TCL commands form the SQL (Structured Query
Language).
○ DCL commands are primarily used to implement access control on the data
stored in the database. They are used alongside the DML (Data Manipulation
Language) and DDL (Data Definition Language) commands.
DCL commands are very useful, especially when several users access the database.
They enable the administrator to manage access control. The two types of DCL
commands are as follows:
○ GRANT
○ REVOKE
GRANT Command
GRANT, as the name itself suggests, provides access. This command allows the
administrator to give a user specific privileges or permissions over the database or a
database object, such as a table, view, or procedure, and thus to perform particular
operations on it.
In simple language, the GRANT command allows the user to implement other SQL
commands on the database or its objects. The primary function of the GRANT
command in SQL is to give administrators the ability to ensure the security and integrity
of the data stored in the database. Suppose you want a specific user, Aman, to only
SELECT (read/retrieve) the data from the student table. Then you can grant that
permission with the GRANT statement shown below.
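-- a likely form of the statement described above; the exact user syntax varies by
-- database (for example, 'Aman'@'localhost' in MySQL)
GRANT SELECT ON student TO Aman;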
This command will allow Aman to implement the SELECT queries on the student table.
This will enable the user to read or retrieve information from the student table.
REVOKE Command
As the name suggests, to revoke is to take away. The REVOKE command enables the
administrator to remove previously granted privileges or permissions from a user over a
database or database object, such as a table, view, or procedure. The REVOKE command
prevents the user from accessing or performing the specified operation on that object.
In simple language, the REVOKE command terminates the ability of the user to perform
the mentioned SQL command in the REVOKE query on the database or its component.
The primary reason for implementing the REVOKE query in the database is to ensure the
security and integrity of the data.
Let us use an example to better understand how to implement the REVOKE command in
SQL.
In the earlier implementation of the GRANT command, the user Aman was given
permission to implement a SELECT query on the student table, which allowed Aman to
read or retrieve data from the table. Due to certain circumstances, the administrator now
wants to revoke the abovementioned permission. To do so, the administrator can
implement the REVOKE statement shown below.
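-- a likely form of the statement described above
REVOKE SELECT ON student FROM Aman;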
This will stop the user Aman from implementing the SELECT query on the student table.
DCL commands play an important role in controlling access to the database. However,
let's see some of the most common issues users face when implementing DCL commands:
3. Risk of human error: Human administrators execute DCL commands and can
make mistakes in granting or revoking privileges, thus giving unauthorized
access to data or imposing unintended restrictions on access.