Chapter1_Sem4
Chapter 1
Database for Business Performance Improvement - OLAP & OLTP, Data Lake,
Data Warehousing (Concept, Features, Architecture & Analytical Techniques -
Roll up, Drill Down, Slicing, Pivot), Data Mining and Forecasting, Data Mart,
Data Backup (Concept & Types).
SSR 04/12/2025
Data
In general, data is any set of characters that is gathered and translated for
some purpose, usually analysis. If data is not put into context, it is of no use
to a human or a computer.
There are multiple types of data. Some of the more common types of data
include the following:
Single character
Boolean (true or false)
Text (string)
Number (integer or floating-point)
Picture
Sound
Video
In computing, data is information that has been translated into a form that is efficient for
movement or processing. Relative to today's computers and transmission media, data is
information converted into binary digital form. It is acceptable for data to be used as a
singular subject or a plural subject. Raw data is a term used to describe data in its most basic
digital format.
The concept of data in the context of computing has its roots in the work of Claude Shannon,
an American mathematician known as the father of information theory. He ushered in binary
digital concepts based on applying two-value Boolean logic to electronic circuits. Binary digit
formats underlie the CPUs, semiconductor memories and disk drives, as well as many of the
peripheral devices common in computing today. Early computer input for both control and
data took the form of punch cards, followed by magnetic tape and the hard disk.
How data is stored
Hierarchy of Data
Bit - a bit is the smallest unit of data representation (the value of a bit may be 0 or 1).
Eight bits make a byte, which can represent a character or a special symbol in a character code.
Field - a field consists of a grouping of characters. A data field represents an attribute (a
characteristic or quality) of some entity (object, person, place, or event).
Record - a record represents a collection of attributes that describe a real-world entity. A record
consists of fields, with each field describing an attribute of the entity.
File - a group of related records. Files are frequently classified by the application for which they are
primarily used (employee file). A primary key in a file is the field (or fields) whose value identifies
a record among others in a data file.
Database - is an integrated collection of logically related records or files. A database consolidates
records previously stored in separate files into a common pool of data records that provides data
for many applications. The data is managed by systems software called database management
systems (DBMS). The data stored in a database is independent of the application programs using it
and of the types of secondary storage devices on which it is stored.
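The hierarchy above (bit, byte, field, record, file) can be sketched with plain Python structures. The field names and values below (emp_id, name, dept) are illustrative, not from the text.

```python
# Bit/byte level: the character "A" is stored as eight bits.
char = "A"
bits = format(ord(char), "08b")          # eight-bit binary form of the character

# Field: a grouping of characters describing one attribute of an entity.
name_field = "Alice"

# Record: a collection of fields describing one real-world entity.
employee_record = {"emp_id": 101, "name": "Alice", "dept": "Sales"}

# File: a group of related records; "emp_id" acts as the primary key.
employee_file = [
    {"emp_id": 101, "name": "Alice", "dept": "Sales"},
    {"emp_id": 102, "name": "Bob", "dept": "HR"},
]

# A primary key must identify each record uniquely within the file.
keys = [rec["emp_id"] for rec in employee_file]
assert len(keys) == len(set(keys))
print(bits)
```

A database, one level up, would be an integrated pool of many such files managed by a DBMS.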
Information
Information is the summarization of data. Technically, data are raw facts and
figures that are processed into information, such as summaries and totals. But
since information can also be the raw data for the next job or person, the two terms
cannot be precisely defined, and both are used interchangeably. It may be helpful
to view information the way it is structured and used, namely: data, text,
spreadsheets, pictures, voice and video. Data are discretely defined fields. Text is a
collection of words. Spreadsheets are data in matrix (row and column) form.
Pictures are lists of vectors or frames of bits. Voice is a continuous stream of sound
waves. Video is a sequence of image frames.
Centralized, Decentralized and
Distributed Systems
1. CENTRALIZED SYSTEMS:
Centralized systems use a client/server architecture in which one or more
client nodes are directly connected to a central server. This is the most
commonly used type of system in many organisations: a client sends a request
to a company server and receives the response.
Example –
Wikipedia. Consider a massive server to which we send our requests;
the server responds with the article that we requested. Suppose we
enter the search term 'junk food' in the Wikipedia search bar. This
search term is sent as a request to the Wikipedia servers (mostly
located in Virginia, U.S.A.), which then respond with articles ranked
by relevance. In this situation, we are the client node and the
Wikipedia servers are the central server.
2. DECENTRALIZED SYSTEMS:
These are another type of system, one that has been gaining a lot of
popularity, primarily because of the massive hype around Bitcoin. Many
organisations are now trying to find applications for such systems.
In decentralized systems, every node makes its own decision. The final
behaviour of the system is the aggregate of the decisions of the individual
nodes. Note that there is no single entity that receives and responds to
requests.
Example –
Bitcoin. Let's take Bitcoin as an example because it is the most popular use
case of decentralized systems. No single entity/organisation owns the
Bitcoin network. The network is the sum of all the nodes that talk to each
other to maintain the amount of bitcoin every account holder has.
3. DISTRIBUTED SYSTEMS:
In distributed systems, the work is divided among multiple nodes that
coordinate with one another over a network to complete a single task. To the
user the system appears to be a single machine, even though many computers
cooperate behind the scenes.
Example –
Google search system. Each request is worked upon by hundreds of
computers, which crawl the web and return the relevant results. To the
user, Google appears to be one system, but it is actually multiple
computers working together to accomplish a single task (returning the
results for the search query).
Types of Data Processing
Batch Processing:
This is one of the most widely used types of data processing, also
known as serial/sequential, stacked/queued, or offline processing. The
fundamental idea of this type of processing is that jobs from
different users are processed in the order received. Once the
stacking of jobs is complete, they are sent for processing
while maintaining the same order. Processing a large volume of data
this way helps reduce the processing cost, making data processing
economical. Batch processing is a method where the information to be
organized is sorted into groups to allow for efficient and sequential
processing.
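The stack-then-process idea above can be sketched as a simple first-in, first-out job queue. The job names are illustrative.

```python
from collections import deque

# Jobs are stacked in the order received; nothing runs until the
# whole batch is processed, and processing keeps the same order.
batch_queue = deque()

def submit(job):
    """Stack a job for later batch processing."""
    batch_queue.append(job)

def process_batch():
    """Process every stacked job in first-in, first-out order."""
    results = []
    while batch_queue:
        job = batch_queue.popleft()
        results.append(f"done:{job}")
    return results

submit("payroll")
submit("invoices")
submit("report")
print(process_batch())   # jobs complete in submission order
```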
Online Processing:
This processing method is a part of automatic processing and is sometimes known as
direct or random-access processing. Under this method, a job received by the system is
processed at the time it is received; it is therefore often grouped with (and sometimes
confused with) real-time processing. This system features rapid, random input of
transactions and user-demanded direct access to databases/content when needed. It is a
method that utilizes Internet connections and equipment directly attached to a computer,
which allows the data to be stored in one place and used at an altogether different place.
Cloud computing can be considered an example that uses this type of processing. It is
used mainly for information recording and research.
Distributed Processing:
This method is commonly utilized by remote workstations connected to one big central
workstation or server. ATMs are good examples of this data processing method. All the end
machines run fixed software located at a particular place and make use of exactly the same
information and sets of instructions.
Flat Files
Disadvantages of DBMS
The cost of the hardware and software for a DBMS is quite high, which increases an
organization's budget.
Most database management systems are complex, so training is required for users
to operate the DBMS.
In some organizations all data is integrated into a single database, which can be
damaged by an electrical failure, or the database can become corrupted on the
storage media.
Simultaneous use of the same program by many users sometimes leads to the loss
of some data.
A DBMS cannot perform sophisticated calculations.
Why RDBMS?
First of all, its number one feature is the ability to store data in tables. The fact that the very
storage of data is in a structured form can significantly reduce iteration time.
Data persists in the form of rows and columns, and a primary key facility allows unique
identification of rows.
It creates indexes for quicker data retrieval.
It allows for various types of data integrity: (i) entity integrity, wherein no duplicate rows
exist in a table; (ii) domain integrity, which enforces valid entries for a given column by
restricting the type, the format, or the range of values; (iii) referential integrity, which
prevents the deletion of rows that are in use by other records; and (iv) user-defined
integrity, which provides specific business rules that do not fall into the above three.
It also allows for virtual table creation, which provides a safe means to store and secure
sensitive content.
Common column implementation and multi-user accessibility are also included among
RDBMS features.
Advantages of RDBMS
Data is stored only once, and hence multiple record changes are not required. Deletion
and modification of data also become simpler, and storage efficiency is very high.
Complex queries can be carried out using Structured Query Language (SQL). Keywords
such as INSERT, UPDATE, DELETE, CREATE and DROP help in accessing the particular
data of choice.
Better security is offered by the creation of tables. Certain tables can be protected by the
system. Users can set access barriers to limit access to the available content. This is very
useful in companies where a manager can decide which data is provided to employees and
customers, so a customized level of data protection can be enabled.
There is provision for future requirements, as new data can easily be added and appended
to the existing tables and made consistent with the previously available content. This is a
feature that no flat file database has.
ACID Properties of a Transaction
Atomicity:
By this, we mean that either the entire transaction takes place at once or doesn’t happen at all.
There is no midway i.e. transactions do not occur partially. Each transaction is considered as one
unit and either runs to completion or is not executed at all. It involves the following two
operations.
Abort: If a transaction aborts, changes made to database are not
visible.
Commit: If a transaction commits, changes made are visible.
Consistency:
This means that integrity constraints must be maintained so that the database is consistent before
and after the transaction. It refers to the correctness of a database.
For example, suppose a transaction T transfers 100 from account A (balance 500) to
account B (balance 200). The total amount before and after the transaction must be
maintained:
Total before T occurs = 500 + 200 = 700.
Total after T occurs = 400 + 300 = 700.
Therefore, the database is consistent.
Isolation:
This property ensures that multiple transactions can occur concurrently without leading
to the inconsistency of database state. Transactions occur independently without
interference. Changes occurring in a particular transaction will not be visible to any other
transaction until that particular change in that transaction is written to memory or has
been committed. This property ensures that executing transactions concurrently
results in a state equivalent to the one achieved if they were executed serially in
some order.
Durability:
This property ensures that once the transaction has completed execution, the updates
and modifications to the database are stored in and written to disk and they persist even
if a system failure occurs. These updates now become permanent and are stored in non-
volatile memory. The effects of the transaction, thus, are never lost.
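Atomicity's abort/commit behaviour can be sketched with the standard library's sqlite3 module, mirroring the 500 + 200 consistency example above. The account names and the simulated failure are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("A", 500), ("B", 200)])
conn.commit()

try:
    # Transfer 100 from A to B as one transaction.
    conn.execute("UPDATE account SET balance = balance - 100 WHERE name = 'A'")
    raise RuntimeError("simulated crash mid-transaction")
    conn.execute("UPDATE account SET balance = balance + 100 WHERE name = 'B'")
    conn.commit()        # commit: changes become visible
except RuntimeError:
    conn.rollback()      # abort: partial changes are not visible

total = conn.execute("SELECT SUM(balance) FROM account").fetchone()[0]
print(total)  # the transaction "happened not at all", so the total is unchanged
```

Because the half-done transfer was rolled back, A still holds 500 and the 500 + 200 = 700 invariant is preserved.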
Concepts of RDBMS:
Entity:-
An object that can be uniquely identified is called an entity.
An entity can be of two types:
Tangible Entity: Tangible Entities are those entities which exist in
the real world physically. Example: Person, car, etc.
Intangible Entity: Intangible Entities are those entities which
exist only logically and have no physical
existence. Example: Bank Account, etc.
Tuples / Rows:
A single entry in a table is called a Tuple or Record or Row. A tuple in a table represents a set of related
data.
Table:
In the relational database model, a table is a collection of data elements organized in terms of
rows and columns. A table is also considered a convenient representation of relations.
RDBMS Keys
SUPER KEY:
Super Key is a set of attributes whose set of values can uniquely identify an entity instance in
the entity set. It contains one or more than one attributes. It is the broadest definition of
unique identifiers of an entity in an entity set.
The combination of "SSN" and "Name" is a super key of the entity set customer.
CANDIDATE KEY:
Candidate key is a set of one or more attributes whose set of values can uniquely identify an
entity instance in the entity set. Any attribute in the candidate key cannot be omitted without
destroying the uniqueness property of the candidate key. It is a minimal super key. When
building a database in database software, only one candidate key can be designated as the
unique identifier of an entity for an entity set.
Example: (SSN, Name) is NOT a candidate key, because taking out "Name" still leaves "SSN",
which can uniquely identify an entity. "SSN" is a candidate key of the customer entity set.
Example: Both "SSN" and "License #" are candidate keys of the Driver entity set.
Overall, Super Key is the broadest unique identifier; Candidate Key is a subset of Super Key; and
Primary Key is a subset of Candidate Key. In practice, we would first look for Super Keys. Then we
look for Candidate Keys based on experience and common sense. If there is only one Candidate Key,
it naturally will be designated as the Primary Key. If we find more than one Candidate Key, then we
can designate any one of them as Primary Key.
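The super-key/candidate-key distinction can be checked mechanically over a concrete table. The customer rows below are illustrative sample data, not from the text.

```python
from itertools import combinations

rows = [
    {"SSN": "111", "Name": "Ann", "City": "Pune"},
    {"SSN": "222", "Name": "Ann", "City": "Delhi"},
    {"SSN": "333", "Name": "Raj", "City": "Pune"},
]

def is_super_key(attrs):
    """attrs form a super key if their combined values identify every row uniquely."""
    values = [tuple(r[a] for a in attrs) for r in rows]
    return len(values) == len(set(values))

def is_candidate_key(attrs):
    """A candidate key is a minimal super key: no proper subset is itself a super key."""
    if not is_super_key(attrs):
        return False
    return not any(is_super_key(sub)
                   for n in range(1, len(attrs))
                   for sub in combinations(attrs, n))

print(is_super_key(("SSN", "Name")))      # a super key
print(is_candidate_key(("SSN", "Name")))  # not minimal: SSN alone suffices
print(is_candidate_key(("SSN",)))         # minimal, hence a candidate key
```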
PRIMARY KEY:
The Primary Key is an attribute or a set of attributes that uniquely identify a specific instance
of an entity. Every entity in the data model must have a primary key whose values uniquely
identify instances of the entity.
To qualify as a primary key for an entity, an attribute must have the following properties:
It must have a non-null value for each instance of the entity.
The value must be unique for each instance of the entity.
The values must not change or become null during the life of each entity instance.
Primary and foreign keys are the most basic components on which relational theory is based.
Each entity must have an attribute or attributes, the primary key, whose values uniquely
identify each instance of the entity. Every child entity must have an attribute, the foreign key,
that completes the association with the parent entity.
FOREIGN KEY:
A Foreign key is an attribute that completes a relationship by identifying the parent entity. Foreign
keys provide a method for maintaining integrity in the data (called referential integrity) and for
navigating between different instances of an entity. Every relationship in the model must be
supported by a foreign key.
Every dependent and category (subtype) entity in the model must have a foreign key for each
relationship in which it participates. Foreign keys are formed in dependent and subtype entities by
migrating the entire primary key from the parent or generic entity. If the primary key is
composite, it may not be split.
COMPOSITE KEY:
When a primary key is created from a combination of two or more columns, it is called a
composite key. Each column may not be unique by itself within the database table, but when
combined with the other column(s) in the composite key, the combination is unique.
Referential integrity refers to the accuracy and consistency of data within a relationship.
In relationships, data is linked between two or more tables. This is achieved by having the foreign
key (in the associated table) reference a primary key value (in the primary – or parent –
table). Because of this, we need to ensure that data on both sides of the relationship remain
intact.
So, referential integrity requires that, whenever a foreign key value is used, it must
reference a valid, existing primary key in the parent table.
For example, if we delete row number 15 in a primary table, we need to be sure that there’s
no foreign key in any related table with the value of 15. We should only be able to delete a
primary key if there are no associated rows. Otherwise, we would end up with an orphaned
record.
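The "delete row 15" scenario above can be sketched with sqlite3, where foreign-key enforcement must be switched on per connection. Table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite leaves FK checks off by default
conn.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE emp (
    id INTEGER PRIMARY KEY,
    dept_id INTEGER REFERENCES dept(id))""")
conn.execute("INSERT INTO dept VALUES (15, 'Sales')")
conn.execute("INSERT INTO emp VALUES (1, 15)")

# Deleting the parent row would orphan emp row 1, so the DBMS refuses.
try:
    conn.execute("DELETE FROM dept WHERE id = 15")
    deleted = True
except sqlite3.IntegrityError:
    deleted = False
print(deleted)  # the parent row survives because it is still referenced
```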
1. Domain constraints:
Domain constraints can be defined as the definition of a valid set of values for an
attribute.
The data type of domain includes string, character, integer, time, date, currency, etc.
The value of the attribute must be available in the corresponding domain.
4. Key constraints:
Keys are attributes used to identify an entity uniquely within its entity set.
An entity set can have multiple keys, but one of them will be the primary key. A
primary key must contain a unique value and cannot contain a null value in the
relational table.
Relations in DBMS:
In relational databases, a relationship exists between two tables when one of them has a
foreign key that references the primary key of the other table. This single fact allows
relational databases to split and store data in different tables, yet still link the disparate
data items together. It is one of the features that makes relational databases such
powerful and efficient stores of information. Relation may also be known as relationship.
A database schema is the skeleton structure that represents the view of the entire database. It
defines how the data is organized and how the relations among them are associated. It formulates all
the constraints that are to be applied on the data.
A database schema defines its entities and the relationship among them. It contains a descriptive
detail of the database.
The design of a database at the physical level is called the physical schema; it describes
how the data is stored in blocks of storage.
The design of a database at the logical level is called the logical schema. Programmers and
database administrators work at this level, where data is described as the types of data
records stored in data structures. Internal details, such as how those data structures are
implemented, are hidden at this level (they are available at the physical level).
The design of a database at the view level is called the view schema. This generally
describes the end user's interaction with the database system.
Views
Views in a DBMS are a kind of virtual table. A view has rows and columns just as they
are in a real table in the database. We can create a view by selecting fields from one
or more tables present in the database. A view can contain either all the rows of a table
or only specific rows based on a certain condition.
Views can join and simplify multiple tables into a single virtual table. Views can
act as aggregated tables, where the database engine aggregates data (sum,
average, etc.) and presents the calculated results as part of the data. Views can
hide the complexity of data.
The difference between a view and a table is that views are definitions built on
top of other tables (or views), and do not hold data themselves. If data is
changing in the underlying table, the same change is reflected in the view.
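A view that aggregates and then reflects changes in its underlying table can be shown with sqlite3. The sales table and figures are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 100), ("East", 250), ("West", 300)])

# The view stores no data of its own, only its defining query.
conn.execute("""CREATE VIEW region_totals AS
                SELECT region, SUM(amount) AS total
                FROM sales GROUP BY region""")
print(conn.execute("SELECT * FROM region_totals ORDER BY region").fetchall())

# A change in the underlying table is reflected in the view immediately.
conn.execute("INSERT INTO sales VALUES ('West', 50)")
print(conn.execute("SELECT total FROM region_totals WHERE region='West'").fetchone())
```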
Metadata
Data Dictionary
SQL
SQL stands for Structured Query Language; it lets you access and manipulate databases.
SQL is a standardized programming language that is used to manage relational databases
and perform various operations on the data in them.
The uses of SQL include modifying database table and index structures; adding,
updating and deleting rows of data; and retrieving subsets of information from
within a database for transaction processing and analytics applications.
Queries and other SQL operations take the form of commands written as
statements; commonly used SQL statements include SELECT, INSERT,
UPDATE, DELETE, CREATE, ALTER and TRUNCATE.
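The statements listed above can be exercised against an in-memory SQLite database via Python's sqlite3 module. The table and its contents are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, price REAL)")  # CREATE
conn.execute("INSERT INTO book VALUES (1, 'SQL Basics', 20.0)")                     # INSERT
conn.execute("INSERT INTO book VALUES (2, 'Data 101', 15.0)")
conn.execute("UPDATE book SET price = 18.0 WHERE id = 1")                           # UPDATE
conn.execute("DELETE FROM book WHERE id = 2")                                       # DELETE
conn.execute("ALTER TABLE book ADD COLUMN stock INTEGER")                           # ALTER

rows = conn.execute("SELECT id, title, price FROM book").fetchall()                 # SELECT
print(rows)
```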
Data Definition Language (DDL):
DDL commands change the structure of a table: creating a table, deleting a table, altering a
table, etc.
All DDL commands are auto-committed, which means they permanently save the changes
in the database.
Data Manipulation Language (DML):
DML commands are used to modify the data stored in tables. DML commands are not
auto-committed, which means the changes are not saved permanently in the database by
themselves; they can be rolled back.
a. SELECT: This is the same as the projection operation of relational algebra. It is used to
select attributes based on the condition described by the WHERE clause.
b. INSERT: The INSERT statement is an SQL query used to insert data into a row of a
table.
c. UPDATE: This command is used to update or modify the value of a column in a table.
d. DELETE: It is used to remove one or more rows from a table.
Transaction Control Language (TCL):
TCL commands can only be used with DML commands like INSERT, DELETE and UPDATE.
They cannot be used while creating or dropping tables, because those DDL operations
are automatically committed in the database.
a. COMMIT: The COMMIT command is used to save all of a transaction's changes to the
database.
b. ROLLBACK: The ROLLBACK command is used to undo transactions that have not yet
been committed to the database.
c. SAVEPOINT: It is used to roll the transaction back to a certain point without rolling
back the entire transaction.
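All three TCL commands can be demonstrated with SQLite, which supports SAVEPOINT directly. The connection is opened in autocommit mode so the transaction-control statements are issued explicitly as SQL; table and savepoint names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # we issue TCL ourselves
conn.execute("CREATE TABLE log (msg TEXT)")

conn.execute("BEGIN")
conn.execute("INSERT INTO log VALUES ('step 1')")
conn.execute("SAVEPOINT sp1")                    # mark a point inside the transaction
conn.execute("INSERT INTO log VALUES ('step 2')")
conn.execute("ROLLBACK TO SAVEPOINT sp1")        # undo only back to sp1
conn.execute("COMMIT")                           # COMMIT keeps 'step 1'

print(conn.execute("SELECT msg FROM log").fetchall())  # 'step 2' was undone
```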
OLAP:
OLAP stands for On-Line Analytical Processing, a category of software tools that provide
analysis of data for business decisions. OLAP systems allow users to analyse information
from multiple database systems at one time, for tasks such as sales analysis and
forecasting, market research, and budgeting. A data warehouse is an example of an OLAP
system; any data-warehouse system is an OLAP system.
Example of OLAP
A company might compare its mobile phone sales in September with sales in October, then
compare those results with sales at another location, which may be stored in a separate
database.
Amazon analyses purchases made by its customers to come up with a personalized
homepage featuring products likely to interest each customer.
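The analytical techniques this chapter names (roll-up, drill-down, slicing, pivot) can be sketched in plain Python over a tiny sales "cube" like the September/October example above. All figures and store names are illustrative.

```python
from collections import defaultdict

sales = [  # (month, location, units) - one cell of the cube per row
    ("Sep", "Store-1", 120), ("Sep", "Store-2", 80),
    ("Oct", "Store-1", 150), ("Oct", "Store-2", 90),
]

# Roll-up: aggregate away the location dimension to totals per month.
# (Drill-down is the inverse: going from month totals back to per-store rows.)
by_month = defaultdict(int)
for month, loc, units in sales:
    by_month[month] += units
print(dict(by_month))

# Slice: fix one dimension (month = "Sep") and keep the rest.
sep_slice = [row for row in sales if row[0] == "Sep"]

# Pivot: rotate the data so locations become columns for each month.
pivot = defaultdict(dict)
for month, loc, units in sales:
    pivot[month][loc] = units
print(pivot["Oct"])
```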
OLTP:
Online transaction processing, known as OLTP, supports transaction-oriented
applications in a 3-tier architecture. OLTP administers the day-to-day transactions of an
organization.
OLTP stands for On-Line Transaction Processing. It is used for maintaining online
transactions and record integrity in multiple-access environments. An OLTP system
manages a very large number of short online transactions; an ATM, for example.
Consider an ATM as an example of an OLTP system. Assume a couple has a joint account
with a bank. One day both simultaneously reach different ATM centres at precisely the
same time and want to withdraw the total amount present in their bank account.
• Online banking
• Purchasing a book online
• Booking an airline ticket
• Sending a text message
• Order entry
• Telemarketers entering telephone survey results
• Call center staff viewing and updating customers’ details
Data Architecture:
What is Data Warehousing?
1. Subject oriented
A data warehouse is subject-oriented: it provides information on a topic (such as
inventory, promotions, or storage) rather than on the ongoing operations of an
organization. A data warehouse never concentrates on current processes; instead, it
emphasizes modelling and analysing data for decision-making. It also provides a simple
and succinct view of the particular subject by excluding details that would not be useful
in the decision process.
2. Integrated
Integration in a data warehouse means establishing a standard unit of measurement,
drawn from the different source databases, for all similar data. The data must also be
stored in a simple and universally acceptable manner within the data warehouse. A data
warehouse is created by combining data from various sources such as mainframes,
relational databases, flat files, etc. It must keep naming conventions, formats, and coding
consistent; such consistency assists in robust data analysis and must be maintained in
naming conventions, measurements of characteristics, encoding specifications, etc.
3. Time-variant
Compared to operational systems, the time horizon of a data warehouse is quite
extensive. The data collected in a data warehouse is identified with a particular time
period and provides historical information. It contains a temporal element, either
explicitly or implicitly.
One place where data warehouse data shows this time variation is in the record key
structure: every primary key contained in the data warehouse should include an element
of time, either implicitly or explicitly, such as the day, the week, or the month.
4. Non-volatile
The data warehouse is also non-volatile, meaning that prior data is not erased when
new data is entered into it. Data is read-only and is only refreshed periodically. This
assists in analysing historical data and in understanding what happened and when.
Transaction processing, recovery, and concurrency control mechanisms are not required.
In the data warehouse environment, the delete, update, and insert activities that are
performed in an operational application environment are omitted.
Analytical Techniques:
Data Marts
Data Lake
What is Data Mining?
Data mining is the process of uncovering patterns and finding anomalies and relationships in
large datasets that can be used to make predictions about future trends. The main purpose of
data mining is extracting valuable information from available data.
Data mining is considered an interdisciplinary field that joins the techniques of computer
science and statistics. Note that the term “data mining” is a misnomer. It is primarily
concerned with discovering patterns and anomalies within datasets, but it is not related to the
extraction of the data itself.
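The "uncovering patterns in large datasets" idea can be sketched with a toy market-basket count: which item pairs frequently occur together across transactions. The baskets below are illustrative sample data.

```python
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

# Count every unordered item pair that co-occurs within a basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most common pair is a candidate "pattern" discovered in the data.
print(pair_counts.most_common(1))
```

Real data mining applies the same counting idea at scale, with algorithms that prune the combinatorial search instead of enumerating every pair.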
Applications:
Data mining offers many applications in business. For example, the establishment of proper
data (mining) processes can help a company to decrease its costs, increase revenues, or
derive insights from the behaviour and practices of its customers. Certainly, it plays a vital
role in the business decision-making process nowadays.
Data mining is also actively utilized in finance. For instance, relevant techniques allow users
to determine and assess the factors that influence the price fluctuations of financial
securities.
The field is rapidly evolving. New data emerges at enormously fast speeds while
technological advancements allow for more efficient ways to solve existing problems. In
addition, developments in the areas of artificial intelligence and machine learning provide
new paths to precision and efficiency in the field.