DWH Question

A data warehouse is an integrated, time-variant, subject-oriented, and non-volatile database used for business analysis and decision-making. It can be designed using top-down or bottom-up approaches, with the latter being preferred in real-time scenarios due to its advantages in simplicity, time, cost, and security. Data warehouses utilize various databases and tools for ETL processes, reporting, and data modeling, with dimensional modeling being common for organizing data efficiently.

1. What is a data warehouse?

According to Inmon, a data warehouse is a time-variant, integrated,
subject-oriented and non-volatile database.

According to Ralph Kimball, a data warehouse is a historical database which
is used for business analysis and decision making.

A data warehouse is a collection of tables with relationships between them.
A data warehouse can be built on top of a database.

2. Why data warehouse?

A data warehouse is used for decision making and forecasting by top-level
management across all domains/departments in a business.

3. What are the characteristics of data warehouse?

Time Variant: The data warehouse is designed so that business users can
compare data across time periods. You can compare the present year with the
previous year, the current quarter with the previous quarter, the current
month with the previous month, the current week with the previous week, and
the current day with the previous day.

Integrated: A data warehouse contains data from multiple source systems or
OLTP systems.

Subject Oriented: A data warehouse contains information for all subjects or
departments. Top-level management from each subject can access the data
warehouse and take decisions.

Non-Volatile: Data in the data warehouse changes far less frequently than
data in the source systems. Once loaded, it is not modified by day-to-day
transactions.

4. What is a data mart?


A data mart is a subset of an enterprise data warehouse. At any point of
time a data mart contains information for a single subject or a single
department. If it contains only sales information we call it a sales data
mart; if it contains only finance information we call it a finance data
mart.

5. What is an enterprise data warehouse?

An enterprise data warehouse contains information for all subjects or all
departments: sales, finance, manufacturing, supply chain, IT etc.

6. What are the possible databases used to build a data warehouse?

There are different databases available in the market: DB2, Sybase,
SQL Server, Oracle, Teradata, Vertica, Netezza. We can choose any of these
databases to build a data warehouse.

7. In how many ways can you design a data warehouse?

There are two ways to design a data warehouse.

a) Top-down approach
b) Bottom-up approach

8. What is Top-down approach?

The top-down approach is suggested by Inmon. In this approach we first
build the enterprise data warehouse, and then extract data from the
enterprise data warehouse to build small data marts.

9. What is Bottom-up approach?

The bottom-up approach is suggested by Ralph Kimball. In this approach we
first build small data marts, and then build the enterprise data warehouse
by extracting data from all these data marts.

10. Which data warehouse design do we follow in real projects?

In real projects we generally follow the bottom-up approach. There are many
advantages of following this approach.

11. What are the advantages of following Bottom-up approach?
OR
What are the advantages of building data mart?

a) Requirements gathering is simple: for example, to build a sales data
mart we need to talk to sales people only.

b) It takes less time to build (6 to 9 months).

c) The client or company pays less.

d) Security is high: when you construct a sales data mart, you provide
access to sales people only.

12. What is OLTP?

OLTP means online transaction processing. OLTP systems are mainly used to
run day-to-day business operations, often across countries or continents.

13. What is OLAP?

OLAP means online analytical processing. OLAP systems are mainly used for
analytical purposes.

14. What is the difference between OLTP and OLAP?

OLTP systems                      OLAP systems
Contain normalized data           Contain denormalized data
Contain detailed data             Contain aggregated data
Are for clerical access           Are for managerial access
Contain partial history           Contain complete history
Data is volatile                  Data is non-volatile

15. Can you provide data warehouse architecture?

In the generic architecture of a data warehouse, data is first extracted
from multiple source systems and loaded into a staging area. From the
staging area, data is extracted and loaded into dimension and fact tables.

16. What is business intelligence?


Business intelligence refers to a variety of software applications,
strategies and processes used to analyze an organization's raw data.

17. How is a data warehouse project built in real projects?

In real projects we have two teams for any data warehouse or business
intelligence project: the ETL team and the reporting team. The ETL team
extracts data from multiple sources and loads it into the data warehouse.
The reporting team uses the data warehouse as its source and produces
reports according to business requirements.

18. What is ETL?

E - Extraction
T - Transformation
L - Loading

Extraction means fetching or pulling data from multiple sources.
Transformation means applying business rules to the extracted data.
Loading means finally inserting the data into the data warehouse.

19. What are different types of ETL tools available in the market?

Below are some of the ETL tools in the market.

Informatica
DataStage
Ab Initio
Talend
Pentaho etc

20. What are different types of reporting tools available in the market?

Below are some of the reporting tools in the market.

OBIEE
Business Objects
SAS
MicroStrategy
Cognos etc

21. What is data acquisition?

Data acquisition is the process of extracting data from multiple sources,
applying business rules to the extracted data, and finally loading the data
into the data warehouse.

22. What is data merging?

Data merging means combining data from multiple source systems. We use
join and union operations to perform this. In order to join two sources,
they should have at least one common column. In order to use union, both
sources should have the same number of columns and the same data types.

Join: Column level concatenation
Union: Row level concatenation
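
The difference between join and union can be sketched with a small
in-memory SQLite database. The table and column names below (emp, dept,
emp_src2) are made up for illustration and do not come from any particular
source system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp      (eno INTEGER, ename TEXT, dno INTEGER);
    CREATE TABLE dept     (dno INTEGER, dname TEXT);
    CREATE TABLE emp_src2 (eno INTEGER, ename TEXT, dno INTEGER);
    INSERT INTO emp      VALUES (100, 'ravi', 10), (101, 'ramu', 20);
    INSERT INTO dept     VALUES (10, 'sales'), (20, 'finance');
    INSERT INTO emp_src2 VALUES (102, 'vinay', 10);
""")

# Join: column level concatenation, matched on the common column dno.
joined = conn.execute("""
    SELECT e.eno, e.ename, d.dname
    FROM emp e JOIN dept d ON e.dno = d.dno
    ORDER BY e.eno
""").fetchall()

# Union: row level concatenation; both sides have the same columns and types.
unioned = conn.execute("""
    SELECT eno, ename, dno FROM emp
    UNION
    SELECT eno, ename, dno FROM emp_src2
    ORDER BY eno
""").fetchall()
```

Here joined gains a department name column per employee, while unioned
gains an extra employee row from the second source.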

23. What is data cleansing?

Data cleansing means removing unwanted or inconsistent data. For example,
you have a table Employee with a column city, and the values arrive in
different forms from different source systems:

S1- Hyd  S2- hyd  S3- Hyderabad  S4- hyderabad  S5- HYDERABAD

In the DWH we maintain all values as "HYDERABAD" based on a business
decision. We convert the S1 value Hyd, the S2 value hyd, the S3 value
Hyderabad and the S4 value hyderabad to HYDERABAD.

Another example: the amt column in source system 1 has two decimal places
(100.56), source system 2 has three decimal places (200.345) and source
system 3 has one decimal place (300.1). While integrating data from all
these source systems we convert all values to 3 decimal places.
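
A minimal sketch of both cleansing rules in Python follows. The lookup
table CITY_STANDARD and the function names are hypothetical; a real project
would take the mapping from the business decision mentioned above.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical mapping from known spellings (lower-cased) to the standard value.
CITY_STANDARD = {"hyd": "HYDERABAD", "hyderabad": "HYDERABAD"}

def clean_city(raw):
    """Map a source city value to its standard form."""
    key = raw.strip().lower()
    # Values missing from the mapping are only upper-cased, not guessed at.
    return CITY_STANDARD.get(key, raw.strip().upper())

def clean_amount(raw):
    """Convert an amount given as text to exactly 3 decimal places."""
    return Decimal(raw).quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

cities = [clean_city(c) for c in ["Hyd", "hyd", "Hyderabad", "hyderabad", "HYDERABAD"]]
amounts = [clean_amount(a) for a in ["100.56", "200.345", "300.1"]]
```

Decimal is used instead of float so that the padded values keep their
trailing zeros exactly.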

24. What is data scrubbing?

Data scrubbing means deriving new column values based on existing column
values. For example, an employee table has columns eno, ename and sal
coming from the source, and we need to calculate tax based on salary. Tax
is not coming from the source, so we derive its value. This operation is
known as data scrubbing.
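
As a rough sketch, deriving the tax column could look like this in Python.
The flat 10% TAX_RATE is an assumption made purely for illustration; the
actual rule would come from the business.

```python
from decimal import Decimal

TAX_RATE = Decimal("0.10")  # assumed flat rate, for illustration only

def add_tax(row):
    """Return a copy of the source row with a derived 'tax' column."""
    return {**row, "tax": row["sal"] * TAX_RATE}

source_rows = [
    {"eno": 100, "ename": "ravi", "sal": Decimal("2000")},
    {"eno": 101, "ename": "ramu", "sal": Decimal("4000")},
]
scrubbed = [add_tax(r) for r in source_rows]
```

The source rows are left unchanged; only the rows loaded to the target
carry the derived column.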

25. What is data aggregation?

Data aggregation means calculating values based on multiple rows of
transactional data. We perform different kinds of data aggregation using
aggregate functions such as sum(), max(), min(), count(), avg() etc.
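
The aggregate functions named above can be tried on an in-memory SQLite
table. The sales table and its rows are invented for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, qty INTEGER);
    INSERT INTO sales VALUES ('east', 10), ('east', 30), ('west', 5);
""")

# One output row per region, computed from many transactional rows.
summary = conn.execute("""
    SELECT region, SUM(qty), MAX(qty), MIN(qty), COUNT(*), AVG(qty)
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()
```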

26. What is a data model?

A data model is a database design which describes the tables and
relationships within the database. A data model helps in organizing the
data in an efficient manner.

27. What are the different phases of data modeling?

There are three different phases in data modeling.

a) Conceptual Data Modeling
b) Logical Data Modeling
c) Physical Data Modeling

28. What is conceptual data modeling?

This is the first phase of data modeling. After analyzing the FSD
(Functional Specification Document) and BRD (Business Requirement Document)
we identify the list of entities and the relationships between them. No
attributes or keys are defined in this phase.

29. What is logical data modeling?

This is the second phase of data modeling. In this phase we define the
attributes of each entity and its keys, i.e. the primary key and foreign
keys. In this phase we go for approvals from the Data Architect team.

30. What is physical data modeling?


This is the third phase of data modeling. In this phase we define table
names, column names, data types, constraints etc according to the standards.
We use this model to create database objects in the database.

31. Which data model is used in data warehousing projects?

Dimensional modeling is used in data warehousing projects. Dimensional
modeling includes star schema, snow flake schema and hybrid schema. A
hybrid schema is a combination of star and snow flake schemas. All these
schemas contain dimension tables and fact tables. Relationships exist
between dimension tables and fact tables.

32. What are different types of data modeling tools?

In data warehousing projects we use different data modeling tools.

Erwin
Open ModelSphere
ModelRight etc

33. What is star schema?

A star schema is a dimensional model with a centrally located fact table
surrounded by multiple dimension tables. In this schema, each dimension
table has its own primary key. The fact table has its own primary key and
holds a foreign key to each dimension table.
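
A small star schema can be sketched in SQLite as below. The table names
(dim_product, dim_date, fact_sales) and the sample rows are assumptions
chosen for illustration, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: each has its own primary key.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, year INTEGER);

    -- Fact table: own primary key plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        sales_key   INTEGER PRIMARY KEY,
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key    INTEGER REFERENCES dim_date(date_key),
        quantity    INTEGER
    );

    INSERT INTO dim_product VALUES (1, 'pen'), (2, 'book');
    INSERT INTO dim_date    VALUES (20160101, 2016), (20170101, 2017);
    INSERT INTO fact_sales  VALUES (1, 1, 20160101, 5),
                                   (2, 1, 20170101, 7),
                                   (3, 2, 20170101, 2);
""")

# A report needs only one join per dimension.
report = conn.execute("""
    SELECT p.product_name, d.year, SUM(f.quantity)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d    ON d.date_key = f.date_key
    GROUP BY p.product_name, d.year
    ORDER BY p.product_name, d.year
""").fetchall()
```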

34. What is snow flake schema?

A snow flake schema is similar to a star schema except that there are
hierarchical relationships from dimension to dimension. A snow flake
schema also contains dimensions and facts. Data is normalized in a snow
flake schema, so it occupies less space in the database. Compared to a
star schema, snow flake schema performance is lower, because more tables
have to be joined in order to get relevant data for business reports.

35. What are the advantages of star schema?

The main advantages of star schema are:

a) Easy to understand: Relationships between dimensions and facts are
based on primary and foreign keys.

b) High performance: In order to get relevant data for business reports we
need to join only dimensions and facts, which leads to fewer joins.

36. When do we go for star schema and when do we go for snow flake schema?

In general, DWH projects prefer the star schema because it performs well.
We go for snow flake schema in DWH projects in the cases below.

a) Database space is a constraint: Data is normalized in snow flake schema,
so there is less data redundancy and it occupies less space.

b) The number of columns in a table is more than 500: If a table has a very
large number of columns, retrieval is slow. To overcome this we split the
columns into multiple tables.

37. What is a dimension?

A dimension represents descriptive or textual information. The data modeler
identifies the attributes which come under a dimension table. Common
examples of dimension tables are customer, time and product.

38. What are different types of dimensions?


There are different types of dimensions
Conformed Dimension
Junk Dimension
Degenerated Dimension
Role-playing Dimension
Rapidly changing Dimension
Inferred Dimension

39. What is conformed dimension?

A dimension shared by multiple fact tables in a data warehouse or across all
data warehouses is called a conformed dimension. The time dimension is an
example of a conformed dimension.

40. What is junk dimension?


A junk dimension is a collection of random transactional codes, flags
and/or text attributes that are unrelated to any particular dimension. The
junk dimension is simply a structure that provides a convenient place to
store the junk attributes.

Eg: Assume that we have a gender dimension and marital status dimension.
In the fact table we need to maintain two keys referring to these
dimensions. Instead of that create a junk dimension which has all the
combinations of gender and marital status (cross join gender and marital
status table and create a junk table). Now we can maintain only one key in
the fact table.
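
The cross join described above can be sketched in Python with
itertools.product. The attribute values and the name junk_key_lookup are
illustrative assumptions.

```python
from itertools import product

genders = ["M", "F"]
marital_statuses = ["single", "married"]

# Cross join: one junk dimension row per combination, keyed by a surrogate key.
junk_dim = {
    key: {"gender": g, "marital_status": m}
    for key, (g, m) in enumerate(product(genders, marital_statuses), start=1)
}

# During the fact load, two attribute values resolve to a single key.
junk_key_lookup = {
    (row["gender"], row["marital_status"]): key for key, row in junk_dim.items()
}
```

The fact table then stores only the one key returned by junk_key_lookup
instead of separate gender and marital status keys.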

41. What is degenerated dimension?

A degenerate dimension is a dimension which is derived from the fact table
and doesn't have its own dimension table. An order number or invoice number
stored directly in the fact table is a typical example.

In a data warehouse this dimension is often used to provide drill-through
capability, so that in a report you can see how an aggregated number was
arrived at.

42. What is Role playing dimension?


A role-playing dimension is one where the same dimension key — along
with its associated attributes — can be joined to more than one foreign key
in the fact table. For example, a fact table may include foreign keys for both
ship date and delivery date. But the same date dimension attributes apply
to each foreign key, so you can join the same dimension table to both
foreign keys. Here the date dimension is taking multiple roles to map ship
date as well as delivery date, and hence the name of role playing
dimension.
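
The ship date / delivery date example can be sketched in SQLite by joining
the same date dimension twice under two aliases. Table names and rows are
made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);
    CREATE TABLE fact_orders (
        order_id          INTEGER PRIMARY KEY,
        ship_date_key     INTEGER REFERENCES dim_date(date_key),
        delivery_date_key INTEGER REFERENCES dim_date(date_key)
    );
    INSERT INTO dim_date    VALUES (1, '2016-01-10'), (2, '2016-01-15');
    INSERT INTO fact_orders VALUES (500, 1, 2);
""")

# The one physical dim_date table plays two roles via two aliases.
rows = conn.execute("""
    SELECT f.order_id, ship.full_date, delivery.full_date
    FROM fact_orders f
    JOIN dim_date AS ship     ON ship.date_key = f.ship_date_key
    JOIN dim_date AS delivery ON delivery.date_key = f.delivery_date_key
""").fetchall()
```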

43. What is Rapidly changing dimension?

A dimension attribute that changes frequently is a rapidly changing
attribute. If you don’t need to track the changes, the rapidly changing
attribute is no problem, but if you do need to track the changes, using a
standard slowly changing dimension technique can result in a huge inflation
of the size of the dimension. One solution is to move the attribute to its
own dimension, with a separate foreign key in the fact table. This new
dimension is called a rapidly changing dimension.

44. What is inferred dimension?


While loading fact records, a dimension record may not yet be ready. One
solution is to generate a surrogate key with null for all the other attributes.
This should technically be called an inferred member, but is often called an
inferred dimension.

45. What is a fact?

A fact is a numeric value, but not every numeric value is a fact. A numeric
value which is used for business analysis is considered a fact. For example,
customer number is not a fact even though it is a number. Quantity is a
fact: we can do business analysis on it, such as how much quantity was sold
in a particular location or in a particular year.

46. What are different types of facts?


There are 3 different types of facts
Additive
Semi-Additive
Non-Additive

47. What is Additive fact?


Additive facts are facts that can be summed up through all of the
dimensions in the fact table. A sales fact is a good example for additive fact.

48. What is Semi-Additive fact?


Semi-additive facts are facts that can be summed up for some of the
dimensions in the fact table, but not the others.
Eg: Daily balances fact can be summed up through the customers
dimension but not through the time dimension.
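
A short Python sketch of the daily balance example follows. The sample
balances and function names are invented; the point is that summing across
customers is meaningful while across time the latest value is taken instead.

```python
# (date, customer, balance) rows of a hypothetical daily balance fact.
daily_balances = [
    ("2016-01-01", "ravi", 100), ("2016-01-01", "ramu", 50),
    ("2016-01-02", "ravi", 120), ("2016-01-02", "ramu", 50),
]

def total_balance_on(day):
    """Summing across the customer dimension for one day is valid."""
    return sum(b for d, _, b in daily_balances if d == day)

def closing_balance(customer):
    """Across the time dimension, take the latest balance rather than a sum."""
    latest = max(d for d, c, _ in daily_balances if c == customer)
    return next(b for d, c, b in daily_balances if d == latest and c == customer)
```

Adding ravi's two daily balances would give 220, a figure that was never
actually in the account.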

49. What is Non-Additive fact?


Non-additive facts are facts that cannot be summed up for any of the
dimensions present in the fact table.
Eg: Facts which have percentages, ratios calculated.

50. What is SCD?

SCD means slowly changing dimension. An SCD type defines what we need to do
in our data warehouse tables (insert/update) whenever there is a change in
the source.

51. What are different types of SCDs?


There are 3 commonly used types of SCDs (Slowly Changing Dimensions)
SCD TYPE-I
SCD TYPE-II
SCD TYPE-III

52. What is SCD TYPE-I?


SCD TYPE-I represents only current information. Whenever there is a change
in the source data, if it is a new record we insert it, and if it is an
existing record we update it. At any point of time we can see only the
latest information.
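
A minimal sketch of the Type-I rule, with the dimension held as a Python
dict keyed by eno. The function name apply_scd1 and the sample rows are
hypothetical.

```python
def apply_scd1(dimension, source_row, key="eno"):
    """Insert a new record, or overwrite the existing one (no history kept)."""
    dimension[source_row[key]] = dict(source_row)
    return dimension

emp_dim_t1 = {}
apply_scd1(emp_dim_t1, {"eno": 100, "ename": "ravi", "sal": 2000})
apply_scd1(emp_dim_t1, {"eno": 100, "ename": "ravi", "sal": 4000})  # update
apply_scd1(emp_dim_t1, {"eno": 101, "ename": "ramu", "sal": 4000})  # insert
```

After these loads the old salary 2000 is no longer available anywhere in
the dimension.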

53. What is SCD TYPE-II?


SCD TYPE-II represents complete history. Whenever there is a change in the
source data, if it is a new record or an existing record with changes, we
insert it. We can identify the latest record by using FLAG='Y' or
END_DT='31-dec-9999'.

Eg (using a flag):
eno  ename  sal   flag  eid
100  ravi   2000  N     1
100  ravi   4000  N     2
100  ravi   6000  Y     3
101  ramu   4000  Y     4

Eg (using start and end dates):
eno  ename  sal   start_dt   end_dt     eid
100  ravi   2000  10-dec-15  10-jan-16  1
100  ravi   4000  11-jan-16  20-dec-16  2
100  ravi   6000  21-dec-16  31-dec-99  3
101  ramu   4000  15-dec-16  31-dec-99  4
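
The date-based variant can be sketched in Python as follows. This is only
one possible way to implement it: apply_scd2 is a hypothetical helper,
itertools.count stands in for a sequence generator, and 31-dec-9999 is used
as the open end date.

```python
from datetime import date, timedelta
from itertools import count

HIGH_DATE = date(9999, 12, 31)   # marks the current (open) record
_next_eid = count(1)             # stands in for a sequence generator

def apply_scd2(dimension, eno, ename, sal, effective):
    """Close the current record if the data changed, then insert a new one."""
    current = next(
        (r for r in dimension if r["eno"] == eno and r["end_dt"] == HIGH_DATE),
        None,
    )
    if current is not None:
        if (current["ename"], current["sal"]) == (ename, sal):
            return dimension  # nothing changed, keep the current record
        current["end_dt"] = effective - timedelta(days=1)
    dimension.append({
        "eid": next(_next_eid), "eno": eno, "ename": ename, "sal": sal,
        "start_dt": effective, "end_dt": HIGH_DATE,
    })
    return dimension

emp_dim_t2 = []
apply_scd2(emp_dim_t2, 100, "ravi", 2000, date(2015, 12, 10))
apply_scd2(emp_dim_t2, 100, "ravi", 4000, date(2016, 1, 11))
apply_scd2(emp_dim_t2, 100, "ravi", 6000, date(2016, 12, 21))
apply_scd2(emp_dim_t2, 101, "ramu", 4000, date(2016, 12, 15))
```

Loading the four changes in this order reproduces the eid values and the
start and end dates of the example rows.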

45. What is SCD TYPE-III?


SCD TYPE-III represents partial history. Partial history means only current
and previous information. In SCD TYPE-III we maintain two columns
(present and previous).

Eg:-
eno  ename  curr_sal  prev_sal
100  ravi   4000      2000
101  vinay  6000      3000
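
A minimal Python sketch of the Type-III rule, using the curr_sal and
prev_sal columns from the example. The helper name apply_scd3 is an
assumption.

```python
def apply_scd3(dimension, eno, ename, sal):
    """Keep only the current and the previous salary for each employee."""
    row = dimension.get(eno)
    if row is None:
        dimension[eno] = {"ename": ename, "curr_sal": sal, "prev_sal": None}
    elif row["curr_sal"] != sal:
        # Shift current to previous; anything older is lost.
        row["prev_sal"], row["curr_sal"] = row["curr_sal"], sal
    return dimension

emp_dim_t3 = {}
apply_scd3(emp_dim_t3, 100, "ravi", 2000)
apply_scd3(emp_dim_t3, 100, "ravi", 4000)
```

A third change for the same employee would overwrite prev_sal with 4000,
which is why this type holds only partial history.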

54. What is data purging?

Data purging is the process of deleting data from the data warehouse.
Based on business requirements we maintain 5, 7 or 10 years of data in DWH
dimension and fact tables, and delete data older than that. We create
separate mappings to purge or delete data from dimensions and facts.

We schedule these mappings once a year to delete old data. Data is deleted
from dimensions and facts based on key columns, using an update strategy
transformation. The purging process removes unnecessary data from the DWH
and improves performance.

55. What is factless fact table?


In the real world, it is possible to have a fact table that contains no
measures or facts. These tables are called "Factless Fact tables".

Eg: A fact table which has only a product key and a date key is a factless
fact table. There are no measures in this table, but you can still get the
number of products sold over a period of time by counting rows.
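
Counting rows in a factless fact table can be sketched as below. The rows
and the integer date keys (yyyymmdd) are assumptions made for illustration.

```python
from collections import Counter

# (product_key, date_key) rows only; the table carries no measures.
factless_sales = [(1, 20160101), (1, 20160102), (2, 20160102), (1, 20160201)]

def products_sold(start_key, end_key):
    """Count rows per product within a date key range."""
    return Counter(p for p, d in factless_sales if start_key <= d <= end_key)

january = products_sold(20160101, 20160131)
```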

56. What is galaxy schema?


Galaxy schema contains many fact tables with some common dimensions
(conformed dimensions). This schema is a combination of many data marts.

57. What are different types of data loads in data warehousing projects?

There are two different types of data loads in data warehousing projects
History Data Loads
Incremental Data Loads or Delta Data Loads

Once we develop the data warehouse, as part of history data loads we
extract the entire data from multiple source systems into the warehouse.
Incremental data loads (daily/weekly/monthly) happen after history data
loads.

58. What is granularity of fact table?


Granularity represents the lowest level of information that will be stored
in the fact table.
Eg:- Employee performance is a very high level of granularity.
Employee_performance_daily, employee_performance_weekly can be
considered lower levels of granularity.

59. What is surrogate key?


A surrogate key is an artificial key which does not come from the source.
We can generate it as a sequence of numbers using a sequence generator. In
SCD TYPE-II we use a surrogate key.

60. What is staging area?


The data warehouse staging area is a temporary location where data from
source systems is copied. A staging area is mainly required in a data
warehousing architecture for timing reasons. In short, all required data
must be available before data can be integrated into the data warehouse.

Due to varying business cycles, data processing cycles, hardware and
network resource limitations and geographical factors, it is not feasible to
extract all the data from all operational databases at exactly the same time.

For example, it might be reasonable to extract sales data on a daily basis;
however, daily extracts might not be suitable for financial data that
requires a month-end reconciliation process. Similarly, it might be feasible
to extract "customer" data from a database in Singapore at noon eastern
standard time, but this would not be feasible for "customer" data in a
Chicago database.

61. What are different phases of DWH project life cycle?


There are different phases
Requirements Gathering
Design
Build
Testing
Production or go live

62. What is hybrid SCD?


Hybrid SCDs are a combination of both SCD 1 and SCD 2. It may happen that
in a table some columns are important and we need to track changes for
them, i.e. capture their historical data, whereas for other columns we do
not need to track changes even if the data changes. For such tables we
implement hybrid SCDs, wherein some columns are Type 1 and some are
Type 2.

63. What are the commonly used indexes in data warehouse?


B-tree indexes
Bitmap indexes

64. What is ODS?


An Operational Data Store (ODS) is an integrated database of operational
data. Its sources include legacy systems and it contains current or near
term data. An ODS may contain 30 to 60 days of information, while a data
warehouse typically contains years of data.

65. What is the difference between ODS and DWH?


An operational data store is basically a database that is used for being an
interim area for a data warehouse. As such, its primary purpose is for
handling data which are progressively in use such as transactions, inventory
and collecting data from Point of Sales. It works with a data warehouse but
unlike a data warehouse, an operational data store does not contain static
data. Instead, an operational data store contains data which are constantly
updated through the course of the business operations.

An ODS is specially designed so that it can quickly perform relatively
simple queries on smaller volumes of data, such as finding the orders of a
customer or looking for available items in a retail store. This is in
contrast to a data warehouse, where one needs to perform complex queries on
high volumes of data. As a simple analogy, an operational data store is a
company's short term memory, storing only the most recent information,
while the data warehouse is the long term memory, serving as the company's
historical data repository whose data is relatively permanent.

66. What is data mining?


Data mining is the practice of automatically searching large stores of data
to discover patterns and trends that go beyond simple analysis. Data mining
uses sophisticated mathematical algorithms to segment the data and
evaluate the probability of future events. Data mining is also known as
Knowledge Discovery in Data (KDD).

67. What are the key properties of data mining?


The key properties of data mining are:
Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large data sets and databases

68. What is normalization?


Normalization is the process of organizing the columns (attributes) and
tables (relations) of a relational database to reduce data redundancy and
improve data integrity.

69. What is first normalization?


A relation is in first normal form if and only if the domain of each attribute
contains only atomic (indivisible) values, and the value of each attribute
contains only a single value from that domain.

70. What is second normalization?


A table is in second normal form if it satisfies the following conditions:
It is in first normal form
All non-key attributes are fully functionally dependent on the primary key
