CCS341_Data Warehousing_Unit 4 Notes

The document discusses dimensional modeling and schema in data warehousing, focusing on multi-dimensional data models, data cubes, and various schema types such as star, snowflake, and galaxy schemas. It outlines the structure and components of these schemas, including fact and dimension tables, and highlights the advantages and disadvantages of multi-dimensional data models. Additionally, it explains the process of building a multi-dimensional data model and the significance of OLAP tools in analyzing large datasets.

Uploaded by

ramya
Copyright
© All Rights Reserved

RIT CCS341 - DATA WAREHOUSING 1

UNIT-IV

DIMENSIONAL MODELING AND SCHEMA

Dimensional Modeling - Multi-Dimensional Data Modeling - Data Cube - Star Schema -
Snowflake Schema - Star vs. Snowflake Schema - Fact Constellation Schema - Schema
Definition - Process Architecture - Types of Database Parallelism - Data Warehouse Tools

Multidimensional Model:
A multidimensional model views data in the form of a data cube. A data cube enables data to be modelled
and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities with respect to which an organization keeps records. For
example, a shop may create a sales data warehouse to keep records of the store's sales for the dimensions
time, item, and location. These dimensions allow the store to keep track of things such as the monthly sales
of items and the locations at which the items were sold. Each dimension has a table related to it, called a
dimension table, which describes the dimension further. For example, a dimension table for item may
contain the attributes item name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This theme is
represented by a fact table. Facts are numerical measures. The fact table contains the names of the facts, or
measures, and keys to each of the related dimension tables.

Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In
this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the
item dimension (classified according to the types of item sold). The fact, or measure, displayed is rupees
sold (in thousands).

Now suppose we want to view the sales data with a third dimension. For example, suppose the data is viewed
according to time and item, as well as location, for the cities Chennai, Kolkata, Mumbai, and Delhi.
These 3D data are shown in the table. The 3D data of the table are represented as a series of 2D tables.

Conceptually, the same data may also be represented in the form of a 3D data cube, as shown in the figure:
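
As a rough illustration of the 2D and 3D views described above, the following Python sketch stores each cube cell
under a (quarter, item, city) key. The figures, item names, and helper function names are hypothetical and are not
taken from the tables in these notes.

```python
from collections import defaultdict

# Hypothetical sales figures (rupees sold, in thousands); not from the notes' tables.
# Each cell of the 3D cube is addressed by (quarter, item, city).
cube = {
    ("Q1", "Mobile", "Delhi"): 605,
    ("Q1", "Modem", "Delhi"): 825,
    ("Q2", "Mobile", "Delhi"): 680,
    ("Q1", "Mobile", "Mumbai"): 818,
    ("Q2", "Mobile", "Mumbai"): 894,
    ("Q2", "Modem", "Mumbai"): 769,
}


def slice_by_city(cells, city):
    """Return the 2D (quarter, item) table for a single city."""
    return {(q, i): v for (q, i, c), v in cells.items() if c == city}


def sum_over_cities(cells):
    """Collapse the location dimension, leaving a 2D (quarter, item) table."""
    totals = defaultdict(int)
    for (q, i, _city), v in cells.items():
        totals[(q, i)] += v
    return dict(totals)


delhi = slice_by_city(cube, "Delhi")   # one of the "series of 2D tables"
all_cities = sum_over_cities(cube)     # totals across every city

print(delhi[("Q1", "Mobile")])         # 605
print(all_cities[("Q1", "Mobile")])    # 605 + 818 = 1423
```

Storing only the cells that exist keeps the sketch small; a real OLAP engine would use its own storage layout.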

Working on a Multidimensional Data Model


Every project that builds a multi-dimensional data model should follow these stages:
Stage 1: Assembling data from the client - In the first stage, the correct data is collected from the client.
Typically, software professionals explain to the client the range of data that can be obtained with the
selected technology, and then collect the complete data in detail.

Stage 2: Grouping different segments of the system - In the second stage, all the data is identified and
classified into the sections it belongs to, so that the model can be applied step by step without problems.

Stage 3: Noticing the different proportions - The third stage is the basis on which the design of the
system rests. In this stage, the main factors are identified from the user's point of view. These
factors are known as "dimensions".

Stage 4: Preparing the actual-time factors and their respective qualities - In the fourth stage, the
factors identified in the previous stage are used to identify their related qualities.
These qualities are known as "attributes" in the database.

Stage 5: Finding the actuality of factors which are listed previously and their qualities - In the fifth
stage, the facts (the "actuality") are separated and distinguished from the factors collected so far. These
facts play a significant role in the arrangement of a multi-dimensional data model.

Stage 6: Building the schema to place the data, with respect to the information collected from the
steps above - In the sixth stage, a schema is built on the basis of the data collected in the previous stages.
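
Stages 3 to 6 can be sketched in Python as follows. This is only an illustrative outline, assuming a sales theme;
the class names (Dimension, Schema), the "_key" naming convention, and the chosen attributes and measures are
hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dimension:
    name: str
    attributes: List[str]


@dataclass
class Schema:
    fact_table: str
    measures: List[str]
    dimensions: List[Dimension]

    def fact_columns(self) -> List[str]:
        """One key column per dimension, followed by the measures."""
        return [f"{d.name}_key" for d in self.dimensions] + self.measures


# Stages 3 and 4: identify the dimensions and their attributes.
dimensions = [
    Dimension("time", ["day", "month", "quarter", "year"]),
    Dimension("item", ["item_name", "brand", "type"]),
    Dimension("location", ["city", "state", "country"]),
]

# Stage 5: identify the facts (numerical measures).
measures = ["rupees_sold", "units_sold"]

# Stage 6: build the schema from what was collected.
schema = Schema("sales", measures, dimensions)
print(schema.fact_columns())
```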

Features of multidimensional data models:


Measures: Measures are numerical data that can be analyzed and compared, such as sales or revenue. They
are typically stored in fact tables in a multidimensional data model.
Dimensions: Dimensions are attributes that describe the measures, such as time, location, or product. They
are typically stored in dimension tables in a multidimensional data model.
Cubes: Cubes are structures that represent the multidimensional relationships between measures and
dimensions in a data model. They provide a fast and efficient way to retrieve and analyze data.
Aggregation: Aggregation is the process of summarizing data across dimensions and levels of detail. This
is a key feature of multidimensional data models, as it enables users to quickly analyze data at different levels
of granularity.
Drill-down and roll-up: Drill-down is the process of moving from a higher-level summary of data to a lower
level of detail, while roll-up is the opposite process of moving from a lower-level detail to a higher-level
summary. These features enable users to explore data in greater detail and gain insights into the underlying
patterns.
Hierarchies: Hierarchies are a way of organizing dimensions into levels of detail. For example, a time
dimension might be organized into years, quarters, months, and days. Hierarchies provide a way to navigate
the data and perform drill-down and roll-up operations.
OLAP (Online Analytical Processing): OLAP is a category of analysis technology built on the multidimensional
data model that supports fast and efficient querying of large datasets. OLAP systems are designed to handle
complex queries and provide fast response times.
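
The aggregation, hierarchy, roll-up, and drill-down features listed above can be illustrated with a small Python
sketch. It assumes a time hierarchy of year, quarter, month, and day; the sales values and the function name
aggregate are hypothetical.

```python
from collections import defaultdict

# Hypothetical daily sales keyed by (year, quarter, month, day).
daily_sales = {
    (2024, "Q1", "Jan", 5): 120,
    (2024, "Q1", "Jan", 20): 80,
    (2024, "Q1", "Feb", 3): 100,
    (2024, "Q2", "Apr", 11): 150,
    (2024, "Q2", "May", 9): 50,
}

LEVELS = ["year", "quarter", "month", "day"]  # coarse to fine


def aggregate(facts, level):
    """Sum the measure at the given level of the time hierarchy."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(int)
    for key, amount in facts.items():
        totals[key[:depth]] += amount
    return dict(totals)


by_month = aggregate(daily_sales, "month")      # finer view (drill-down from quarter)
by_quarter = aggregate(daily_sales, "quarter")  # coarser view (roll-up from month)
by_year = aggregate(daily_sales, "year")

print(by_quarter[(2024, "Q1")])  # 120 + 80 + 100 = 300
print(by_year[(2024,)])          # 500
```

Moving from by_quarter to by_month corresponds to drill-down, and moving from by_month to by_quarter corresponds
to roll-up.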

Advantages of Multi-Dimensional Data Model
The following are the advantages of a multi-dimensional data model:
• A multi-dimensional data model is easy to handle.
• It is easy to maintain.
• For analytical queries, its performance is typically better than that of conventional databases (e.g.
relational databases).
• The representation of data is better than in traditional databases, because multi-dimensional
databases are multi-viewed and carry different types of factors.
• It is workable on complex systems and applications, unlike simple one-dimensional
database systems.
• Its compatibility benefits projects that have limited capacity for maintenance staff.

Disadvantages of Multi-Dimensional Data Model


The following are the disadvantages of a multi-dimensional data model:
• The multi-dimensional data model is somewhat complicated in nature and requires professionals to
recognize and examine the data in the database.
• When the system caches data while a multi-dimensional data model is in use, the working of the
system is greatly affected.
• Because it is complicated in nature, the databases are generally dynamic in design.
• The path to achieving the end product is complicated most of the time.
• Because the multi-dimensional data model involves complicated systems with a large number of
databases, the system is highly vulnerable when a security breach occurs.

What is a Data Cube?

When data is grouped or combined in multidimensional matrices, these matrices are called data cubes. The
data cube method has a few alternative names or variants, such as "multidimensional databases,"
"materialized views," and "OLAP (On-Line Analytical Processing)."

The general idea of this approach is to materialize certain expensive computations that are frequently
queried.

For example, a relation with the schema sales(part, supplier, customer, sale-price) can be materialized
into a set of eight views as shown in the figure, where psc indicates a view consisting of aggregate function
values (such as total-sales) computed by grouping the three attributes part, supplier, and customer, p indicates
a view composed of the corresponding aggregate function values calculated by grouping part alone, etc.
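
The eight views arise because each of the three grouping attributes is either kept or dropped (2^3 = 8). A minimal
Python sketch, assuming a handful of hypothetical sales rows and using the label "none" for the view with no
grouping attribute (the label is an assumption, not taken from the figure):

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("part", "supplier", "customer")

# Hypothetical rows of sales(part, supplier, customer, sale_price).
sales = [
    ("bolt", "acme", "alice", 10),
    ("bolt", "acme", "bob", 20),
    ("nut", "acme", "alice", 5),
    ("nut", "zenith", "bob", 15),
]


def materialize_views(rows):
    """Compute total sales for every subset of the grouping attributes."""
    views = {}
    for k in range(len(DIMENSIONS), -1, -1):
        for group in combinations(range(len(DIMENSIONS)), k):
            # View name is built from initials, e.g. (0, 1, 2) -> "psc".
            name = "".join(DIMENSIONS[i][0] for i in group) or "none"
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[i] for i in group)] += row[-1]
            views[name] = dict(totals)
    return views


views = materialize_views(sales)
print(sorted(views))         # the eight view names
print(views["p"])            # totals grouped by part alone
print(views["none"][()])     # grand total: 50
```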

A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure
attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or
functional attributes. The measure attributes are aggregated according to the dimensions.

For example, XYZ may create a sales data warehouse to keep records of the store's sales for the dimensions
time, item, branch, and location. These dimensions enable the store to keep track of things like monthly sales
of items, and the branches and locations at which the items were sold. Each dimension may have a table
associated with it, known as a dimension table, which describes the dimension. For example, a dimension
table for item may contain the attributes item name, brand, and type.
The data cube method is an interesting technique with many applications. Data cubes can be sparse in many
cases because not every cell in each dimension has corresponding data in the database.
Techniques should be developed to handle sparse cubes efficiently.
If a query contains constants at even lower levels than those provided in a data cube, it is not clear how to
make the best use of the precomputed results stored in the data cube.
The multidimensional data model views data in the form of a data cube. OLAP tools are based on the
multidimensional data model. Data cubes usually model n-dimensional data.
A data cube enables data to be modelled and viewed in multiple dimensions. A multidimensional data model
is organized around a central theme, like sales and transactions. A fact table represents this theme. Facts are
numerical measures. Thus, the fact table contains measures (such as Rs. sold) and keys to each of the related
dimension tables.
Dimensions are the perspectives that define a data cube. Facts are generally quantities, which are used for
analyzing the relationships between dimensions.

Example: In the 2-D representation, we will look at the All Electronics sales data for items sold per quarter in
the city of Vancouver. The measure displayed is dollars sold (in thousands).

3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For example, suppose we would
like to view the data according to time and item, as well as location, for the cities Chicago, New York,
Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown in
the table. The 3-D data of the table are represented as a series of 2-D tables.

Conceptually, we may represent the same data in the form of a 3-D data cube, as shown in the figure:

Let us suppose that we would like to view our sales data with an additional fourth dimension, such as
supplier.
In data warehousing, data cubes are n-dimensional. The cuboid which holds the lowest level of
summarization is called the base cuboid.

For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location, and supplier
dimensions.

The figure shows a 4-D data cube representation of sales data, according to the dimensions time, item,
location, and supplier. The measure displayed is dollars sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex cuboid. In
this example, this is the total sales, or dollars sold, summarized over all four dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids making up a 4-D data cube
for the dimensions time, item, location, and supplier. Each cuboid represents a different degree of
summarization.
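
To make the lattice concrete, the short Python sketch below enumerates the cuboids for the four dimensions named
above, one level per number of retained dimensions. It ignores concept hierarchies within each dimension, which is
a simplifying assumption; the variable names are hypothetical.

```python
from itertools import combinations
from math import comb

dims = ("time", "item", "location", "supplier")

# lattice[k] holds every k-dimensional cuboid, named by the dimensions it keeps.
lattice = {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}

apex_cuboid = lattice[0][0]           # () : everything summarized, 0-D
base_cuboid = lattice[len(dims)][0]   # all four dimensions kept, 4-D
total_cuboids = sum(len(level) for level in lattice.values())

for k, level in lattice.items():
    print(f"{k}-D cuboids: {len(level)}")
print(total_cuboids)  # 2 ** 4 = 16
```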

Schemas Used in Data Warehouses: Star, Galaxy (Fact constellation), and Snowflake:

What Is a Data Warehouse Schema?

We can think of a data warehouse schema as a blueprint or an architecture of how data will be stored and
managed. A data warehouse schema isn’t the data itself, but the organization of how data is stored and how it
relates to other data components within the data warehouse architecture.

In the past, data warehouse schemas were often strictly enforced across an enterprise, but in modern
implementations where storage is increasingly inexpensive, schemas have become less constrained. Despite this
loosening or sometimes total abandonment of data warehouse schemas, knowledge of the foundational
schema designs can be important both for maintaining legacy resources and for creating modern data
warehouse designs that learn from the past.

The basic components of all data warehouse schemas are fact and dimension tables. Different combinations
of these two central elements make up almost all data warehouse schema designs.

Fact Table

A fact table aggregates metrics, measurements, or facts about business processes. Fact tables
are connected to dimension tables to form a schema architecture representing how data relates within the data
warehouse. Fact tables store the primary keys of dimension tables as foreign keys within the fact table.

Dimension Table

Dimension tables store the descriptive attributes, or dimensions, of the data. Depending on the schema, they
may be denormalized (star schema) or normalized (snowflake schema). As mentioned above, the primary key
of a dimension table is stored as a foreign key in the fact table. In a star schema, dimension tables are not
joined to each other. Instead, they are associated through the central fact table.

3 Types of Schema Used in Data Warehouses

History presents us with three prominent types of data warehouse schema known as Star Schema, Snowflake
Schema, and Galaxy Schema. Each of these data warehouse schemas has unique design constraints and
describes a different organizational structure for how data is stored and how it relates to other data within
the data warehouse.

What Is a Star Schema in a Data Warehouse?

The star schema in a data warehouse is historically one of the most straightforward designs. This schema
follows some distinct design parameters, such as permitting only one central table and a handful of
single-dimension tables joined to that table. When these design constraints are followed, the schema can
resemble a star, with one central table and, for example, five dimension tables joined to it (which is how the
star schema got its name).

Star schema is known to create denormalized dimension tables. Denormalization is a database structuring
strategy that organizes tables to introduce redundancy for improved performance: redundancy is accepted in
the dimension tables so long as it improves query performance.
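
A star schema of this kind can be sketched with Python's built-in sqlite3 module. The table names, column names,
and rows below are hypothetical and chosen only for illustration; the point is that the fact table holds one
foreign key per dimension and each dimension is joined directly to the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Central fact table: one foreign key per dimension plus the measure.
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    rupees_sold  INTEGER
);

INSERT INTO dim_time VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO dim_item VALUES (1, 'Phone X', 'BrandA', 'Mobile'), (2, 'Router Y', 'BrandB', 'Modem');
INSERT INTO dim_location VALUES (1, 'Delhi', 'India'), (2, 'Mumbai', 'India');
INSERT INTO fact_sales VALUES (1, 1, 1, 605), (1, 2, 1, 825), (2, 1, 1, 680), (1, 1, 2, 818);
""")

# Each dimension is one join away from the fact table.
rows = conn.execute("""
    SELECT l.city, t.quarter, SUM(f.rupees_sold)
    FROM fact_sales f
    JOIN dim_time t     ON f.time_key = t.time_key
    JOIN dim_location l ON f.location_key = l.location_key
    GROUP BY l.city, t.quarter
    ORDER BY l.city, t.quarter
""").fetchall()
print(rows)
```

Note that brand and type are stored directly in dim_item, so a brand name is repeated for every item of that brand;
this is the redundancy that denormalization accepts.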

Characteristics of the Star Schema:

• Star data warehouse schemas create a denormalized database that enables quick query responses
• The primary key of each dimension table is referenced by a foreign key in the fact table
• Each dimension in the star schema maps to one dimension table
• Dimension tables within a star schema are not connected directly to each other
• Star schema creates denormalized dimension tables

What Is a Snowflake Schema?

The Snowflake Schema is a data warehouse schema that encompasses a logical arrangement of dimension
tables. This data warehouse schema builds on the star schema by adding additional sub-dimension tables that
relate to first-order dimension tables joined to the fact table.

Just like the relationship between the foreign key in the fact table and the primary key in the dimension table,
with the snowflake schema approach, a primary key in a sub-dimension table will relate to a foreign key within
the higher order dimension table.

Snowflake schema creates normalized dimension tables – a database structuring strategy that organizes tables
to reduce redundancy. The purpose of normalization is to eliminate any redundant data to reduce overhead.
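
A minimal sqlite3 sketch of this arrangement follows, assuming a hypothetical item dimension whose brand has been
moved into a sub-dimension table. All table names, column names, and rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Sub-dimension table: each brand is stored once.
CREATE TABLE dim_brand (brand_key INTEGER PRIMARY KEY, brand_name TEXT);

-- First-order dimension table: holds a foreign key to the sub-dimension.
CREATE TABLE dim_item (
    item_key  INTEGER PRIMARY KEY,
    item_name TEXT,
    brand_key INTEGER REFERENCES dim_brand(brand_key)
);

CREATE TABLE fact_sales (
    item_key    INTEGER REFERENCES dim_item(item_key),
    rupees_sold INTEGER
);

INSERT INTO dim_brand VALUES (1, 'BrandA'), (2, 'BrandB');
INSERT INTO dim_item VALUES (1, 'Phone X', 1), (2, 'Phone Y', 1), (3, 'Router Z', 2);
INSERT INTO fact_sales VALUES (1, 100), (2, 200), (3, 50), (1, 25);
""")

# Reaching the brand name now takes two joins: fact -> item -> brand.
sales_by_brand = conn.execute("""
    SELECT b.brand_name, SUM(f.rupees_sold)
    FROM fact_sales f
    JOIN dim_item  i ON f.item_key = i.item_key
    JOIN dim_brand b ON i.brand_key = b.brand_key
    GROUP BY b.brand_name
    ORDER BY b.brand_name
""").fetchall()
print(sales_by_brand)
```

The trade-off shown here is less repeated data in dim_item at the cost of an extra join per query.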

Characteristics of the Snowflake Schema:

• Snowflake schemas are permitted to have dimension tables joined to other dimension tables
• Snowflake schemas have only one fact table
• Snowflake schemas create normalized dimension tables
• The normalized schema reduces the disk space required for running and managing the data warehouse
• Snowflake schemas offer an easier way to implement a dimension

What Is a Galaxy Schema?

The Galaxy Data Warehouse Schema, also known as a Fact Constellation Schema, acts as the next iteration
of the data warehouse schema. Unlike the Star Schema and Snowflake Schema, the Galaxy Schema uses
multiple fact tables connected with shared normalized dimension tables. Galaxy Schema can be thought of as
several star schemas interlinked and normalized, which keeps redundancy and inconsistency of data low.
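
The idea of several fact tables sharing a dimension can be sketched with sqlite3 as follows. The two fact tables
(sales and shipping), the shared item dimension, and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimension table used by both fact tables.
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT);

CREATE TABLE fact_sales    (item_key INTEGER REFERENCES dim_item(item_key), units_sold INTEGER);
CREATE TABLE fact_shipping (item_key INTEGER REFERENCES dim_item(item_key), units_shipped INTEGER);

INSERT INTO dim_item VALUES (1, 'Phone X'), (2, 'Router Z');
INSERT INTO fact_sales VALUES (1, 10), (1, 5), (2, 7);
INSERT INTO fact_shipping VALUES (1, 12), (2, 3), (2, 4);
""")

# Each fact table is aggregated separately and lined up on the shared dimension,
# which avoids multiplying rows by joining the two fact tables to each other.
report = conn.execute("""
    SELECT i.item_name,
           (SELECT SUM(units_sold)    FROM fact_sales s    WHERE s.item_key = i.item_key),
           (SELECT SUM(units_shipped) FROM fact_shipping h WHERE h.item_key = i.item_key)
    FROM dim_item i
    ORDER BY i.item_name
""").fetchall()
print(report)
```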

Characteristics of the Galaxy Schema:

• Galaxy schema is multidimensional, making it a strong design consideration for complex database systems
• Galaxy schema reduces redundancy to near zero as a result of normalization
• Galaxy schema is known for high data quality and accuracy and lends itself to effective reporting and
analytics

Key Differences between Star, Snowflake, and Galaxy Schema:

Summary of Data Warehouse Schemas:

To understand data warehouse schema and its various types at the conceptual level, here are a few things to
remember:

• Data warehouse schema is a blueprint for how data will be stored and managed. It includes
definitions of terms, relationships, and the arrangement of those terms and relationships.
• Star, galaxy, and snowflake are common types of data warehouse schema that vary in the
arrangement and design of the data relationships.
• Star schema is the simplest data warehouse schema and contains just one central table and a handful
of single-dimension tables joined to it.
• Snowflake schema builds on star schema by adding sub-dimension tables, which reduces
redundancy and overhead costs.
• Galaxy schema uses multiple fact tables (snowflake and star use only one), which makes it like an
interlinked star schema. This nearly eliminates redundancy and is ideal for complex database
systems.

Which Data Warehouse Schema is Best?

There’s no one “best” data warehouse schema. The “best” schema depends on (among other things) your
resources, the type of data you’re working with, and what you’d like to do with it.

For instance, star schema is ideal for organizations that want maximum simplicity and can tolerate higher disk
space usage, while galaxy schema is more suitable for complex data aggregation. Snowflake schema could
be superior for an organization that wants lower data redundancy without the complexity of galaxy schema.

How StreamSets’ Schema-agnostic Approach Makes Schemas Easy

Our agnostic approach to schema management means that StreamSets data pipeline tools can manage any kind
of schema: simple, complex, or non-existent. With StreamSets you don’t have to spend hours matching
the schema from a legacy origin into your destination. Instead, StreamSets can infer any kind of schema
without you having to lift a finger. If, however, you want to enforce a schema and create hard and fast
validation rules, StreamSets can help you with that as well. Our flexibility in how we manage schemas means
your data teams have less to figure out on their own and more time to spend on what really matters: your data.

Data Warehouse Process Architecture

The process architecture defines an architecture in which the data from the data warehouse is processed for
a particular computation.

Following are the two fundamental process architectures:

Centralized Process Architecture


In this architecture, the data is collected into a single centralized store and processed by a
single machine with large capacity in terms of memory, processor, and storage.

Centralized process architecture evolved with transaction processing and is well suited for small
organizations with one location of service.
It requires minimal resources from both people and system perspectives.

It is very successful when the collection and consumption of data occur at the same location.

Distributed Process Architecture

In this architecture, data and its processing are distributed across data centres. Data is processed
locally, and the results are grouped into centralized storage. Distributed architectures are used to overcome
the limitations of the centralized process architecture, where all the information needs to be collected in
one central location and results are available in one central location.

There are several architectures of the distributed process:

Client-Server

In this architecture, the client does all the information collecting and presentation, while the server does the
processing and management of data.

Three-tier Architecture

With client-server architecture, the client machines need to be connected to a server machine, thus mandating
finite states and introducing latencies and overhead in terms of the records to be carried between clients and
servers. A three-tier architecture addresses this by placing a middle (application) tier between the clients
and the data server.

N-tier Architecture

The n-tier or multi-tier architecture is where clients, middleware, applications, and servers are isolated into
multiple tiers.

Cluster Architecture

In this architecture, machines are connected over a network (in software or hardware) so that they
work together to process information or compute requirements in parallel. Each device in a
cluster is assigned a function that is processed locally, and the result sets are collected by a master
server that returns them to the user.
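
The local-processing and result-collection pattern can be sketched on a single machine, with worker threads
standing in for cluster nodes. This is only an analogy under that assumption, not a real cluster setup; the
function names and data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def node_task(partition):
    """Work done locally on one node: here, a partial sum of its own data."""
    return sum(partition)


def master(partitions):
    """Hand one partition to each worker, then collect and combine the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(node_task, partitions))
    return partials, sum(partials)


# Hypothetical data, already split across three "nodes".
partitions = [[605, 825], [680, 512], [818, 746]]
partials, total = master(partitions)
print(partials)  # one partial result per node
print(total)     # combined result returned to the user
```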

Peer-to-Peer Architecture

This is a type of architecture where there are no dedicated servers and clients. Instead, all the processing
responsibilities are allocated among all machines, called peers. Each machine can perform the function of a
client or a server, or just process data.

Types of Database Parallelism

Parallelism is used to support speedup, where queries are executed faster because more resources, such as
processors and disks, are provided. Parallelism is also used to provide scale-up, where increasing workloads
are managed without increasing response time, via an increase in the degree of parallelism.

Different architectures for parallel database systems are shared-memory, shared-disk, shared-nothing, and
hierarchical structures.

(a) Horizontal Parallelism: The database is partitioned across multiple disks, and parallel
processing occurs within a specific task (e.g., a table scan) that is performed concurrently on different
processors against different sets of data.

(b) Vertical Parallelism: It occurs among different tasks. All component query operations (e.g., scan, join,
and sort) are executed in parallel in a pipelined fashion. In other words, the output of one operation (e.g.,
scan) becomes the input of another operation (e.g., join) as soon as records become available.
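
The pipelined data flow of vertical parallelism can be illustrated with Python generators. Generators interleave
on a single thread, so this sketch shows only the record-at-a-time hand-off between operations, not true parallel
execution; the operator names, rows, and log are hypothetical.

```python
def scan(rows, log):
    """Scan operator: emits one fact row at a time."""
    for row in rows:
        log.append(("scan", row[0]))
        yield row


def join(rows, item_names, log):
    """Join operator: consumes each row as soon as the scan produces it."""
    for item_key, amount in rows:
        log.append(("join", item_key))
        yield item_names[item_key], amount


log = []
fact_rows = [(1, 100), (2, 50), (1, 25)]       # (item_key, amount)
item_names = {1: "Phone X", 2: "Router Z"}     # small lookup side of the join

result = list(join(scan(fact_rows, log), item_names, log))
print(result)
print(log)  # scan and join steps alternate instead of running one after the other
```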

Intraquery Parallelism
Intraquery parallelism is the execution of a single query in parallel on multiple processors and disks.
Using intraquery parallelism is essential for speeding up long-running queries. Interquery parallelism does
not help here, since each individual query is still run sequentially.

Intraquery parallelism decomposes the serial SQL query into lower-level operations such as scan,
join, sort, and aggregation.

These lower-level operations are executed concurrently, in parallel.

Interquery Parallelism
In interquery parallelism, different queries or transactions execute in parallel with one another.

This form of parallelism can increase transaction throughput. The response times of individual transactions
are no faster than they would be if the transactions were run in isolation.

Thus, the primary use of interquery parallelism is to scale up a transaction processing system to support a
larger number of transactions per second.

Database vendors started to take advantage of parallel hardware architectures by implementing multiserver
and multithreaded systems designed to handle a large number of client requests efficiently.

This approach naturally resulted in interquery parallelism, in which different server threads (or processes)
handle multiple requests at the same time.

Interquery parallelism has been successfully implemented on SMP systems, where it increased the
throughput and allowed the support of more concurrent users.

Data Warehouse Tools

The tools that source the proper data contents and formats from operational and external data stores into the
data warehouse have to perform several essential tasks, including:
• Data consolidation and integration.
• Data transformation from one form to another.
• Data transformation and calculation based on the application of business rules that drive the
transformation.
• Metadata synchronization and management, which includes storing or updating metadata about
source files, transformation actions, loading formats, and events.

There are several selection criteria which should be considered when choosing tools for implementing a data
warehouse:

1. The ability to identify the data in the data source environment that can be read by the tool is necessary.
2. Support for flat files, indexed files, and legacy DBMSs is critical.
3. The capability to merge records from multiple data stores is required in many installations.
4. The specification interface to indicate the information to be extracted and the conversion criteria is
essential.
5. The ability to read information from repository products or data dictionaries is desired.
6. The code generated by the tool should be completely maintainable.
7. Selective data extraction of both data items and records enables users to extract only the required
data.
8. A field-level data examination for the transformation of data into information is needed.
9. The ability to perform data type and character-set translation is a requirement when moving data
between incompatible systems.
10. The ability to create aggregation, summarization, and derivation fields and records is necessary.
11. Vendor stability and support for the products are components that must be evaluated carefully.

Data Warehouse Software Components:

A warehousing team will require different types of tools during a warehouse project. These software
products usually fall into one or more of the categories illustrated in the figure.

Extraction and Transformation


The warehouse team needs tools that can extract, transform, integrate, clean, and load information from a
source system into one or more data warehouse databases. Middleware and gateway products may be needed
for warehouses that extract records from a host-based source system.
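
The extract, transform, clean, and load steps can be sketched with Python's standard csv and sqlite3 modules. The
source text, the cleaning rules, and the target table below are hypothetical; a real warehouse project would
typically rely on a dedicated ETL tool.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (note the stray spaces,
# inconsistent capitalization, and one missing amount).
RAW = """item,city,amount
 Phone X ,delhi,605
Router Z,Mumbai,
Phone X,MUMBAI,818
"""


def extract(text):
    """Extract: read the source records."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform and clean: trim text, normalize city names, drop incomplete rows."""
    clean = []
    for r in records:
        if not r["amount"].strip():
            continue  # cleaning rule: skip rows with a missing measure
        clean.append((r["item"].strip(), r["city"].strip().title(), int(r["amount"])))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (item TEXT, city TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
loaded = conn.execute("SELECT item, city, amount FROM sales ORDER BY amount").fetchall()
print(loaded)
```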

Warehouse Storage
Software products are also needed to store warehouse data and their accompanying metadata. Relational
database management systems are well suited to large and growing warehouses.

Data access and retrieval


Different types of software are needed to access, retrieve, distribute, and present warehouse data to its
end clients.
