CCS341_Data Warehousing_Unit 4 Notes
UNIT-IV
Multidimensional Model:
A multidimensional model views data in the form of a data-cube. A data cube enables data to be modelled
and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities concerning which an organization keeps records. For
example, a shop may create a sales data warehouse to keep records of the store's sales for the dimensions time,
item, and location. These dimensions allow the store to keep track of things, for example, monthly sales of
items and the locations at which the items were sold. Each dimension has a table related to it, called a
dimensional table, which describes the dimension further. For example, a dimensional table for an item may
contain the attributes item name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This theme is
represented by a fact table. Facts are numerical measures. The fact table contains the names of the facts or
measures of the related dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In
this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the
item dimension (classified according to the types of items sold). The fact or measure displayed is rupees
sold (in thousands).
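To make the 2D representation concrete, the following is a minimal sketch in Python using pandas; the item names and sales figures are illustrative assumptions, not the actual table from the notes.

import pandas as pd

# Illustrative sales facts for Delhi: one row per (quarter, item type),
# measure = rupees sold (in thousands). The numbers are made up.
sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q3", "Q3", "Q4", "Q4"],
    "item":     ["Keyboard", "Mobile", "Keyboard", "Mobile",
                 "Keyboard", "Mobile", "Keyboard", "Mobile"],
    "rupees_sold_k": [605, 825, 680, 952, 812, 1023, 927, 1038],
})

# 2D view: time dimension as rows, item dimension as columns.
view_2d = sales.pivot_table(index="quarter", columns="item",
                            values="rupees_sold_k", aggfunc="sum")
print(view_2d)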
Now suppose we want to view the sales data with a third dimension. For example, suppose the data is considered
according to time and item, as well as location, for the cities Chennai, Kolkata, Mumbai, and Delhi.
These 3D data are shown in the table. The 3D data of the table are represented as a series of 2D tables.
Conceptually, the same data may also be represented in the form of a 3D data cube, as shown in fig:
The working of a Multi-Dimensional Data Model proceeds in stages:
Stage 2: Grouping different segments of the system - In the second stage, the Multi-Dimensional Data
Model recognizes and classifies all the data into the respective sections they belong to, and organizes it
so that it can be applied step by step without problems.
Stage 3: Noticing the different proportions - The third stage forms the basis on which the design of the
system rests. In this stage, the main factors are recognized according to the user's point of view. These
factors are also known as "Dimensions".
Stage 4: Preparing the actual-time factors and their respective qualities - In the fourth stage, the
factors recognized in the previous step are used to identify their related qualities.
These qualities are also known as "attributes" in the database.
Stage 5: Finding the actuality of factors which are listed previously and their qualities - In the fifth
stage, the Multi-Dimensional Data Model separates the actuality (the facts) from the factors collected
earlier and from their qualities. These facts play a significant role in the arrangement of a Multi-Dimensional
Data Model.
Stage 6: Building the Schema to place the data, with respect to the information collected from the
steps above - In the sixth stage, on the basis of the data which was collected previously, a Schema is
built.
Data is grouped or combined into multidimensional matrices called data cubes. The data cube method
has a few alternative names or variants, such as "multidimensional databases," "materialized views,"
and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are frequently
queried.
For example, a relation with the schema sales (part, supplier, customer, and sale-price) can be materialized
into a set of eight views as shown in fig, where psc indicates a view consisting of aggregate function value
(such as total-sales) computed by grouping three attributes part, supplier, and customer, p indicates a view
composed of the corresponding aggregate function values calculated by grouping part alone, etc.
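As a rough illustration of materializing these eight views, the sketch below uses Python with pandas and itertools; the sales rows, values, and column names are assumptions chosen only for demonstration.

from itertools import combinations
import pandas as pd

# Illustrative sales relation: sales(part, supplier, customer, sale_price).
sales = pd.DataFrame({
    "part":       ["p1", "p1", "p2", "p2"],
    "supplier":   ["s1", "s2", "s1", "s2"],
    "customer":   ["c1", "c1", "c2", "c2"],
    "sale_price": [100, 150, 200, 250],
})

dimensions = ["part", "supplier", "customer"]
views = {}
# Every subset of {part, supplier, customer} yields one view: psc, ps, pc, sc,
# p, s, c, and the empty grouping (grand total) -- eight views in all.
for r in range(len(dimensions), -1, -1):
    for group in combinations(dimensions, r):
        name = "".join(col[0] for col in group) or "none"
        if group:
            views[name] = sales.groupby(list(group))["sale_price"].sum()
        else:
            views[name] = sales["sale_price"].sum()  # apex: total sales

for name, view in views.items():
    print(name, ":\n", view, "\n")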
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure
attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or
functional attributes. The measure attributes are aggregated according to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the store's sales for the dimensions
time, item, branch, and location. These dimensions enable the store to keep track of things like monthly sales
of items, and the branches and locations at which the items were sold. Each dimension may have a table
associated with it, known as a dimension table, which describes the dimension. For example, a dimension
table for items may contain the attributes item name, brand, and type.
The data cube method is an interesting technique with many applications. Data cubes could be sparse in many
cases because not every cell in each dimension may have corresponding data in the database.
Techniques should be developed to handle sparse cubes efficiently.
If a query contains constants at even lower levels than those provided in a data cube, it is not clear how to
make the best use of the precomputed results stored in the data cube.
The model views data in the form of a data cube. OLAP tools are based on the multidimensional data model.
Data cubes usually model n-dimensional data.
A data cube enables data to be modelled and viewed in multiple dimensions. A multidimensional data model
is organized around a central theme, like sales and transactions. A fact table represents this theme. Facts are
numerical measures. Thus, the fact table contains measure (such as Rs. sold) and keys to each of the related
dimensional tables.
Dimensions are the entities or perspectives that define a data cube. Facts are generally quantities, which are
used for analyzing the relationship between dimensions.
Example: In the 2-D representation, we will look at the All Electronics sales data for items sold per quarter in
the city of Vancouver. The measure displayed is dollars sold (in thousands).
3-Dimensional Cuboids
Let us suppose we would like to view the sales data with a third dimension. For example, suppose we would
like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto,
and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown in the table.
The 3-D data of the table are represented as a series of 2-D tables.
Conceptually, we may represent the same data in the form of 3-D data cubes, as shown in fig:
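One way to picture the series of 2-D tables is to group an illustrative 3-D data set by location with pandas, printing one quarter-by-item slice per city; the cities, items, and figures below are assumptions for demonstration only.

import pandas as pd

# Illustrative 3-D sales data: (time, item, location) -> dollars sold (in thousands).
sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q1", "Q1", "Q2", "Q2", "Q2", "Q2"],
    "item":     ["Phone", "TV"] * 4,
    "location": ["Chicago", "Chicago", "Vancouver", "Vancouver",
                 "Chicago", "Chicago", "Vancouver", "Vancouver"],
    "dollars_sold_k": [820, 605, 1087, 818, 935, 680, 966, 894],
})

# The 3-D cube viewed as a series of 2-D tables: one slice per city.
for city, slice_2d in sales.groupby("location"):
    print(f"location = {city}")
    print(slice_2d.pivot_table(index="quarter", columns="item",
                               values="dollars_sold_k"), "\n")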
Let us suppose that we would like to view our sales data with an additional fourth dimension, such as a
supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest level of
summarization is called a base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location, and supplier
dimensions.
The figure shows a 4-D data cube representation of sales data, according to the dimensions time, item,
location, and supplier. The measure displayed is dollars sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex cuboid. In
this example, this is the total sales, or dollars sold, summarized over all four dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids creating a 4-D data cube for
the dimensions time, item, location, and supplier. Each cuboid represents a different degree of summarization.
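A quick way to see the lattice is to enumerate every cuboid for the four dimensions; the following Python sketch simply lists each group-by combination from the 4-D base cuboid down to the 0-D apex cuboid.

from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# The lattice of cuboids: every subset of the four dimensions is one cuboid.
# The full set is the base cuboid (4-D); the empty set is the apex cuboid (0-D).
for k in range(len(dimensions), -1, -1):
    for cuboid in combinations(dimensions, k):
        label = ", ".join(cuboid) if cuboid else "apex (all)"
        print(f"{k}-D cuboid: {label}")

# A 4-dimensional cube therefore has 2**4 = 16 cuboids in its lattice.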
Schemas Used in Data Warehouses: Star, Galaxy (Fact constellation), and Snowflake:
We can think of a data warehouse schema as a blueprint or an architecture of how data will be stored and
managed. A data warehouse schema isn’t the data itself, but the organization of how data is stored and how it
relates to other data components within the data warehouse architecture.
In the past, data warehouse schemas were often strictly enforced across an enterprise, but in modern
implementations where storage is increasingly inexpensive, schemas have become less constrained. Despite this
loosening or sometimes total abandonment of data warehouse schemas, knowledge of the foundational
schema designs can be important both for maintaining legacy resources and for creating modern data warehouse
designs that learn from the past.
The basic components of all data warehouse schemas are fact and dimension tables. Different combinations
of these two central elements compose almost the entirety of all data warehouse schema designs.
Fact Table
A fact table aggregates metrics, measurements, or facts about business processes. Fact tables
are connected to dimension tables to form a schema architecture representing how data relates within the data
warehouse. Fact tables store primary keys of dimension tables as foreign keys within the fact table.
Dimension Table
Dimension tables are used to store data attributes or dimensions. As mentioned
above, the primary key of a dimension table is stored as a foreign key in the fact table. Dimension tables are
not joined together. Instead, they are joined via association through the central fact table.
History presents us with three prominent types of data warehouse schema known as Star Schema, Snowflake
Schema, and Galaxy Schema. Each of these data warehouse schemas has unique design constraints and
describes a different organizational structure for how data is stored and how it relates to other data within the
data warehouse.
The star schema in a data warehouse is historically one of the most straightforward designs. This schema
follows some distinct design parameters, such as only permitting one central table and a handful of single-
dimension tables joined to that central table. In following these design constraints, a star schema can resemble
a star with one central table and five dimension tables joined to it (which is where the star schema gets its name).
Star Schema is known to create denormalized dimension tables – a database structuring strategy that organizes
tables to introduce redundancy for improved performance. Denormalization intends to introduce redundancy
in additional dimensions so long as it improves query performance.
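As a minimal sketch of a star schema, the Python example below uses the standard sqlite3 module; the table and column names (dim_time, dim_item, dim_location, fact_sales, and so on) are assumptions chosen for illustration, not a prescribed design.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized dimension tables: each holds all of its attributes directly.
cur.execute("CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER)")
cur.execute("CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT)")
cur.execute("CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT)")

# Central fact table: measures plus foreign keys to each dimension.
cur.execute("""
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    rupees_sold  REAL
)""")

# A typical star join: dimension tables meet only through the fact table.
cur.execute("""
SELECT t.quarter, i.type, SUM(f.rupees_sold)
FROM fact_sales f
JOIN dim_time t ON f.time_key = t.time_key
JOIN dim_item i ON f.item_key = i.item_key
GROUP BY t.quarter, i.type
""")
conn.close()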
The Snowflake Schema is a data warehouse schema that encompasses a logical arrangement of dimension
tables. This data warehouse schema builds on the star schema by adding additional sub-dimension tables that
relate to first-order dimension tables joined to the fact table.
Just like the relationship between the foreign key in the fact table and the primary key in the dimension table,
with the snowflake schema approach, a primary key in a sub-dimension table will relate to a foreign key within
the higher order dimension table.
Snowflake schema creates normalized dimension tables – a database structuring strategy that organizes tables
to reduce redundancy. The purpose of normalization is to eliminate any redundant data to reduce overhead.
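Continuing the illustrative sqlite3 sketch above, a snowflake version might normalize the item dimension by splitting brand details into a sub-dimension table; again, the table and column names are assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Sub-dimension table holding brand attributes once, instead of repeating
# them on every item row (normalization to reduce redundancy).
cur.execute("CREATE TABLE dim_brand (brand_key INTEGER PRIMARY KEY, brand_name TEXT, manufacturer TEXT)")

# First-order dimension table now carries a foreign key to the sub-dimension.
cur.execute("""
CREATE TABLE dim_item (
    item_key  INTEGER PRIMARY KEY,
    item_name TEXT,
    type      TEXT,
    brand_key INTEGER REFERENCES dim_brand(brand_key)
)""")

# The fact table is unchanged: it still references only dim_item, and
# brand details are reached through one extra join (item -> brand).
conn.close()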
The Galaxy Data Warehouse Schema, also known as a Fact Constellation Schema, acts as the next iteration
of the data warehouse schema. Unlike the Star Schema and Snowflake Schema, the Galaxy Schema uses
multiple fact tables connected with shared normalized dimension tables. Galaxy Schema can be thought of as
star schema interlinked and completely normalized, avoiding any kind of redundancy or inconsistency of data.
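A galaxy (fact constellation) variant of the same sketch would keep two fact tables, for example sales and shipping, sharing the same dimension tables; the shipping-related names here are purely illustrative assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Shared dimension tables.
cur.execute("CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER)")
cur.execute("CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT)")

# Two fact tables form the "constellation", each pointing at the same dimensions.
cur.execute("""
CREATE TABLE fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    rupees_sold REAL
)""")
cur.execute("""
CREATE TABLE fact_shipping (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    units_shipped INTEGER
)""")
conn.close()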
To understand data warehouse schema and its various types at the conceptual level, here are a few things to
remember:
Data warehouse schema is a blueprint for how data will be stored and managed. It includes
definitions of terms, relationships, and the arrangement of those terms and relationships.
Star, galaxy, and snowflake are common types of data warehouse schema that vary in the
arrangement and design of the data relationships.
Star schema is the simplest data warehouse schema and contains just one central table and a handful
of single-dimension tables joined together.
Snowflake schema builds on star schema by adding sub-dimension tables, which eliminates
redundancy and reduces overhead costs.
Galaxy schema uses multiple fact tables (snowflake and star use only one), which makes it like an
interlinked star schema. This nearly eliminates redundancy and is ideal for complex database
systems.
There’s no one “best” data warehouse schema. The “best” schema depends on (among other things) your
resources, the type of data you’re working with, and what you’d like to do with it.
For instance, star schema is ideal for organizations that want maximum simplicity and can tolerate higher disk
space usage. But galaxy schema is more suitable for complex data aggregation. And snowflake schema could
be superior for an organization that wants lower data redundancy and can handle a more complex design than star schema.
Process Architecture
The process architecture defines an architecture in which the data from the data warehouse is processed for
a particular computation.
Centralized Process Architecture
Centralized process architecture evolved with transaction processing and is well suited for small
organizations with one location of service.
It requires minimal resources both from people and system perspectives.
It is very successful when the collection and consumption of data occur at the same location.
Distributed Process Architecture
In this architecture, information and its processing are allocated across data centres; the processing of data is
localized within each group, and the results are collected into centralized storage. Distributed architectures are
used to overcome the limitations of the centralized process architecture, where all the information needs to be
collected at one central location and results are available at one central location.
Client-Server
In this architecture, the user does all the information collecting and presentation, while the server does the
processing and management of data.
Three-tier Architecture
With client-server architecture, the client machines need to be connected to a server machine, thus mandating
finite states and introducing latencies and overhead in terms of records to be carried between clients and
servers. The three-tier architecture addresses this by introducing a middle tier (such as an application or
middleware server) between the clients and the database server.
N-tier Architecture
The n-tier or multi-tier architecture is where clients, middleware, applications, and servers are isolated into
multiple tiers.
Cluster Architecture
In this architecture, machines connected in a network architecture (software or hardware) work closely
together to process information or compute requirements in parallel. Each device in a cluster is assigned a
function that is processed locally, and the result sets are collected by a master server, which returns them to
the user.
Peer-to-Peer Architecture
This is a type of architecture where there are no dedicated servers and clients. Instead, all the processing
responsibilities are allocated among all machines, called peers. Each machine can perform the function of a
client or server, or just process data.
Parallelism is used to support speedup, where queries are executed faster because more resources, such as
processors and disks, are provided. Parallelism is also used to provide scale-up, where increasing workloads
are managed without increasing response time, via an increase in the degree of parallelism.
Different architectures for parallel database systems are shared-memory, shared-disk, shared-nothing, and
hierarchical structures.
(a) Horizontal Parallelism: It means that the database is partitioned across multiple disks, and parallel
processing occurs within a specific task (i.e., a table scan) that is performed concurrently on different
processors against different sets of data (a short sketch follows item (b) below).
(b) Vertical Parallelism: It occurs among various tasks. All component query operations (i.e., scan, join, and
sort) are executed in parallel in a pipelined fashion. In other words, the output from one function (e.g., join)
is passed to the next function as soon as records become available.
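As a rough sketch of horizontal parallelism, the Python example below scans partitions of a table concurrently with multiprocessing; the partitioning, records, and filter condition are illustrative assumptions.

from multiprocessing import Pool

# Pretend the table is horizontally partitioned across four "disks":
# each partition is a list of (item, rupees_sold) records.
PARTITIONS = [
    [("Keyboard", 605), ("Mobile", 825)],
    [("Keyboard", 680), ("Mobile", 952)],
    [("Keyboard", 812), ("Mobile", 1023)],
    [("Keyboard", 927), ("Mobile", 1038)],
]

def scan_partition(partition):
    # The same scan task runs concurrently on different sets of data.
    return sum(amount for item, amount in partition if item == "Mobile")

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        partial_sums = pool.map(scan_partition, PARTITIONS)
    print("Total Mobile sales:", sum(partial_sums))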
Intraquery Parallelism
Intraquery parallelism defines the execution of a single query in parallel on multiple processors and disks.
Using intraquery parallelism is essential for speeding up long-running queries.
Interquery parallelism does not help here, since each query is run sequentially. Intraquery parallelism
decomposes the serial SQL query into lower-level operations such as scan, join, sort, and aggregation.
Interquery Parallelism
In interquery parallelism, different queries or transactions execute in parallel with one another.
This form of parallelism can increase transaction throughput. The response times of individual transactions
are not faster than they would be if the transactions were run in isolation.
Thus, the primary use of interquery parallelism is to scale up a transaction processing system to support a
more significant number of transactions per second.
Database vendors started to take advantage of parallel hardware architectures by implementing multiserver
and multithreaded systems designed to handle a large number of client requests efficiently.
This approach naturally resulted in interquery parallelism, in which different server threads (or processes)
handle multiple requests at the same time.
Interquery parallelism has been successfully implemented on SMP systems, where it increased the
throughput and allowed the support of more concurrent users.
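A small sketch of interquery parallelism in Python: several independent queries are submitted at once with a thread pool, so throughput rises even though each individual query is no faster. The database file, table, and queries are illustrative assumptions.

import sqlite3
from concurrent.futures import ThreadPoolExecutor

DB = "warehouse.db"  # illustrative database file

QUERIES = [
    "SELECT COUNT(*) FROM sales",
    "SELECT SUM(rupees_sold) FROM sales",
    "SELECT MAX(rupees_sold) FROM sales",
]

def run_query(sql):
    # Each independent query runs on its own connection, in parallel with the
    # others (interquery parallelism): throughput goes up, but no single
    # query finishes faster than it would alone.
    conn = sqlite3.connect(DB)
    try:
        return conn.execute(sql).fetchone()
    finally:
        conn.close()

if __name__ == "__main__":
    # Set up a tiny sales table so the concurrent queries have data to read.
    setup = sqlite3.connect(DB)
    setup.execute("CREATE TABLE IF NOT EXISTS sales (item TEXT, rupees_sold REAL)")
    setup.execute("INSERT INTO sales VALUES ('Mobile', 825), ('Keyboard', 605)")
    setup.commit()
    setup.close()

    with ThreadPoolExecutor(max_workers=3) as pool:
        for result in pool.map(run_query, QUERIES):
            print(result)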
The tools that allow sourcing of data contents and formats accurately from external data stores into the data
warehouse have to perform several essential tasks, including the following (a minimal sketch follows the list):
Data consolidation and integration.
Data transformation from one form to another form.
Data transformation and calculation based on the application of business rules that force transformation.
Metadata synchronization and management, which includes storing or updating metadata about
source files, transformation actions, loading formats, and events.
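The following minimal Python sketch illustrates these tasks on a toy scale: it consolidates two source extracts, applies a simple business-rule transformation, and records metadata about the load. The file names, fields, and conversion rule are assumptions, not part of the notes.

import csv
import datetime

def extract(path):
    # Read one source extract (flat file) into a list of dictionaries.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(record):
    # Business-rule transformation: convert the amount to thousands of rupees
    # and derive a load-date field (the rule itself is illustrative).
    record["rupees_sold_k"] = float(record["amount"]) / 1000.0
    record["load_date"] = datetime.date.today().isoformat()
    return record

def load(records, target_path, metadata):
    # Write the consolidated, transformed records and update load metadata.
    with open(target_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)
    metadata["rows_loaded"] = len(records)
    metadata["loaded_at"] = datetime.datetime.now().isoformat()

if __name__ == "__main__":
    # Consolidate and integrate data from two (hypothetical) source extracts.
    rows = extract("sales_store1.csv") + extract("sales_store2.csv")
    rows = [transform(r) for r in rows]
    meta = {"sources": ["sales_store1.csv", "sales_store2.csv"]}
    load(rows, "warehouse_sales.csv", meta)
    print(meta)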
There are several selection criteria which should be considered while implementing a data warehouse:
1. The ability to identify the data in the data source environment that can be read by the tool is necessary.
2. Support for flat files, indexed files, and legacy DBMSs is critical.
3. The capability to merge records from multiple data stores is required in many installations.
4. The specification interface to indicate the information to be extracted and the conversion criteria is essential.
5. The ability to read information from repository products or data dictionaries is desired.
6. The code developed by the tool should be completely maintainable.
7. Selective data extraction of both data items and records enables users to extract only the required
data.
8. A field-level data examination for the transformation of data into information is needed.
9. The ability to perform data type and character-set translation is a requirement when moving data
between incompatible systems.
10. The ability to create aggregation, summarization, and derivation fields and records is necessary.
11. Vendor stability and support for the products are components that must be evaluated carefully.
A warehousing team will require different types of tools during a warehouse project. These software
products usually fall into one or more of the categories illustrated in the figure.
Warehouse Storage
Software products are also needed to store warehouse data and their accompanying metadata. Relational
database management systems are well suited to large and growing warehouses.