CCS341 Data Warehousing
REGULATION 2021
Ex.No Title of Experiment
1 Data Exploration and Integration with WEKA
2 Apply WEKA Tool for Data Validation
3 Plan the Architecture for Real Time Application
4 Write the Query for Schema Definition
5 Design Data Warehouse for Real Time Applications
6 Analyse the Dimensional Modeling
7 Case Study Using OLAP
8 Case Study Using OLTP
9 Implementation of Warehouse Testing
EX.NO:1 DATA EXPLORATION AND INTEGRATION WITH WEKA
AIM:
To explore and integrate any data set using the WEKA tool.
PROCEDURE:
Open a new file by clicking on "Open file..." and choosing a file with the .arff extension.
1. PREPROCESSING:
Loading Data:
The first four buttons at the top of the preprocess section enable you to load data into
WEKA:
1. Open file.... Brings up a dialog box allowing you to browse for the data file on the local
file system.
2. Open URL.... Asks for a Uniform Resource Locator address for where the data is stored.
3. Open DB.... Reads data from a database. (Note that to make this work you might have to
edit the file weka/experiment/DatabaseUtils.props.)
4. Generate.... Enables you to generate artificial data from a variety of Data Generators.
2. Classification:
Selecting a Classifier
At the top of the classify section is the Classifier box. This box has a text field that gives the
name of the currently selected classifier and its options. Clicking on the text box with the left
mouse button brings up a GenericObjectEditor dialog box, just the same as for filters, that
you can use to configure the options of the current classifier.
Test Options
Use training set: The classifier is evaluated on how well it predicts the class of
the instances it was trained on.
Supplied test set: The classifier is evaluated on how well it predicts the class of
a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you
to choose the file to test on.
3. Clustering:
Cluster Modes:
The Cluster mode box is used to choose what to cluster and how to
evaluate the results. The first three options are the same as for classification: Use training
set, Supplied test set and Percentage split.
4. Associating:
Setting Up
This panel contains schemes for learning association rules, and the
learners are chosen and configured in the same way as the clusterers, filters, and classifiers
in the other panels.
5. Selecting Attributes:
6. Visualizing:
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes
a list of instances sharing a set of attributes. ARFF files were developed by the Machine
Learning Project at the Department of Computer Science of The University of Waikato for
use with the Weka machine learning software.
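A minimal sketch of the format, using a small hypothetical weather relation (the attribute names and values here are illustrative only), looks like this:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute play {yes, no}
@data
sunny, 85, 85, no
overcast, 83, 86, yes
rainy, 70, 96, yes
The header section declares the relation and its attributes, and each line after @data is one instance with comma-separated values in the declared attribute order.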
EX.NO:2 APPLY WEKA TOOL FOR DATA VALIDATION
AIM:
To apply the WEKA tool for data validation on a given data set.
PROCEDURE:
STEP 1: PREPROCESSING
2. Select the Filter option, apply the Resample filter and see the results below.
3. Select another filter option and apply the Discretize filter, then start the attribute
selection process by clicking on the "Start" button.
STEP 2: CLASSIFY
PROCEDURE:
2. Go to the Classify tab; in the left-hand navigation bar we can see the different
classification algorithms under the trees section.
STEP 3: ASSOCIATE
PROCEDURE:
1. Load the dataset (Breast-Cancer.arff) into the WEKA tool and select the discretize filter
and apply it.
STEP 4: CLUSTER
PROCEDURE:
1. Load the dataset (Iris.arff) into the WEKA tool.
2. Go to the Cluster tab; in the left-hand navigation bar we can see the different clustering
algorithms.
3. Select the SimpleKMeans algorithm and click on the Start option with the "Use training
set" test option enabled.
4. Then we will get the sum of squared errors, centroids, number of iterations and clustered
instances as represented below.
STEP 5: SELECT ATTRIBUTES
PROCEDURE:
STEP 6: VISUALIZE
PROCEDURE:
3. The dataset attributes are marked on the x-axis and y-axis while the instances are plotted.
4. The plot for a chosen x-axis attribute and y-axis attribute can be enlarged.
EX.NO:3 PLAN THE ARCHITECTURE FOR REAL TIME APPLICATION
AIM:
To draw the architecture for a real time sales analysis application.
PROCEDURE:
STEP 3: PHYSICAL MODEL
The physical data model describes how the model will be implemented in the database.
A physical database model shows all table structures, column names, data types,
constraints, primary keys, foreign keys, and relationships between tables.
The purpose of physical data modeling is the mapping of the logical data model to
the physical structures of the RDBMS system hosting the data warehouse.
This involves defining physical RDBMS structures, such as the tables and data types to
use when storing the information. It may also include the definition of new data structures
for enhancing query performance.
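For illustration, a fragment of a physical model for a simple sales star schema might be written as follows (the table, column and index names are assumptions for this sketch, not part of the experiment's data set):
CREATE TABLE dim_product (
product_key INT PRIMARY KEY,
product_name VARCHAR(255) NOT NULL,
category VARCHAR(100)
);
CREATE TABLE fact_sales (
sale_id INT PRIMARY KEY,
product_key INT NOT NULL,
sale_date DATE NOT NULL,
quantity INT NOT NULL,
sales_amount DECIMAL(10, 2) NOT NULL,
FOREIGN KEY (product_key) REFERENCES dim_product (product_key)
);
-- An additional physical structure defined purely to enhance query performance
CREATE INDEX ix_fact_sales_date ON fact_sales (sale_date);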
EX.NO:4 WRITE THE QUERY FOR SCHEMA DEFINITION
AIM:
To write queries for schema definition using SQL.
STEP 2:
Choose the 'file' option and open the dataset you want to query
STEP 3:
Run the query by clicking the 'Execute' button and the output will be shown below.
QUERY:
CREATE SCHEMA production;
go
CREATE SCHEMA sales;
go
-- create tables
CREATE TABLE production.categories (
category_id INT IDENTITY (1, 1) PRIMARY KEY,
category_name VARCHAR (255) NOT NULL
);
CREATE TABLE production.brands (
brand_id INT IDENTITY (1, 1) PRIMARY KEY,
brand_name VARCHAR (255) NOT NULL
);
CREATE TABLE production.products (
product_id INT IDENTITY (1, 1) PRIMARY KEY,
product_name VARCHAR (255) NOT NULL,
brand_id INT NOT NULL,
category_id INT NOT NULL,
model_year SMALLINT NOT NULL,
list_price DECIMAL (10, 2) NOT NULL,
FOREIGN KEY (category_id) REFERENCES production.categories (category_id) ON
DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (brand_id) REFERENCES production.brands (brand_id) ON DELETE
CASCADE ON UPDATE CASCADE
);
CREATE TABLE sales.customers (
customer_id INT IDENTITY (1, 1) PRIMARY KEY,
first_name VARCHAR (255) NOT NULL,
last_name VARCHAR (255) NOT NULL,
phone VARCHAR (25),
email VARCHAR (255) NOT NULL,
street VARCHAR (255),
city VARCHAR (50),
state VARCHAR (25),
zip_code VARCHAR (5)
);
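-- Sketch of the sales.stores and sales.staffs definitions (the column lists here are
-- assumptions, consistent with the foreign keys used in the tables below); the manager_id
-- foreign key that follows closes sales.staffs.
CREATE TABLE sales.stores (
store_id INT IDENTITY (1, 1) PRIMARY KEY,
store_name VARCHAR (255) NOT NULL,
phone VARCHAR (25),
email VARCHAR (255),
street VARCHAR (255),
city VARCHAR (255),
state VARCHAR (10),
zip_code VARCHAR (5)
);
CREATE TABLE sales.staffs (
staff_id INT IDENTITY (1, 1) PRIMARY KEY,
first_name VARCHAR (50) NOT NULL,
last_name VARCHAR (50) NOT NULL,
email VARCHAR (255) NOT NULL,
phone VARCHAR (25),
active TINYINT NOT NULL,
store_id INT NOT NULL,
manager_id INT,
FOREIGN KEY (store_id) REFERENCES sales.stores (store_id) ON DELETE CASCADE ON
UPDATE CASCADE,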
FOREIGN KEY (manager_id) REFERENCES sales.staffs (staff_id) ON DELETE NO ACTION
ON UPDATE NO ACTION
);
CREATE TABLE sales.orders (
order_id INT IDENTITY (1, 1) PRIMARY KEY,
customer_id INT,
order_status tinyint NOT NULL,
-- Order status: 1 = Pending; 2 = Processing; 3 = Rejected; 4 = Completed
order_date DATE NOT NULL,
required_date DATE NOT NULL,
shipped_date DATE,
store_id INT NOT NULL,
staff_id INT NOT NULL,
FOREIGN KEY (customer_id) REFERENCES sales.customers (customer_id) ON DELETE
CASCADE ON UPDATE CASCADE,
FOREIGN KEY (store_id) REFERENCES sales.stores (store_id) ON DELETE CASCADE ON
UPDATE CASCADE,
FOREIGN KEY (staff_id) REFERENCES sales.staffs (staff_id) ON DELETE NO ACTION ON
UPDATE NO ACTION
);
CREATE TABLE sales.order_items (
order_id INT,
item_id INT,
product_id INT NOT NULL,
quantity INT NOT NULL,
list_price DECIMAL (10, 2) NOT NULL,
discount DECIMAL (4, 2) NOT NULL DEFAULT 0,
PRIMARY KEY (order_id, item_id),
FOREIGN KEY (order_id) REFERENCES sales.orders (order_id) ON DELETE CASCADE
ON UPDATE CASCADE,
FOREIGN KEY (product_id) REFERENCES production.products (product_id) ON DELETE
CASCADE ON UPDATE CASCADE
);
CREATE TABLE production.stocks (
store_id INT,
product_id INT,
quantity INT,
PRIMARY KEY (store_id, product_id),
FOREIGN KEY (store_id) REFERENCES sales.stores (store_id) ON DELETE CASCADE ON
UPDATE CASCADE,
FOREIGN KEY (product_id) REFERENCES production.products (product_id) ON DELETE
CASCADE ON UPDATE CASCADE
);
STEP 4:
Now refresh the database to see the updated tables.
Ex.No.5 DESIGN DATA WAREHOUSE FOR REAL TIME APPLICATIONS
AIM:
To design a data warehouse for a real time application.
Step 3: Drag and drop the tables from the database for OLAP operations
Step 4: In our case we are selecting Stores, Orders and Staff tables from Sales Schema.
Step 5: Set the relationship between the tables.
Step 6: Dimensions and measures from the selected tables are displayed in Data tab
Step 7: Create a Roll up operation: Manager Id -> Store Id -> Staff Id (see the SQL sketch
after Step 9).
Step 8: Drag and drop the columns and rows to get the analysis.
Bar Chart
Heat Map
Bar Chart
Table Format
Step 9: In the analytics tab, click total to show more discrete dimensions.
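The roll up of Step 7 can also be expressed directly in SQL against the sales schema of Ex. No. 4; a sketch (the aggregated measure chosen here is illustrative) is:
SELECT st.manager_id, st.store_id, st.staff_id, COUNT(o.order_id) AS order_count
FROM sales.staffs st
JOIN sales.orders o ON o.staff_id = st.staff_id
GROUP BY ROLLUP (st.manager_id, st.store_id, st.staff_id);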
EX.NO:6 ANALYSE THE DIMENSIONAL MODELING
AIM:
To analyse dimensional modeling schemas.
PROCEDURE:
1. SNOWFLAKE SCHEMA:
The snowflake schema is a variant of the star schema in which the centralized fact
table is connected to dimension tables that are further normalized into related
sub-dimension tables (see the SQL sketch after item 2).
2. FACT CONSTELLATION:
A fact constellation means two or more fact tables sharing one or more
dimension tables. It is also called the galaxy schema.
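A combined SQL sketch of both ideas (all names here are illustrative): the product dimension is snowflaked into a separate category table, and two fact tables share the same product dimension, forming a small fact constellation.
-- Snowflaked dimension: dim_product is normalized into dim_category
CREATE TABLE dim_category (
category_id INT PRIMARY KEY,
category_name VARCHAR(100)
);
CREATE TABLE dim_product (
product_id INT PRIMARY KEY,
product_name VARCHAR(255),
category_id INT REFERENCES dim_category (category_id)
);
-- Fact constellation: two fact tables sharing the product dimension
CREATE TABLE fact_sales (
sale_id INT PRIMARY KEY,
product_id INT REFERENCES dim_product (product_id),
sale_date DATE,
sales_amount DECIMAL(10, 2)
);
CREATE TABLE fact_shipping (
shipment_id INT PRIMARY KEY,
product_id INT REFERENCES dim_product (product_id),
ship_date DATE,
shipping_cost DECIMAL(10, 2)
);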
EX.NO:7 CASE STUDY USING OLAP
2. Write a query to drill down into the sales data for a particular month to analyze daily
sales performance for a specific product subcategory in a specific region.
ANSWER:
CREATE DATABASE sales_cube_database;
USE sales_cube_database;
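-- Sketch of the time dimension definition (time_id and the column types are assumptions
-- based on the year/month/day values inserted below)
CREATE TABLE time_dimension (
time_id INT PRIMARY KEY AUTO_INCREMENT,
year INT,
month INT,
day INT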
);
-- Create product dimension table
CREATE TABLE product_dimension (
product_id INT PRIMARY KEY AUTO_INCREMENT,
category VARCHAR(50),
subcategory VARCHAR(50)
);
-- Insert values into time dimension (the first two rows here are assumed)
INSERT INTO time_dimension (year, month, day) VALUES
(2024, 1, 1),
(2024, 1, 2),
(2024, 1, 3),
(2024, 1, 4),
(2024, 2, 1),
(2024, 2, 2),
(2024, 2, 3),
(2024, 2, 4);
SELECT * FROM time_dimension;
-- Insert values into product dimension
INSERT INTO product_dimension (category, subcategory) VALUES
('Clothing', 'T-shirts');
SELECT * FROM product_dimension;
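-- Sketch of the region dimension definition (column types assumed from the values
-- inserted below)
CREATE TABLE region_dimension (
region_id INT PRIMARY KEY AUTO_INCREMENT,
country VARCHAR(50),
state VARCHAR(50)
);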
-- Insert values into region dimension
INSERT INTO region_dimension (country, state) VALUES
('USA', 'California'),
('USA', 'New York');
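-- Sketch of the fact table definition (the column names are assumptions based on the
-- joins used in the roll-up and drill-down queries below)
CREATE TABLE sales_data_cube (
sale_id INT PRIMARY KEY,
time_id INT,
product_id INT,
region_id INT,
sales_amount DECIMAL(10, 2)
);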
-- Insert values into the sales data cube (the column list is assumed)
INSERT INTO sales_data_cube (sale_id, time_id, product_id, region_id, sales_amount)
VALUES
(1, 1, 1, 1, 1500.50),
(2, 2, 2, 1, 2500.75),
(3, 3, 3, 1, 500.25),
(4, 4, 1, 2, 1800.30),
(5, 5, 2, 2, 2800.90),
(6, 6, 3, 2, 600.45),
(7, 7, 1, 1, 1600.60),
(8, 8, 2, 1, 2700.80),
(9, 1, 3, 1, 550.75),
(10, 2, 1, 2, 1900.35),
(11, 3, 2, 2, 2900.95),
(12, 4, 3, 2, 620.50);
SELECT * FROM sales_data_cube;
1. Roll-up Query:
To roll up sales data from daily to monthly or yearly for a specific product category and
region:
SELECT
t.year,
t.month,
SUM(s.sales_amount) AS total_sales
FROM
sales_data_cube s
JOIN
time_dimension t ON s.time_id = t.time_id
JOIN
product_dimension p ON s.product_id = p.product_id
JOIN
region_dimension r ON s.region_id = r.region_id
WHERE
p.category = 'Clothing' -- example category (assumed)
AND r.state = 'California' -- example region (assumed)
GROUP BY
t.year,
t.month;
2. Drill-down Query:
To drill down into sales data for a particular month and analyze daily sales performance for a
specific product subcategory in a specific region:
SELECT
t.year,
t.month,
t.day,
s.sales_amount
FROM
sales_data_cube s
JOIN
time_dimension t ON s.time_id = t.time_id
JOIN
product_dimension p ON s.product_id = p.product_id
JOIN
region_dimension r ON s.region_id = r.region_id
WHERE
t.year = 2024
AND t.month = 1 -- example month (assumed)
AND p.subcategory = 'T-shirts' -- example subcategory (assumed)
AND r.state = 'California' -- example region (assumed)
ORDER BY
t.day;
Imagine you are tasked with creating a data cube to analyse employee performance data
for a human resources department. The data cube should have dimensions for time
(year, month, day), employee (department, job title) and performance
metrics (sales, revenue, customer satisfaction score).
1. Develop a SQL query to create the necessary tables for storing the employee performance
data and dimensions.
2. Populate the tables with sample data representing employee performance metrics over
multiple years, different departments and job titles, and various performance metrics.
3. Write SQL queries to aggregate the employee performance data across different
dimensions, such as the total sales revenue by year, month, day, department and job title.
4. Write a SQL query to slice the employee performance data for a specific department and
month, showing the total sales revenue for each job title within the department.
5. Write a SQL query to dice the employee performance data, showing the total sales
revenue for a specific department and job title combination across different months.
ANSWER:
1. CREATE DATABASE employee_performance_db;
USE employee_performance_db;
-- Create time dimension table (time_id is assumed as the key used in the later joins)
CREATE TABLE time_dimension (
time_id INT PRIMARY KEY AUTO_INCREMENT,
year INT,
month INT,
day INT
);
-- Create performance metrics table (the column list is assumed from the queries below)
CREATE TABLE performance_metrics (
metric_id INT PRIMARY KEY AUTO_INCREMENT,
time_id INT,
employee_id INT,
sales INT,
revenue DECIMAL(12, 2),
customer_satisfaction_score DECIMAL(3, 1)
);
-- Insert values into time dimension (the first two rows here are assumed)
INSERT INTO time_dimension (year, month, day) VALUES
(2023, 1, 1),
(2023, 1, 2),
(2023, 1, 3),
(2023, 1, 4),
(2023, 2, 1),
(2023, 2, 2),
(2023, 2, 3),
(2023, 2, 4);
SELECT * FROM time_dimension;
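-- Sketch of the employee dimension definition (employee_id and the column types are
-- assumptions based on the insert and queries below)
CREATE TABLE employee_dimension (
employee_id INT PRIMARY KEY AUTO_INCREMENT,
department VARCHAR(50),
job_title VARCHAR(50)
);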
-- Insert values into employee dimension (the rows after the first one are assumed)
INSERT INTO employee_dimension (department, job_title) VALUES
('Sales', 'Sales Associate'),
('Sales', 'Sales Manager'),
('HR', 'HR Executive');
3. SELECT
t.year,
t.month,
SUM(p.sales) AS total_sales,
SUM(p.revenue) AS total_revenue,
AVG(p.customer_satisfaction_score) AS avg_customer_satisfaction
FROM
time_dimension t
JOIN
performance_metrics p ON t.time_id = p.time_id
GROUP BY
t.year,
t.month;
4. SELECT
e.department,
e.job_title,
t.month,
SUM(p.revenue) AS total_revenue
FROM
employee_dimension e
JOIN
performance_metrics p ON e.employee_id = p.employee_id
JOIN
time_dimension t ON p.time_id = t.time_id
WHERE
e.department = 'Sales'
AND t.month = 1
GROUP BY
e.department,
e.job_title,
t.month;
5. SELECT
e.department,
e.job_title,
t.month,
SUM(p.revenue) AS total_revenue
FROM
employee_dimension e
JOIN
performance_metrics p ON e.employee_id = p.employee_id
JOIN
time_dimension t ON p.time_id = t.time_id
WHERE
e.department = 'Sales'
AND e.job_title = 'Sales Associate' -- example job title (assumed)
GROUP BY
e.department,
e.job_title,
t.month;
EX.NO:8 CASE STUDY USING OLTP
-- Use Database
USE CSE8582;
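-- Sketch of the table definitions and the opening of the INSERT statement used below
-- (the column lists are assumptions consistent with the joins and rows that follow)
CREATE TABLE Employee (
emp_id INT PRIMARY KEY AUTO_INCREMENT,
emp_name VARCHAR(100),
salary DECIMAL(10, 2),
department VARCHAR(50),
age INT
);
CREATE TABLE Company (
company_id INT PRIMARY KEY AUTO_INCREMENT,
company_name VARCHAR(100),
employee_id INT
);
-- Insert sample employee rows
INSERT INTO Employee (emp_name, salary, department, age) VALUES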
('Jane Smith', 60000.00, 'HR', 35),
('Michael Johnson', 55000.00, 'Finance', 40),
('Emily Davis', 52000.00, 'Marketing', 28),
('Chris Wilson', 48000.00, 'IT', 32),
('Sarah Brown', 65000.00, 'Finance', 38),
('Kevin Lee', 58000.00, 'Marketing', 27),
('Amanda Miller', 53000.00, 'HR', 33),
('Robert Taylor', 70000.00, 'IT', 45),
('Jennifer Anderson', 62000.00, 'Finance', 42),
('Daniel Thomas', 54000.00, 'Marketing', 29),
('Jessica Martinez', 57000.00, 'HR', 31),
('David Garcia', 67000.00, 'IT', 36),
('Ashley Hernandez', 69000.00, 'Finance', 39),
('Matthew Lopez', 51000.00, 'Marketing', 26);
Inner Join
SELECT *
FROM Employee
INNER JOIN Company ON Employee.emp_id = Company.employee_id;
Right Join
SELECT *
FROM Employee
RIGHT JOIN Company ON Employee.emp_id = Company.employee_id;
Left Join
SELECT *
FROM Employee
LEFT JOIN Company ON Employee.emp_id = Company.employee_id;
Self Join
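A sketch of a self join on the Employee table, pairing employees within the same department (illustrative only):
SELECT e1.emp_name AS employee, e2.emp_name AS colleague
FROM Employee e1
JOIN Employee e2 ON e1.department = e2.department AND e1.emp_id <> e2.emp_id;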
Group By
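A sketch of a group by query on the Employee table (illustrative only):
SELECT department, AVG(salary) AS avg_salary
FROM Employee
GROUP BY department;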
Having
SELECT department, AVG(salary) AS avg_salary -- the SELECT list here is assumed
FROM Employee
GROUP BY department
HAVING AVG(salary) > 55000;
-- Having Count
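-- A sketch using COUNT with HAVING on the Employee table (illustrative only)
SELECT department, COUNT(*) AS employee_count
FROM Employee
GROUP BY department
HAVING COUNT(*) > 2;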
Roll Up
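A sketch of a roll up aggregation on the Employee table (MySQL-style WITH ROLLUP; illustrative only):
SELECT department, SUM(salary) AS total_salary
FROM Employee
GROUP BY department WITH ROLLUP;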
Subquery
SELECT emp_id, emp_name, salary, department
FROM Employee
WHERE salary > (SELECT AVG(salary) FROM Employee e2 WHERE e2.department =
Employee.department);
Cross Join
SELECT *
FROM Employee
CROSS JOIN Company;
EX.NO:9 IMPLEMENTATION OF WAREHOUSE TESTING
AIM:
To implement warehouse testing.
PROCEDURE:
QUERYSURGE
QuerySurge is the smart Data Testing solution that automates the data
validation and ETL testing of Big Data, Data Warehouses, Business Intelligence Reports and
Enterprise Applications with full DevOps functionality for continuous testing.
By analyzing and pinpointing any differences in the data, QuerySurge
ensures that the data extracted from source systems remains intact in the target and
complies with transformation requirements. QuerySurge is an essential asset to every data
testing process.
TERMS:
Query Pair - A pair of SQL queries with one query that retrieves data from a
source file or database and another SQL query that retrieves data from a target database,
data warehouse, or data mart (an example Query Pair is shown after this list).
Agent - Performs the actual query tasks. Agents execute queries against source
and target data stores and return the results to the QuerySurge database.
Query Snippet - A reusable piece of SQL code that can be embedded in one or
more queries. The purpose of a Snippet is to minimize the number of manual changes needed
in different SQL calls when they contain the same code.
Query Wizard - A tool that allows you to generate Query Pairs automatically,
requiring no SQL coding. It is a fast and easy way to create Query Pairs for both manual
testers who do not have SQL skills and testers who are skilled at SQL and want to speed up
test authoring. The Query Wizard generates tests that can cover about 80% of all data in a
data warehouse automatically.
Command Line Integration - Provides you with the ability to schedule Test Suites
to run using Windows Task Scheduler or integrate with a continuous build system.
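For example, a Query Pair might compare aggregated totals between a source staging table and a target warehouse table (the table and column names below are illustrative, not from any particular project):
-- Source query (run against the source database)
SELECT customer_id, SUM(order_total) AS total_sales
FROM stg_orders
GROUP BY customer_id;
-- Target query (run against the data warehouse)
SELECT customer_key, SUM(sales_amount) AS total_sales
FROM fact_sales
GROUP BY customer_key;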
INSTALLATION:
1. On the installation machine, launch a supported browser
2. In the URL field type ‘http://localhost/QuerySurge/’
3. QuerySurge will launch.
4. If your QuerySurge install is local, enter your credentials or use the
credentials admin/admin. If you are using a Cloud trial, use the credentials
clouduser/clouduser. Click Login.
Creating connections to your systems:
When you create a QuerySurge Connection, the Connection Wizard will
guide you through the process. Different types of QuerySurge connections require different
types of information.
For connections that the Connection Wizard does not handle explicitly, you
can set up a connection using QuerySurge's Connection Extensibility feature. Typically, you
will need the following information (check with a DBA or other knowledgeable resource in
your organization, or consult JDBC driver documentation); the details depend on the
Connection you plan to set up:
• JDBC Driver Class
• JDBC Connection URL (which may contain some of the common items below)
o Server Name or IP address of the Server (e.g. db.myserver.com, or
192.168.0.255)
o The port for your database
• Database login credentials (Username and Password)
3. Click on the Add button at the bottom left of the page to launch the
Connection Wizard
4. Click the Next button.
5. Provide a name for your connection in the Connection Name field.
6. Select your Data Source from the dropdown
7. Provide the appropriate Driver Class for the JDBC driver you are using.
8. Provide the connection information to your database. This includes the Connection URL
(which may contain the Server name or IP address and the port), the login credentials, and
an optional Test Query that will run to verify the Connection details.
9. Click the Next button.
10. Click the Test Connection button if you entered a Test Query
11. Click the Save button.
TESTING:
1. Setting up Connections:
Users configure QuerySurge to connect to their various data sources,
such as databases (Oracle, SQL Server, MySQL, etc.), data warehouses (Teradata, Redshift,
Snowflake, etc.), flat files (CSV, Excel, etc.), and big data platforms (Hadoop, Hive, etc.).
2. Designing Tests:
Once connections are established, users create test cases within QuerySurge.
These test cases define the data queries to be executed against the source and target data,
as well as any assertions or validations to be performed on the results.
3. Executing Tests:
Users can run individual tests or test suites within QuerySurge. During test
execution, QuerySurge compares the data retrieved from the source against the data
retrieved from the target (e.g., a data warehouse) based on the defined test cases.
4. Analyzing Results:
QuerySurge provides detailed reports and dashboards showing the results of
the test execution. It highlights any discrepancies or failures between the source and target
data, allowing users to identify and investigate data quality issues.
5. Iterative Testing:
Users can refine their test cases and rerun tests as needed to ensure data
accuracy and integrity throughout the development lifecycle. QuerySurge supports
automated scheduling of tests, allowing for continuous integration and regression testing.
6. Integrations:
QuerySurge can integrate with various continuous integration and continuous
delivery (CI/CD) tools such as Jenkins, Bamboo, and TeamCity, enabling seamless
incorporation of data testing into the software development pipeline.
7. Point-to-Point Testing:
The QuerySurge ETL testing process mimics the ETL development process by testing
data from point to point along the data warehouse lifecycle and can provide 100% coverage
of your data mappings.