
MOUNT ZION COLLEGE OF ENGINEERING

AND TECHNOLOGY

PUDUKKOTTAI 622507

DEPARTMENT OF COMPUTER SCIENCE ENGINEERING

REGULATION 2021

CCS341 DATA WAREHOUSING LABORATORY RECORD

III YEAR – VI SEMESTER


CONTENTS

Ex.No   Title of Experiment
1.      Data Exploration and Integration with WEKA
2.      Apply WEKA Tool for Data Validation
3.      Plan the Architecture for Real Time Application
4.      Write the Query for Schema Definition
5.      Design Data Warehouse for Real Time Applications
6.      Analyze the Dimensional Modeling
7.      Case Study using OLAP
8.      Case Study using OLTP
9.      Implementation of Warehouse Testing
EX.NO:1 DATA EXPLORATION AND INTEGRATION WITH WEKA

AIM:

To explore and integrate any data set using the WEKA tool.

PROCEDURE:

STEP 1: Install the WEKA tool.

EXPLORER - An environment for exploring data with WEKA

a) Click on the "Explorer" button to bring up the Explorer window.

b) Make sure the "Preprocess" tab is highlighted.

c) Open a file by clicking on "Open file..." and choosing a file with the ".arff"
extension from the "data" directory.

d) Attributes appear in the window below.

e) Click on an attribute to see its visualization on the right.

f) Click "Visualize All" to see them all.

1. PREPROCESSING:

Loading Data:

The first four buttons at the top of the preprocess section enable you to load data into
WEKA:


1. Open file.... Brings up a dialog box allowing you to browse for the data file on the local
file system.

2. Open URL.... Asks for a Uniform Resource Locator address for where the data is stored.

3. Open DB.... Reads data from a database. (Note that to make this work you might have to
edit the file weka/experiment/DatabaseUtils.props.)

4. Generate.... Enables you to generate artificial data from a variety of Data Generators.

2. Classification:

Selecting a Classifier

At the top of the Classify section is the Classifier box. This box has a text field that gives the
name of the currently selected classifier and its options. Clicking on the text box with the left
mouse button brings up a GenericObjectEditor dialog box, just the same as for filters, that you
can use to configure the options of the current classifier.

Test Options

There are four test modes:

Use training set: The classifier is evaluated on how well it predicts the class of
the instances it was trained on.

Supplied test set: The classifier is evaluated on how well it predicts the class of
a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you
to choose the file to test on.

Cross-validation: The classifier is evaluated by cross-validation, using the
number of folds that are entered in the Folds text field.

Percentage split: The classifier is evaluated on how well it predicts a certain
percentage of the data which is held out for testing. The amount of data held out depends
on the value entered in the % field.

3. Clustering:

Cluster Modes:

The Cluster mode box is used to choose what to cluster and how to
evaluate the results. The first three options are the same as for classification: Use training
set, Supplied test set and Percentage split.

4. Associating:

Setting Up

This panel contains schemes for learning association rules, and the
learners are chosen and configured in the same way as the clusterers, filters, and classifiers
in the other panels.

5. Selecting Attributes:

Searching and Evaluating

Attribute selection involves searching through all possible combinations of
attributes in the database to find which subset of attributes works best for prediction. To do
this, two objects must be set up: an attribute evaluator and a search method. The evaluator
determines what method is used to assign a worth to each subset of attributes. The search
method determines what style of search is performed.

6. Visualizing:

Study the arff file format

An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes
a list of instances sharing a set of attributes. ARFF files were developed by the Machine
Learning Project at the Department of Computer Science of The University of Waikato for
use with the Weka machine learning software.
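
A minimal sketch of an ARFF file, based on the weather data distributed with WEKA, looks like
this:

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes

Lines beginning with @ declare the relation and its attributes; each line after @data is one
instance, with comma-separated values given in attribute order.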

EX.NO:2 APPLY WEKA TOOL FOR DATA VALIDATION

AIM:

To validate data by applying the WEKA tool.

PROCEDURE:

STEP 1: PREPROCESSING

1. For preprocessing, first select and load the dataset.

2. Select the Filter option and apply the Resample filter, then see the results below.

3. Select another filter option and apply the discretization filter, then start the attribute
selection process by clicking the "Start" button.

STEP 2: CLASSIFY

PROCEDURE:

1. Load the dataset (Contact-lenses.arff) into WEKA.

2. Go to the Classify tab; in the left-hand navigation bar we can see different
classification algorithms under the trees section.

3. Select the Id3 algorithm and, under More options, select the output
entropy evaluation measures, then click the Start option.

4. Then we will get the classifier output, entropy values and Kappa statistic as
represented below.

STEP 3: ASSOCIATE
PROCEDURE:

1. Load the dataset (Breast-Cancer.arff) into the WEKA tool, then select the discretize filter
and apply it.

2. Go to the Associate tab; in the left-hand navigation bar we can see different
association algorithms.
3. Select the Apriori algorithm and click on the Start option.
4. Below we can see the rules generated, with different support and confidence values, for
the selected dataset.

STEP 4: CLUSTER
PROCEDURE:
1. Load the dataset (Iris.arff) into the WEKA tool.
2. Go to the Cluster tab; in the left-hand navigation bar we can see different clustering
algorithms.

3. Select the Simple K-Means algorithm and click on the Start option with the "Use training
set" test option enabled.
4. Then we will get the sum of squared errors, centroids, number of iterations and clustered
instances as represented below.

STEP 5: SELECT ATTRIBUTES
PROCEDURE:

1. Open the Weka GUI Chooser.
2. Click the "Explorer" button to launch the Explorer.
3. Open the dataset.
4. Click the "Select attributes" tab to access the feature selection methods.

STEP 6: VISUALIZE
PROCEDURE:

1. To visualize the dataset, go to the Visualize tab.


2. The tab shows the attributes plot matrix.

3. The dataset attributes are marked on the x-axis and y-axis while the instances are plotted.
4. The box with the x-axis attribute and y-axis attribute can be enlarged.

EX.NO:3 PLAN THE ARCHITECTURE FOR REAL TIME APPLICATION

AIM:
To draw the architecture for real-time sales analysis.

PROCEDURE:

STEP 1: CONCEPTUAL MODEL

A conceptual data model recognizes the highest-level relationships
between the different entities.

STEP 2: LOGICAL MODEL

A logical data model defines the information in as much structure
as possible, without regard to how it will be physically implemented in the
database.

The primary objective of logical data modeling is to document the
business data structures, processes, rules, and relationships in a single view - the
logical data model.

STEP 3: PHYSICAL MODEL

A physical data model describes how the model will be implemented in the database.
A physical database model shows all table structures, column names, data
types, constraints, primary keys, foreign keys, and relationships between tables.

The purpose of physical data modeling is the mapping of the logical data model to
the physical structures of the RDBMS system hosting the data warehouse.

This involves defining physical RDBMS structures, such as the tables and data types to
use when storing the information. It may also include the definition of new data structures
for enhancing query performance.
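
As an illustration, a minimal physical-model sketch in SQL (hypothetical table and column
names, not part of the original record) for a daily sales fact table and its dimensions might be:

-- dimension tables (hypothetical)
CREATE TABLE date_dim (date_id INT PRIMARY KEY, full_date DATE, year INT, month INT, day INT);
CREATE TABLE product_dim (product_id INT PRIMARY KEY, product_name VARCHAR(100));
CREATE TABLE store_dim (store_id INT PRIMARY KEY, store_name VARCHAR(100));

-- fact table: grain = one row per product, per store, per day
CREATE TABLE sales_fact (
    date_id INT NOT NULL REFERENCES date_dim (date_id),
    product_id INT NOT NULL REFERENCES product_dim (product_id),
    store_id INT NOT NULL REFERENCES store_dim (store_id),
    quantity_sold INT NOT NULL,
    sales_amount DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (date_id, product_id, store_id)
);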

EX.NO:4 WRITE THE QUERY FOR SCHEMA DEFINITION

AIM:

To write the schema definition queries for bike store products and sales.


PROCEDURE:
STEP 1:
Select your database

STEP 2:
Choose the 'file' option and open the dataset you want to query

STEP 3:
Run the query by clicking the 'Execute' button; the output will be shown below.

QUERY:
CREATE SCHEMA production;
go
CREATE SCHEMA sales;
go
-- create tables
CREATE TABLE production.categories (
category_id INT IDENTITY (1, 1) PRIMARY KEY,
category_name VARCHAR (255) NOT NULL
);
CREATE TABLE production.brands (
brand_id INT IDENTITY (1, 1) PRIMARY KEY,
brand_name VARCHAR (255) NOT NULL
);
CREATE TABLE production.products (
product_id INT IDENTITY (1, 1) PRIMARY KEY,
product_name VARCHAR (255) NOT NULL,
brand_id INT NOT NULL,
category_id INT NOT NULL,
model_year SMALLINT NOT NULL,
list_price DECIMAL (10, 2) NOT NULL,
FOREIGN KEY (category_id) REFERENCES production.categories (category_id) ON
DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (brand_id) REFERENCES production.brands (brand_id) ON DELETE
CASCADE ON UPDATE CASCADE
);
CREATE TABLE sales.customers (
customer_id INT IDENTITY (1, 1) PRIMARY KEY,

first_name VARCHAR (255) NOT NULL,
last_name VARCHAR (255) NOT NULL,
phone VARCHAR (25),
email VARCHAR (255) NOT NULL,
street VARCHAR (255),
city VARCHAR (50),
state VARCHAR (25),
zip_code VARCHAR (5)
);

CREATE TABLE sales.stores (


store_id INT IDENTITY (1, 1) PRIMARY KEY,
store_name VARCHAR (255) NOT NULL,
phone VARCHAR (25),
email VARCHAR (255),
street VARCHAR (255),
city VARCHAR (255),
state VARCHAR (10),
zip_code VARCHAR (5)
);
CREATE TABLE sales.staffs (
staff_id INT IDENTITY (1, 1) PRIMARY KEY,
first_name VARCHAR (50) NOT NULL,
last_name VARCHAR (50) NOT NULL,
email VARCHAR (255) NOT NULL UNIQUE,
phone VARCHAR (25),
active tinyint NOT NULL,
store_id INT NOT NULL,
manager_id INT,
FOREIGN KEY (store_id) REFERENCES sales.stores (store_id) ON DELETE CASCADE ON
UPDATE CASCADE,

FOREIGN KEY (manager_id) REFERENCES sales.staffs (staff_id) ON DELETE NO ACTION
ON UPDATE NO ACTION
);
CREATE TABLE sales.orders (
order_id INT IDENTITY (1, 1) PRIMARY KEY,
customer_id INT,
order_status tinyint NOT NULL,
-- Order status: 1 = Pending; 2 = Processing; 3 = Rejected; 4 = Completed
order_date DATE NOT NULL,
required_date DATE NOT NULL,
shipped_date DATE,
store_id INT NOT NULL,
staff_id INT NOT NULL,
FOREIGN KEY (customer_id) REFERENCES sales.customers (customer_id) ON DELETE
CASCADE ON UPDATE CASCADE,
FOREIGN KEY (store_id) REFERENCES sales.stores (store_id) ON DELETE CASCADE ON
UPDATE CASCADE,
FOREIGN KEY (staff_id) REFERENCES sales.staffs (staff_id) ON DELETE NO ACTION ON
UPDATE NO ACTION
);
CREATE TABLE sales.order_items (
order_id INT,
item_id INT,
product_id INT NOT NULL,
quantity INT NOT NULL,
list_price DECIMAL (10, 2) NOT NULL,
discount DECIMAL (4, 2) NOT NULL DEFAULT 0,
PRIMARY KEY (order_id, item_id),
FOREIGN KEY (order_id) REFERENCES sales.orders (order_id) ON DELETE CASCADE
ON UPDATE CASCADE,
FOREIGN KEY (product_id) REFERENCES production.products (product_id) ON DELETE
CASCADE ON UPDATE CASCADE
);

CREATE TABLE production.stocks (
store_id INT,
product_id INT,
quantity INT,
PRIMARY KEY (store_id, product_id),
FOREIGN KEY (store_id) REFERENCES sales.stores (store_id) ON DELETE CASCADE ON
UPDATE CASCADE,
FOREIGN KEY (product_id) REFERENCES production.products (product_id) ON DELETE
CASCADE ON UPDATE CASCADE
);

STEP 4:
Now refresh the database to see the newly created schemas and tables.
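
As a quick check, a query such as the following (a sketch against the tables created above) lists
each product together with its brand and category:

SELECT p.product_name, b.brand_name, c.category_name, p.list_price
FROM production.products p
JOIN production.brands b ON p.brand_id = b.brand_id
JOIN production.categories c ON p.category_id = c.category_id
ORDER BY p.product_name;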

Ex.No.5 DESIGN DATA WAREHOUSE FOR REAL TIME APPLICATIONS

AIM:

To design a data warehouse for a bike store application.


PROCEDURE:
Step 1: Connect to MS SQL Server (or any other server) using the Connect option in Tableau.

Step 2: Select the database from the list of databases.

Step 3: Drag and drop the tables from the database for OLAP operations

Step 4: In our case we are selecting the Stores, Orders and Staffs tables from the sales schema.

Step 5: Set the relationship between the tables.

Step 6: Dimensions and measures from the selected tables are displayed in the Data tab.

Step 7: Create a roll-up operation: Manager Id -> Store Id -> Staff Id.

Step 8: Drag and drop the columns and rows to get the analysis.
Bar Chart

Heat Map

Bar chart

Table Format

Step 9: In the Analytics tab, click Totals to show totals for the discrete dimensions.

EX.NO:6 ANALYSE THE DIMENSIONAL MODELING

AIM:

To apply the concept of dimensional modeling

PROCEDURE:

1. SNOWFLAKE SCHEMA:

The snowflake schema is a variant of the star schema. The centralized fact
table is connected to multiple dimensions.

In the snowflake schema, dimensions are present in normalized form in
multiple related tables.

The snowflake structure materializes when the dimensions of a star schema
are detailed and highly structured, having several levels of relationship, and the child
tables have multiple parent tables.
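
A minimal DDL sketch (hypothetical table and column names) of a snowflaked product
dimension, normalized into separate brand and category tables:

CREATE TABLE dim_category (category_id INT PRIMARY KEY, category_name VARCHAR(50));
CREATE TABLE dim_brand (brand_id INT PRIMARY KEY, brand_name VARCHAR(50));

-- the product dimension is normalized: it references the outer dimension tables
CREATE TABLE dim_product (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    category_id INT REFERENCES dim_category (category_id),
    brand_id INT REFERENCES dim_brand (brand_id)
);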

2. FACT CONSTELLATION:

A fact constellation means two or more fact tables sharing one or more
dimensions. It is also called a galaxy schema.

A fact constellation schema describes a logical structure of a data warehouse or
data mart. It can be designed with a collection of de-normalized fact tables and shared,
conformed dimension tables.

A fact constellation schema can be implemented between aggregate fact tables
or by decomposing a complex fact table into independent, simpler fact tables.
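
Continuing the sketch above, a fact constellation simply adds two or more fact tables that
share the same dimensions (again hypothetical names; a simple date dimension is assumed):

CREATE TABLE dim_date (date_id INT PRIMARY KEY, year INT, month INT, day INT);

-- two fact tables sharing dim_date and dim_product form the constellation (galaxy)
CREATE TABLE fact_sales (
    date_id INT REFERENCES dim_date (date_id),
    product_id INT REFERENCES dim_product (product_id),
    units_sold INT,
    sales_amount DECIMAL(10, 2)
);
CREATE TABLE fact_shipping (
    date_id INT REFERENCES dim_date (date_id),
    product_id INT REFERENCES dim_product (product_id),
    units_shipped INT,
    shipping_cost DECIMAL(10, 2)
);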

EX.NO:7 CASE STUDY USING OLAP

Consider a sales data cube that includes dimensions such as
time (year, month, day), product (category, subcategory) and region (country, state).

1. Define a query to roll up sales data from the lowest level of granularity (e.g., daily sales)
to higher levels (e.g., monthly or yearly sales) for a specific product category and region.

2. Write a query to drill down into the sales data for a particular month to analyze daily
sales performance for a specific product subcategory in a specific region.

ANSWER:
CREATE DATABASE sales_cube_database;
USE sales_cube_database;

-- Create time dimension table

CREATE TABLE time_dimension (


time_id INT PRIMARY KEY AUTO_INCREMENT,
year INT,
month INT,
day INT

);
-- Create product dimension table
CREATE TABLE product_dimension (
product_id INT PRIMARY KEY AUTO_INCREMENT,
category VARCHAR(50),

subcategory VARCHAR(50)
);

-- Create region dimension table


CREATE TABLE region_dimension (

region_id INT PRIMARY KEY AUTO_INCREMENT,


country VARCHAR(50),
state VARCHAR(50)
);

-- Insert values into time dimension


INSERT INTO time_dimension (year, month, day) VALUES
(2024, 1, 1),
(2024, 1, 2),

(2024, 1, 3),
(2024, 1, 4),
(2024, 2, 1),
(2024, 2, 2),
(2024, 2, 3),

(2024, 2, 4);
SELECT * FROM time_dimension;

-- Insert values into product dimension


INSERT INTO product_dimension (category, subcategory) VALUES
('Electronics', 'Smartphones'),
('Electronics', 'Laptops'),

('Clothing', 'T-shirts');
SELECT * FROM product_dimension;

-- Insert values into region dimension
INSERT INTO region_dimension (country, state) VALUES
('USA', 'California'),
('USA', 'New York');

SELECT * FROM region_dimension;

CREATE TABLE sales_data_cube (


sale_id INT PRIMARY KEY,
time_id INT,
product_id INT,
region_id INT,
sales_amount DECIMAL(10, 2)
);

INSERT INTO sales_data_cube (sale_id, time_id, product_id, region_id, sales_amount)

VALUES
(1, 1, 1, 1, 1500.50),
(2, 2, 2, 1, 2500.75),
(3, 3, 3, 1, 500.25),
(4, 4, 1, 2, 1800.30),
(5, 5, 2, 2, 2800.90),

(6, 6, 3, 2, 600.45),
(7, 7, 1, 1, 1600.60),
(8, 8, 2, 1, 2700.80),

(9, 1, 3, 1, 550.75),
(10, 2, 1, 2, 1900.35),
(11, 3, 2, 2, 2900.95),
(12, 4, 3, 2, 620.50);
SELECT * FROM sales_data_cube;

1. Roll-up Query:
To roll up sales data from daily to monthly or yearly for a specific product category and
region:

SELECT

t.year,
t.month,
SUM(s.sales_amount) AS total_sales
FROM
sales_data_cube s

JOIN
time_dimension t ON s.time_id = t.time_id
JOIN

product_dimension p ON s.product_id = p.product_id


JOIN
region_dimension r ON s.region_id = r.region_id
WHERE
p.category = 'Electronics'

AND r.country = 'USA'


AND r.state = 'California'
GROUP BY
t.year,
t.month;
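
Rolling up further to yearly totals only requires grouping by t.year alone; alternatively, MySQL's
WITH ROLLUP modifier returns the monthly rows plus yearly and grand-total subtotal rows in a
single pass (a sketch using the same tables as above):

SELECT
    t.year,
    t.month,
    SUM(s.sales_amount) AS total_sales
FROM
    sales_data_cube s
JOIN
    time_dimension t ON s.time_id = t.time_id
JOIN
    product_dimension p ON s.product_id = p.product_id
JOIN
    region_dimension r ON s.region_id = r.region_id
WHERE
    p.category = 'Electronics'
    AND r.country = 'USA'
    AND r.state = 'California'
GROUP BY
    t.year,
    t.month WITH ROLLUP;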

2. Drill-down Query:
To drill down into sales data for a particular month and analyze daily sales performance for a
specific product subcategory in a specific region:

SELECT
t.year,
t.month,
t.day,

s.sales_amount
FROM
sales_data_cube s

JOIN
time_dimension t ON s.time_id = t.time_id
JOIN
product_dimension p ON s.product_id = p.product_id
JOIN

region_dimension r ON s.region_id = r.region_id


WHERE
t.year = 2024
AND t.month = 1
AND p.subcategory = 'Smartphones'

AND r.country = 'USA'


AND r.state = 'California';
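
If several sales rows fall on the same day, a grouped variant (a sketch on the same tables)
returns one total per day instead of one row per sale:

SELECT
    t.year,
    t.month,
    t.day,
    SUM(s.sales_amount) AS daily_sales
FROM
    sales_data_cube s
JOIN
    time_dimension t ON s.time_id = t.time_id
JOIN
    product_dimension p ON s.product_id = p.product_id
JOIN
    region_dimension r ON s.region_id = r.region_id
WHERE
    t.year = 2024
    AND t.month = 1
    AND p.subcategory = 'Smartphones'
    AND r.country = 'USA'
    AND r.state = 'California'
GROUP BY
    t.year,
    t.month,
    t.day;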

Imagine you are tasked with creating a data cube to analyse employee performance data
for the human resources department. The data cube should have dimensions for time
(year, month, day), employee (department, job title) and performance
metrics (sales, revenue, customer satisfaction score).

1. Develop SQL queries to create the necessary tables for storing the employee performance
data and dimensions.
2. Populate the tables with sample data representing employee performance metrics over
multiple years, different departments and job titles, and various performance metrics.
3. Write SQL queries to aggregate the employee performance data across different
dimensions, such as the total sales revenue by year, month, day, department and job title.
4. Write an SQL query to slice the employee performance data for a specific department and
month, showing the total sales revenue for each job title within the department.
5. Write an SQL query to dice the employee performance data, showing the total sales
revenue for a specific department and job title combination across different months.
ANSWER:
1.CREATE DATABASE employee_performance_db;

USE employee_performance_db;

Create time dimension table


CREATE TABLE time_dimension (
time_id INT PRIMARY KEY AUTO_INCREMENT,

year INT,
month INT,
day INT
);

Create employee dimension table


CREATE TABLE employee_dimension (
employee_id INT PRIMARY KEY AUTO_INCREMENT,
department VARCHAR(50),
job_title VARCHAR(50)
);

Create performance metrics table
CREATE TABLE performance_metrics (

metric_id INT PRIMARY KEY AUTO_INCREMENT,


sales DECIMAL(10, 2),
revenue DECIMAL(10, 2),
customer_satisfaction_score DECIMAL(5, 2)
);

Insert values into time dimension


INSERT INTO time_dimension (year, month, day) VALUES
(2023, 1, 1),
(2023, 1, 2),

(2023, 1, 3),
(2023, 1, 4),
(2023, 2, 1),
(2023, 2, 2),
(2023, 2, 3),

(2023, 2, 4);
SELECT * FROM time_dimension;

Insert values into employee dimension
INSERT INTO employee_dimension (department, job_title) VALUES
('Sales', 'Sales Associate'),

('Sales', 'Sales Manager'),


('Finance', 'Financial Analyst'),
('HR', 'HR Manager');
SELECT * FROM employee_dimension;

Insert values into performance metrics table


INSERT INTO performance_metrics (sales, revenue, customer_satisfaction_score) VALUES

(5000, 75000, 4.5),


(7000, 85000, 4.8),
(6000, 80000, 4.6),
(4000, 60000, 4.3);
SELECT * FROM performance_metrics;

3.SELECT
t.year,
t.month,
SUM(p.sales) AS total_sales,

SUM(p.revenue) AS total_revenue,
AVG(p.customer_satisfaction_score) AS avg_customer_satisfaction
FROM

time_dimension t
JOIN
performance_metrics p ON t.time_id = p.metric_id
GROUP BY
t.year,

t.month;

4.SELECT
e.department,
e.job_title,
t.month,

SUM(p.revenue) AS total_revenue
FROM
employee_dimension e
JOIN

performance_metrics p ON e.employee_id = p.metric_id
JOIN
time_dimension t ON p.metric_id = t.time_id

WHERE
e.department = 'Sales'
AND t.month = 1
GROUP BY
e.department,

e.job_title,
t.month;

5.SELECT
e.department,

e.job_title,
t.month,
SUM(p.revenue) AS total_revenue
FROM
employee_dimension e

JOIN
performance_metrics p ON e.employee_id = p.metric_id
JOIN

time_dimension t ON p.metric_id = t.time_id
WHERE
e.department = 'Sales'

AND e.job_title = 'Sales Manager'


GROUP BY
e.department,
e.job_title,
t.month;

EX.NO:8 CASE STUDY USING OLTP

-- Use Database
USE CSE8582;

-- Create Employee table


CREATE TABLE IF NOT EXISTS Employee (
emp_id INT AUTO_INCREMENT PRIMARY KEY,
emp_name VARCHAR(50),
salary DECIMAL(10, 2),
department VARCHAR(50),
age INT
);

-- Create Company table


CREATE TABLE IF NOT EXISTS Company (
company_id INT AUTO_INCREMENT PRIMARY KEY,
company_name VARCHAR(50),
employee_id INT
);

-- Insert sample data into Employee table


INSERT INTO Employee (emp_name, salary, department, age)
VALUES
('John Doe', 50000.00, 'IT', 30),

('Jane Smith', 60000.00, 'HR', 35),
('Michael Johnson', 55000.00, 'Finance', 40),
('Emily Davis', 52000.00, 'Marketing', 28),
('Chris Wilson', 48000.00, 'IT', 32),
('Sarah Brown', 65000.00, 'Finance', 38),
('Kevin Lee', 58000.00, 'Marketing', 27),
('Amanda Miller', 53000.00, 'HR', 33),
('Robert Taylor', 70000.00, 'IT', 45),
('Jennifer Anderson', 62000.00, 'Finance', 42),
('Daniel Thomas', 54000.00, 'Marketing', 29),
('Jessica Martinez', 57000.00, 'HR', 31),
('David Garcia', 67000.00, 'IT', 36),
('Ashley Hernandez', 69000.00, 'Finance', 39),
('Matthew Lopez', 51000.00, 'Marketing', 26);

-- Insert sample data into Company table


INSERT INTO Company (company_name, employee_id)
VALUES
('Company A', 1),
('Company B', 5),
('Company C', 9),
('Company D', 13);

SELECT * FROM Company;

SELECT * FROM Employee;

Inner Join

SELECT *
FROM Employee
INNER JOIN Company ON Employee.emp_id = Company.employee_id;

Right Join

SELECT *
FROM Employee
RIGHT JOIN Company ON Employee.emp_id = Company.employee_id;

Left Join

SELECT *
FROM Employee
LEFT JOIN Company ON Employee.emp_id = Company.employee_id;

Self Join

SELECT e1.emp_name AS employee_name, e2.emp_name AS colleague_name


FROM Employee e1
INNER JOIN Employee e2 ON e1.department = e2.department AND e1.emp_id != e2.emp_id;

Group By

SELECT department, AVG(salary) AS avg_salary


FROM Employee
GROUP BY department;

Having

SELECT department, AVG(salary) AS avg_salary

FROM Employee
GROUP BY department
HAVING AVG(salary) > 55000;

-- Having Count

SELECT department, COUNT(*) AS employee_count


FROM Employee
GROUP BY department
HAVING COUNT(*) > 2;

Roll Up

SELECT department, AVG(salary) AS avg_salary


FROM Employee
GROUP BY department WITH ROLLUP;

Subquery
SELECT emp_id, emp_name, salary, department
FROM Employee
WHERE salary > (SELECT AVG(salary) FROM Employee e2 WHERE e2.department =
Employee.department);

Cross Join

SELECT *
FROM Employee
CROSS JOIN Company;

EX.NO:9 IMPLEMENTATION OF WAREHOUSE TESTING

AIM:
To implement warehouse testing.

PROCEDURE:

QUERYSURGE
QuerySurge is a smart data testing solution that automates the data
validation and ETL testing of big data, data warehouses, business intelligence reports and
enterprise applications, with full DevOps functionality for continuous testing.
By analyzing and pinpointing any differences in the data, QuerySurge
ensures that the data extracted from source systems remains intact in the target and
complies with transformation requirements. QuerySurge is an essential asset to every data
testing process.
TERMS:
Query Pair - A pair of SQL queries: one query that retrieves data from a
source file or database, and another that retrieves data from a target database,
data warehouse, or data mart (see the sketch after this list).
Agent - Performs the actual query tasks. Agents execute queries against source
and target data stores and return the results to the QuerySurge database.
Query Snippet - A reusable piece of SQL code that can be embedded in one or
more queries. The purpose of a snippet is to minimize the number of manual changes needed
in different SQL calls when they contain the same code.
Query Wizard - A tool that allows you to generate Query Pairs automatically,
requiring no SQL coding. It is a fast and easy way to create Query Pairs both for manual
testers who do not have SQL skills and for testers who are skilled at SQL and want to speed up
test authoring. The Query Wizard generates tests that can cover about 80% of all data in a
data warehouse automatically.
Command Line Integration - Provides the ability to schedule test suites
to run using Windows Task Scheduler or integrate with a continuous build system.
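
As an illustration of a Query Pair, the two queries below are a hypothetical sketch (the table
names are assumed, not taken from QuerySurge); the source-side query runs against the staging
database, the target-side query runs against the data warehouse, and QuerySurge compares the
two result sets row by row:

-- source-side query (staging database)
SELECT customer_id, SUM(order_amount) AS total_amount
FROM staging_orders
GROUP BY customer_id;

-- target-side query (data warehouse)
SELECT customer_id, total_amount
FROM dw_customer_order_totals;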
INSTALLATION:
1. On the installation machine, launch a supported browser.
2. In the URL field, type 'http://localhost/QuerySurge/'.
3. QuerySurge will launch.
4. If your QuerySurge install is local, enter your credentials or use the
credentials admin/admin. If you are using a Cloud trial, use the credentials
clouduser/clouduser. Click Login.
Creating connections to your systems:

When you create a QuerySurge Connection, the Connection Wizard will
guide you through the process. Different types of QuerySurge connections require different
types of information.
For connections that the Connection Wizard does not handle explicitly, you
can set up a connection using QuerySurge's Connection Extensibility feature. Typically, you
will need the following information (check with a DBA or other knowledgeable resource in
your organization, or consult the JDBC driver documentation); the details depend on the
Connection you plan to set up:
• JDBC Driver Class
• JDBC Connection URL (which may contain some of the common items below)
o Server Name or IP address of the Server (e.g. db.myserver.com, or
192.168.0.255)
o The port for your database
• Database login credentials (Username and Password)

Launch the Connection Wizard:


1. Log into QuerySurge as an Admin user.

2. To configure a Connection, select Configuration > Connection in the
Administrative View tree (at the left).

3. Click on the Add button at the bottom left of the page to launch the
Connection Wizard.

4. Click the Next button.
5. Provide a name for your connection in the Connection Name field.
6. Select your Data Source from the dropdown
7. Provide the appropriate Driver Class for the JDBC driver you are using.

8. Provide the connection information to your database. This includes the Connection URL
(which may contain the Server name or IP address and the port), the login credentials, and
an optional Test Query that will run to verify the Connection details.
9. Click the Next button.
10. Click the Test Connection button if you entered a Test Query
11. Click the Save button.

TESTING:
1. Setting up Connections :
Users configure QuerySurge to connect to their various data sources,
such as databases (Oracle, SQL Server, MySQL, etc.), data warehouses (Teradata, Redshift,
Snowflake, etc.), flat files (CSV, Excel, etc.), and big data platforms (Hadoop, Hive, etc.).

2. Designing Tests:
Once connections are established, users create test cases within QuerySurge.
These test cases define the data queries to be executed against the source and target data,
as well as any assertions or validations to be performed on the results.

3. Executing Tests:
Users can run individual tests or test suites within QuerySurge. During test
execution, QuerySurge compares the data retrieved from the source against the data
retrieved from the target (e.g., a data warehouse) based on the defined test cases.

4. Analyzing Results:
QuerySurge provides detailed reports and dashboards showing the results of
the test execution. It highlights any discrepancies or failures between the source and target
data, allowing users to identify and investigate data quality issues.
5. Iterative Testing:

Users can refine their test cases and rerun tests as needed to ensure data
accuracy and integrity throughout the development lifecycle. QuerySurge supports
automated scheduling of tests, allowing for continuous integration and regression testing.

6. Integrations:
QuerySurge can integrate with various continuous integration and continuous
delivery (CI/CD) tools such as Jenkins, Bamboo, and TeamCity, enabling seamless
incorporation of data testing into the software development pipeline.

7. Point-to-Point Testing.
The QuerySurge ETL testing process mimics the ETL development process by testing
data from point-to-point along the data warehouse lifecycle and can provide 100% coverage
of your data mappings.

8. Test Across Platforms


Test across 200+ data stores. QuerySurge supports connections to data warehouses and
databases, big data and NoSQL data stores, files and APIs, collaboration software, CRMs and
ERPs, and accounting, marketing and ecommerce software.

QuerySurge will help to:


- Continuously detect data issues in the delivery pipeline
- Dramatically increase data validation coverage
- Leverage analytics to optimize your critical data
- Improve your data quality at speed
- Provide a huge ROI

