Department of CE/IT & AIDS Engineering
A.Y. 2024-25
LAB MANUAL
SUB: Data Mining and Warehousing
Experiment No.: 1
Title:
Create an Employee Table with the help of Data Mining Tool WEKA.
Objective:
To create and visualize an Employee dataset using WEKA and understand how the ARFF file
format is used to represent structured data for data mining.
Theory:
WEKA (Waikato Environment for Knowledge Analysis) is a powerful suite of machine
learning software written in Java, developed at the University of Waikato, New Zealand. It
supports the standard data mining tasks of data preprocessing, classification, regression,
clustering, association rule mining, and visualization.
In WEKA, data is typically stored in ARFF (Attribute-Relation File Format) files. An ARFF
file consists of a header section that defines the attributes (fields) and a data section that
contains the records (instances).
ARFF File Format for Employee Table:
Below is a sample ARFF file content for the Employee table.
@relation employee
@attribute emp_id numeric
@attribute name string
@attribute age numeric
@attribute gender {Male, Female}
@attribute department {HR, IT, Sales, Finance}
@attribute salary numeric
@data
101, 'John', 30, Male, IT, 55000
102, 'Sara', 25, Female, HR, 48000
103, 'Mike', 35, Male, Sales, 62000
104, 'Anna', 28, Female, Finance, 50000
105, 'David', 40, Male, IT, 70000
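Sample Code (loading the ARFF via the WEKA Java API, optional):
The sketch below is a minimal example, assuming the content above is saved as employee.arff in the working directory; it loads the file and prints a per-attribute summary.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadEmployee {
    public static void main(String[] args) throws Exception {
        // Read the ARFF file into WEKA's in-memory dataset representation
        Instances data = DataSource.read("employee.arff");
        System.out.println("Instances loaded: " + data.numInstances());
        // Per-attribute summary (type, missing, distinct values, etc.)
        System.out.println(data.toSummaryString());
    }
}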
Procedure:
1. Open the WEKA Explorer.
2. Go to the Preprocess tab.
3. Click on Open file… and browse to your .arff file containing the Employee data.
4. Once loaded, you will see attributes like emp_id, name, age, gender, etc.
5. You can now:
o View the summary of data.
o Apply filters to preprocess data.
o Proceed with classification or clustering, if needed.
Result:
The Employee table was successfully created and loaded in WEKA using an ARFF file. The
dataset is now ready for data mining operations.
Viva Questions:
1. What is the full form of ARFF?
2. What types of attributes can WEKA handle?
3. How do you define categorical vs numerical attributes in ARFF?
4. What is the purpose of WEKA?
5. Can WEKA be used for real-time data mining?
Experiment No.: 2
Title:
Create a Weather Table with the help of Data Mining Tool WEKA.
Objective:
To create and visualize a Weather dataset using WEKA and understand how categorical and
numeric attributes are defined in the ARFF format.
Theory:
WEKA is a Java-based open-source tool used for data preprocessing, classification,
regression, clustering, and association rules. It uses ARFF (Attribute-Relation File Format) to
load datasets.
The Weather dataset is a classic dataset used in data mining for classification problems (e.g.,
predicting whether to play based on weather conditions). The dataset includes the attributes
outlook, temperature, humidity, and windy, plus the target class play.
ARFF File Format for Weather Table:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny, 85, 85, FALSE, no
sunny, 80, 90, TRUE, no
overcast, 83, 78, FALSE, yes
rainy, 70, 96, FALSE, yes
rainy, 68, 80, FALSE, yes
rainy, 65, 70, TRUE, no
overcast, 64, 65, TRUE, yes
sunny, 72, 95, FALSE, no
sunny, 69, 70, FALSE, yes
rainy, 75, 80, FALSE, yes
sunny, 75, 70, TRUE, yes
overcast, 72, 90, TRUE, yes
overcast, 81, 75, FALSE, yes
rainy, 71, 91, TRUE, no
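Sample Code (checking the class distribution via the WEKA Java API, optional):
A minimal sketch, assuming the content above is saved as weather.arff; it counts how many instances carry each value of the target class play.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadWeather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1); // play is the last attribute
        // Tally instances per class value (yes/no)
        int[] counts = new int[data.classAttribute().numValues()];
        for (int i = 0; i < data.numInstances(); i++) {
            counts[(int) data.instance(i).classValue()]++;
        }
        for (int j = 0; j < counts.length; j++) {
            System.out.println(data.classAttribute().value(j) + ": " + counts[j]);
        }
    }
}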
Procedure:
1. Open WEKA Explorer.
2. Navigate to the Preprocess tab.
3. Click on Open file… and select the .arff file containing the Weather data.
4. The dataset will be loaded, and attributes such as outlook, temperature, humidity,
windy, and play will be visible.
5. Use the interface to:
o Explore statistics.
o Visualize attribute distributions.
o Apply classification/clustering if required.
Result:
The Weather table was successfully created and visualized using WEKA. The dataset is now
ready for classification or other data mining operations.
Viva Questions:
1. What are the different data types supported by WEKA?
2. Explain the structure of an ARFF file.
3. What is the use of the “@relation” keyword?
4. How do you represent categorical values in ARFF?
5. Why is the Weather dataset commonly used in classification problems?
Experiment No.: 3
Title: Apply Pre-Processing techniques to the training data set of Weather Table
Objective:
To apply various data pre-processing techniques on the weather dataset to improve data
quality and prepare it for further analysis.
Software / Tool Used:
WEKA (Waikato Environment for Knowledge Analysis)
Theory:
Data preprocessing is an essential step in data mining that involves cleaning and transforming
raw data into an understandable format. It includes the following techniques:
Handling Missing Values
Normalization/Standardization
Discretization
Attribute Removal
Reordering Attributes
WEKA provides a GUI to perform these preprocessing tasks using its Preprocess tab.
Dataset Used:
Weather.arff – A small dataset included with WEKA having the attributes outlook,
temperature, humidity, windy, and play.
Procedure:
1. Open WEKA GUI Chooser → Select Explorer.
2. Click on Open file and load the weather.arff dataset.
3. Under the Preprocess tab:
o To remove an attribute: Select the attribute → Click Remove.
o To normalize: Click Filter → Choose
weka.filters.unsupervised.attribute.Normalize.
o To standardize: Click Filter → Choose
weka.filters.unsupervised.attribute.Standardize.
o To replace missing values: Choose
weka.filters.unsupervised.attribute.ReplaceMissingValues.
o To discretize numeric attributes: Choose
weka.filters.unsupervised.attribute.Discretize.
4. Click Apply after selecting any filter to view changes in the dataset.
Sample Code (for scripting in WEKA command line, optional):
java weka.filters.unsupervised.attribute.ReplaceMissingValues -i weather.arff -o weather_cleaned.arff
java weka.filters.unsupervised.attribute.Normalize -i weather_cleaned.arff -o weather_normalized.arff
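The same two steps can also be written against the WEKA Java API. A minimal sketch, assuming weather.arff is in the working directory (the output file name is arbitrary):
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class PreprocessWeather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");

        // Step 1: replace missing values with attribute means/modes
        ReplaceMissingValues rmv = new ReplaceMissingValues();
        rmv.setInputFormat(data);
        Instances cleaned = Filter.useFilter(data, rmv);

        // Step 2: scale all numeric attributes to [0, 1]
        Normalize norm = new Normalize();
        norm.setInputFormat(cleaned);
        Instances normalized = Filter.useFilter(cleaned, norm);

        // Write the transformed dataset to a new ARFF file
        ArffSaver saver = new ArffSaver();
        saver.setInstances(normalized);
        saver.setFile(new File("weather_normalized.arff"));
        saver.writeBatch();
    }
}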
Result:
Various preprocessing techniques were successfully applied to the weather dataset using
WEKA. The dataset is now cleaned, normalized, and ready for further data mining tasks.
Conclusion:
Pre-processing is a crucial step that improves the quality of data and helps in extracting
meaningful patterns during data mining. Using WEKA, these operations can be performed
efficiently.
Viva Questions:
1. What is the importance of data pre-processing in data mining?
2. What is the difference between normalization and standardization?
3. How does WEKA handle missing values?
4. What is discretization, and when is it used?
5. Why might we choose to remove or reorder attributes?
6. Can you name a few filters available in WEKA for preprocessing?
Experiment No.: 4
Title: Apply Pre-Processing techniques to the training data set of Employee Table
Objective:
To apply data pre-processing techniques on an employee dataset using the WEKA tool in
order to clean and prepare the data for further analysis.
Software / Tool Used:
WEKA (Waikato Environment for Knowledge Analysis)
Theory:
Data preprocessing is a data mining technique used to transform raw data into a clean
dataset. Key preprocessing tasks include:
Handling Missing Values
Normalization/Standardization
Discretization
Removing/Reordering Attributes
Data Type Conversion
WEKA provides built-in filters under the Preprocess tab to implement these techniques
easily.
Dataset Used:
Employee.arff (user-created or sample dataset with attributes like: EmpID, Name, Age,
Department, Salary, Experience, City)
Procedure:
1. Launch WEKA GUI Chooser → Open Explorer.
2. Load the dataset employee.arff using the Open file option.
3. Under the Preprocess tab:
o To handle missing values: Use filter
weka.filters.unsupervised.attribute.ReplaceMissingValues.
o To normalize numeric attributes: Use filter
weka.filters.unsupervised.attribute.Normalize.
o To standardize: Use weka.filters.unsupervised.attribute.Standardize.
o To discretize attributes like Age or Salary: Use
weka.filters.unsupervised.attribute.Discretize.
o To remove unnecessary attributes: Select the attribute → Click Remove.
o To reorder attributes: Use weka.filters.unsupervised.attribute.Reorder.
4. Click Apply after each filter to observe the transformed data.
Sample WEKA Command Line (Optional):
java weka.filters.unsupervised.attribute.ReplaceMissingValues -i employee.arff -o employee_cleaned.arff
java weka.filters.unsupervised.attribute.Normalize -i employee_cleaned.arff -o employee_normalized.arff
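As a concrete illustration of the Discretize step, attributes such as Age and Salary can be binned through the Java API as well. A minimal sketch, assuming Age and Salary are the 3rd and 5th attributes of employee.arff as in the dataset description above (adjust the indices to match your file):
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeEmployee {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("employee.arff");

        // Bin Age (attribute 3) and Salary (attribute 5) into 3 equal-width intervals
        Discretize disc = new Discretize();
        disc.setAttributeIndices("3,5"); // 1-based indices, as in the Explorer
        disc.setBins(3);
        disc.setInputFormat(data);
        Instances binned = Filter.useFilter(data, disc);

        System.out.println(binned);
    }
}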
Result:
Data preprocessing techniques were successfully applied to the Employee dataset using
WEKA. The dataset is now cleaned, normalized, and prepared for analysis and modeling.
Conclusion:
Preprocessing ensures that the dataset is clean, consistent, and suitable for applying data
mining algorithms. WEKA simplifies preprocessing through its graphical interface and
filters.
Viva Questions:
1. Why is data preprocessing essential in data mining?
2. What are common preprocessing techniques?
3. How can you handle missing values in WEKA?
4. What is the difference between normalization and standardization?
5. Why would you discretize a continuous attribute like salary?
6. How can you remove or reorder attributes in WEKA?
Experiment No.: 5
Title: Normalize Weather Table data using Knowledge Flow
Objective:
To normalize the Weather dataset using the Knowledge Flow interface of WEKA.
Software / Tool Used:
WEKA (Knowledge Flow Interface)
Theory:
Normalization is the process of scaling numeric data to a standard range, typically [0, 1],
which improves the performance of many machine learning algorithms.
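WEKA's Normalize filter performs min-max scaling: each value x of a numeric attribute is mapped to x' = (x - min) / (max - min). For example, using the temperature values from the Weather dataset of Experiment 2 (minimum 64, maximum 85), a temperature of 70 becomes (70 - 64) / (85 - 64) ≈ 0.286.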
Knowledge Flow in WEKA is a visual programming environment that allows users to design
and execute data flows using graphical components instead of the command line or the
Explorer interface.
Advantages of Knowledge Flow:
Visual representation of data processing
Modular design of data flow
Easy experimentation and customization
Dataset Used:
Weather.arff (default dataset in WEKA containing the attributes outlook, temperature,
humidity, windy, and play)
Procedure:
1. Open WEKA GUI Chooser → Click on Knowledge Flow.
2. From the left panel, drag and drop the following components onto the canvas:
o ArffLoader (to load the dataset)
o Normalize (filter for normalization)
o DataViewer (to view results)
3. Connect the components:
o ArffLoader → Normalize (via dataSet)
o Normalize → DataViewer (via dataSet)
4. Double-click on ArffLoader → Load weather.arff file.
5. Double-click Normalize → Set options if needed.
6. Click Start Loading on ArffLoader.
7. Click Play button on the toolbar to execute the flow.
8. Double-click DataViewer to view the normalized dataset.
Result:
The Weather dataset was successfully normalized using the Knowledge Flow interface in
WEKA. Numeric attributes were scaled to the range [0, 1].
Conclusion:
Knowledge Flow provides a visual and modular way to perform data preprocessing. Using it,
the Weather dataset was normalized efficiently, preparing it for further data mining tasks.
Viva Questions:
1. What is normalization and why is it used?
2. What are the different ways to normalize data?
3. What is the range of data after normalization using WEKA?
4. How does the Knowledge Flow interface differ from Explorer in WEKA?
5. Which filter is used for normalization in Knowledge Flow?
6. Can non-numeric data be normalized? Why or why not?
Experiment No.: 6
Title: Normalize Employee Table data using Knowledge Flow
Objective:
To normalize numeric attributes of the Employee dataset using the Knowledge Flow
interface in WEKA.
Software / Tool Used:
WEKA (Knowledge Flow Interface)
Theory:
Normalization is a data preprocessing technique used to scale numeric values to a common
range, typically between 0 and 1. It is essential for improving the performance of algorithms
that rely on distance metrics or gradient-based optimization.
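For example, in a Euclidean distance computed on the raw Employee data, a salary difference of 5,000 would dwarf an age difference of 10, so a distance-based learner would effectively ignore Age; after normalization, both attributes contribute on the same [0, 1] scale.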
Knowledge Flow in WEKA offers a graphical environment to design, visualize, and
execute data processing workflows. It supports drag-and-drop components and clear flow
connections.
Dataset Used:
Employee.arff (user-defined dataset with attributes like EmpID, Name, Age, Salary,
Experience, City, etc.)
Procedure:
1. Launch WEKA GUI Chooser → Select Knowledge Flow.
2. From the left component panel, drag and drop the following components:
o ArffLoader (to load the dataset)
o Normalize (to apply normalization)
o DataViewer (to view results)
3. Connect components:
o ArffLoader → Normalize using dataSet
o Normalize → DataViewer using dataSet
4. Double-click on ArffLoader → Load the file employee.arff.
5. Double-click on Normalize to configure settings if needed (optional).
6. Click Start Loading on ArffLoader.
7. Click the Run (Play) button on the toolbar.
8. Double-click on DataViewer to view the normalized data output.
Result:
The Employee dataset was successfully normalized using the Knowledge Flow interface.
Numeric attributes like Age, Salary, and Experience were scaled to a uniform range.
Conclusion:
Normalization helps eliminate the bias caused by different attribute ranges. Knowledge Flow
in WEKA provides an intuitive, visual way to apply normalization on datasets like
Employee.arff.
Viva Questions:
1. What is the purpose of normalization in data preprocessing?
2. What is the difference between normalization and standardization?
3. Which filter is used for normalization in WEKA Knowledge Flow?
4. Can categorical data be normalized? Why or why not?
5. What range is used for normalized values in WEKA?
6. How is Knowledge Flow useful compared to WEKA Explorer?
Experiment No.: 7
Title: Finding Association Rules for Buying Data
Objective:
To discover association rules from a transactional buying dataset using the Apriori
algorithm in WEKA.
Software / Tool Used:
WEKA (Explorer Interface)
Theory:
Association rule mining is used to uncover relationships between items in large transactional
datasets. A classic example is Market Basket Analysis, which extracts rules such as:
If a customer buys Bread, then they are likely to buy Butter.
Key terms:
Support: the fraction of transactions that contain the itemset; for a rule X => Y, support = P(X and Y).
Confidence: the likelihood of the consequent given the antecedent; confidence(X => Y) = support(X and Y) / support(X).
Lift: the ratio of observed support to the support expected under independence; lift(X => Y) = confidence(X => Y) / support(Y).
The Apriori algorithm finds frequent itemsets and generates rules from them that satisfy
minimum support and confidence thresholds.
Dataset Used:
BuyingData.arff (A dataset containing transactions like: milk, bread, butter, tea, coffee, etc.)
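Because WEKA's Apriori operates on nominal attributes, market-basket data is typically encoded with one TRUE/FALSE attribute per item. A minimal hypothetical ARFF sketch for such a file (the item names follow the dataset description above; the two data rows are purely illustrative):
@relation buying
@attribute milk {TRUE, FALSE}
@attribute bread {TRUE, FALSE}
@attribute butter {TRUE, FALSE}
@attribute tea {TRUE, FALSE}
@attribute coffee {TRUE, FALSE}
@data
TRUE, TRUE, TRUE, FALSE, FALSE
FALSE, TRUE, TRUE, TRUE, FALSE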
Procedure:
1. Open WEKA GUI Chooser → Click Explorer.
2. Under the Preprocess tab, click Open File and load BuyingData.arff.
3. Go to the Associate tab.
4. Select the algorithm Apriori from the list.
5. (Optional) Click on Choose → Apriori to modify parameters like:
o Minimum Support (default: 0.1)
o Minimum Confidence (default: 0.9)
o Number of Rules (default: 10)
6. Click Start to run the algorithm.
7. View the generated rules in the result window under Associator Output.
Sample Output:
1. butter=TRUE => bread=TRUE conf:(0.85)
2. milk=TRUE bread=TRUE => tea=TRUE conf:(0.78)
...
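Sample Code (running Apriori via the WEKA Java API, optional):
A minimal sketch, assuming BuyingData.arff contains only nominal attributes; the parameter values mirror the defaults listed in the procedure above.
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MineBuyingRules {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("BuyingData.arff");

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);              // number of rules to report
        apriori.setMinMetric(0.9);            // minimum confidence
        apriori.setLowerBoundMinSupport(0.1); // minimum support
        apriori.buildAssociations(data);

        // Prints the frequent itemsets and the best rules found
        System.out.println(apriori);
    }
}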
Result:
Association rules were successfully extracted from the buying data using the Apriori
algorithm in WEKA, revealing patterns in customer purchasing behavior.
Conclusion:
Association rule mining helps in identifying item co-occurrence patterns in large datasets.
The Apriori algorithm is efficient and widely used for such tasks in retail and
recommendation systems.
Viva Questions:
1. What is an association rule? Provide an example.
2. Define support, confidence, and lift.
3. What is the purpose of the Apriori algorithm?
4. What does a high confidence value indicate in a rule?
5. How can association rules help in retail businesses?
6. How do you set minimum support and confidence in WEKA?
Experiment No.: 8
Title: Finding Association Rules for Banking Data
Objective:
To extract meaningful association rules from banking transactional data using the Apriori
algorithm in WEKA.
Software / Tool Used:
WEKA (Explorer Interface)
Theory:
Association rule mining helps uncover interesting relationships among variables in large
datasets. In the context of banking data, it can reveal patterns like:
If a customer has a savings account, they are likely to apply for a loan.
Key concepts:
Support: How frequently an itemset appears in the dataset.
Confidence: Likelihood that the consequent holds when the antecedent is present.
Lift: Strength of a rule over random co-occurrence.
WEKA’s Apriori algorithm identifies frequent itemsets and generates rules using defined
thresholds for support and confidence.
Dataset Used:
BankingData.arff
(Sample attributes may include: Has_Savings_Account, Applies_For_Loan,
Owns_Credit_Card, Has_Mobile_Banking, etc.)
Procedure:
1. Open WEKA GUI Chooser → Select Explorer.
2. Go to the Preprocess tab and load BankingData.arff.
3. Navigate to the Associate tab.
4. Choose the algorithm Apriori.
5. (Optional) Adjust parameters such as:
o Minimum support (e.g., 0.2)
o Minimum confidence (e.g., 0.8)
o Number of rules to display (e.g., 10)
6. Click Start to run the algorithm.
7. Review the Associator Output pane to analyze the discovered rules.
Sample Output:
1. Has_Savings_Account=TRUE => Applies_For_Loan=TRUE conf:(0.82)
2. Owns_Credit_Card=TRUE Has_Mobile_Banking=TRUE => Applies_For_Loan=TRUE
conf:(0.77)
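Sample WEKA Command Line (Optional):
The same run can be launched from the command line, in the style of the filter commands in Experiments 3 and 4; here -t names the input file, -N the number of rules, -C the minimum confidence, and -M the lower bound on support (values match the example parameters above):
java weka.associations.Apriori -t BankingData.arff -N 10 -C 0.8 -M 0.2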
Result:
The Apriori algorithm successfully mined association rules from the banking dataset,
revealing patterns in customers’ financial product usage.
Conclusion:
Association rule mining is a powerful tool to analyze customer behavior. In banking, it can be
used for cross-selling, risk analysis, and personalized marketing.
Viva Questions:
1. What are association rules and how are they used in banking?
2. Define support and confidence in the context of data mining.
3. Why is the Apriori algorithm suitable for rule mining?
4. How can association rules help with cross-selling banking products?
5. What does a confidence value of 0.9 imply?
6. Can WEKA generate association rules for numeric attributes directly?