Data Mining Using Weka Knowledge Flow Environment
Project Report
By
Mustansar Ali (B210317031)
Regd.No: 2021-UOK-04831
Department: Artificial Intelligence
Session: 2021-2025
Subject: Data Mining
Submitted to: Dr. Waqar Malik
Date of Submission: April 12, 2024
DEPARTMENT OF ARTIFICIAL INTELLIGENCE
FACULTY OF COMPUTING AND ENGINEERING
UNIVERSITY OF KOTLI AZAD JAMMU AND KASHMIR
Data Mining Project Report
Contents
Abstract
1. Introduction to WEKA
2. Introduction to Knowledge Flow
3. Creating a Knowledge Flow in WEKA
4. Classification
   4.1 Performing Classification
   4.2 Comparing the algorithms' performance
5. Association
   5.1 Performing Association
   5.2 Comparing the performance
6. Clustering
   6.1 Performing Clustering
   6.2 Comparing the performance
7. Conclusion
Table of Figures
Figure 1 Classification - Random forest
Figure 2 Results - Random forest
Figure 3 Results - Linear regression
Figure 4 Association - Filtered associator
Figure 5 Association - Filtered associator results
Figure 6 Association - Apriori
Figure 7 Apriori - results
Figure 8 Clustering - Hierarchical
Figure 9 Hierarchical - results
Figure 10 Canopy - clustering
Figure 11 Canopy clustering - results
Abstract
Machine learning plays a pivotal role in extracting meaningful insights and making predictions
from vast datasets across various domains. In this project, we explore the utilization of Weka, an open-
source machine learning software tool, specifically focusing on its Knowledge Flow interface. Our
objective is to investigate the capabilities of Weka's Knowledge Flow in performing classification,
association, and clustering tasks. We begin by providing an overview of Weka and Knowledge Flow,
elucidating their significance in the machine learning workflow. Subsequently, we delve into the
methodology of creating a knowledge flow in Weka and the necessary steps for data preparation.
Utilizing a range of classification, association, and clustering algorithms available in Weka, we
conduct experiments on diverse datasets to evaluate the performance of these tasks. Evaluation metrics
including accuracy, precision, recall, F1 score, and relevant error measures are employed to assess the
effectiveness of the implemented algorithms. Through comprehensive analysis and interpretation of
results, we aim to gain insights into the capabilities and limitations of Weka's Knowledge Flow for
practical machine learning applications. This project contributes to enhancing understanding and
proficiency in leveraging Weka for diverse data analysis and predictive modeling tasks.
1. Introduction to WEKA
Weka, standing for "Waikato Environment for Knowledge Analysis," is a comprehensive suite
of machine learning software tools developed at the University of Waikato in New Zealand. It offers
a wide range of algorithms for data preprocessing, classification, regression, clustering, association
rule mining, and feature selection. Weka is written in Java, making it platform-independent and easily
accessible across different operating systems.
One of the distinguishing features of Weka is its user-friendly graphical user interface (GUI),
which facilitates interactive data exploration and experimentation. The GUI provides intuitive
visualization tools for data analysis, allowing users to visualize datasets, explore attribute distributions,
and inspect model outputs. Additionally, Weka offers a command-line interface for scripting and batch
processing tasks, providing flexibility for automation and integration into larger workflows.
Weka's extensive collection of machine learning algorithms includes both classic techniques
and state-of-the-art methods, making it suitable for a wide range of applications and research purposes.
Moreover, Weka's open-source nature encourages collaboration and community contributions,
fostering continuous development and enhancement of its functionalities.
In addition to its standalone capabilities, Weka can be seamlessly integrated into other software
environments through its Java API, enabling developers to incorporate machine learning
functionalities into custom applications and workflows. Furthermore, Weka supports interoperability
with other data analysis and visualization tools through standard data formats and interfaces, enhancing
its versatility and compatibility with existing software ecosystems.
Overall, Weka serves as a valuable resource for researchers, educators, and practitioners in the
field of machine learning and data mining. Its ease of use, extensive functionality, and active
community support make it a popular choice for various data analysis and predictive modeling tasks.
2. Introduction to Knowledge Flow
In the realm of machine learning and data mining, Knowledge Flow represents a graphical user
interface (GUI) paradigm designed to streamline the process of designing, implementing, and
evaluating machine learning workflows. It provides an intuitive and visual way for users to construct
data processing pipelines, apply machine learning algorithms, and analyze results in a systematic
manner.
Knowledge Flow interfaces, such as the one provided in the Weka software suite, offer users a
visual representation of their data analysis workflows, akin to a flowchart. Users can drag and drop
components representing data sources, preprocessing steps, feature selection techniques, machine
learning algorithms, and evaluation methods onto a canvas, and then connect them together to form a
coherent workflow.
One of the key advantages of Knowledge Flow is its ability to simplify complex machine
learning tasks, making them more accessible to users with varying levels of expertise. By abstracting
away the intricacies of programming and algorithm implementation, Knowledge Flow empowers users
to focus on the conceptual aspects of their data analysis tasks, such as selecting appropriate algorithms,
tuning parameters, and interpreting results.
Furthermore, Knowledge Flow facilitates reproducibility and transparency in machine learning
experiments by providing a visual representation of the entire workflow, including data preprocessing
steps, model configurations, and evaluation metrics. This transparency enables users to easily track the
sequence of operations performed on the data and understand the rationale behind the decisions made
during the analysis process.
Another notable feature of Knowledge Flow is its support for interactive experimentation and
real-time feedback. Users can iteratively refine their workflows by adjusting parameters, swapping out
algorithms, and visualizing intermediate results, allowing for rapid prototyping and hypothesis testing.
Overall, Knowledge Flow represents a powerful tool for designing, implementing, and evaluating
machine learning workflows in a visual and interactive manner. Its intuitive interface, support for
reproducibility, and facilitation of experimentation make it a valuable asset for researchers, educators,
and practitioners in the field of machine learning and data mining.
3. Creating a Knowledge Flow in WEKA
Launching Weka:
Start by launching the Weka application on your computer. Weka provides a user-friendly
graphical interface that allows us to create and visualize Knowledge Flows.
Opening the Knowledge Flow Environment:
Once Weka is launched, the GUI Chooser window appears. Click the "KnowledgeFlow" button
(alongside "Explorer" and "Experimenter") to open the Knowledge Flow environment within the Weka
interface.
Understanding the Knowledge Flow Interface:
The Knowledge Flow interface consists of several panels:
• Toolbar: Contains tools for adding components to the canvas, running the flow, saving the flow,
etc.
• Canvas: The main area where you design your flow by adding components and connecting them.
• Component Palette: Contains a list of available components such as data sources, preprocessing
filters, classifiers, evaluators, etc.
• Properties Panel: Displays the properties of the selected component, allowing you to modify its
settings.
Adding Components to the Canvas:
To add a component to the canvas, simply drag it from the Component Palette onto the Canvas.
Components can include:
• Data Sources: Represent the input data for your analysis (e.g., ARFF files, databases).
• Preprocessing Filters: Apply transformations or cleanups to your data (e.g., attribute selection,
normalization).
• Classifiers: Algorithms used for classification tasks (e.g., decision trees, support vector machines).
• Evaluators: Assess the performance of your model (e.g., cross-validation, holdout evaluation).
Connecting Components:
After adding components to the canvas, connect them together to define the flow of data and
operations. To connect components, click on the output port of one component and drag the cursor to
the input port of another component.
Configuring Component Properties:
After adding components to the canvas, we can configure their properties by selecting the
component and adjusting its settings in the Properties Panel. For example, we can specify parameters
for classifiers, set options for preprocessing filters, or define evaluation metrics for evaluators.
Running the Knowledge Flow:
Once your Knowledge Flow is set up, we can execute it by clicking on the "Run" button in the
Toolbar. Weka will process the data according to the defined flow, applying preprocessing steps,
training classifiers, and evaluating the model's performance.
Analyzing Results:
After running the Knowledge Flow, you can analyze the results by inspecting output messages
in the Console Panel and viewing visualization outputs (e.g., ROC curves, confusion matrices)
generated by evaluators.
Saving and Exporting:
Once you're satisfied with your Knowledge Flow, you can save it for future use by clicking on
the "Save" button in the Toolbar. You can also export the flow as an XML file or share it with others.
Iterative Refinement:
Knowledge Flow allows for iterative refinement of your analysis. You can modify components,
adjust parameters, and rerun the flow to experiment with different settings and improve model
performance.
By following these steps, we can effectively create and execute Knowledge Flows in Weka for
various machine learning tasks. Experimentation and exploration within the Knowledge Flow
environment enable users to gain deeper insights into their data and develop robust predictive models.
Next, we perform classification, association, and clustering in turn and discuss the
performance of different algorithms on different datasets.
4. Classification
Classification in machine learning is a supervised learning task where the goal is to categorize
input data into predefined classes based on their features. It involves training a model on labeled data
to learn the relationships between input features and class labels, enabling it to predict the class labels
of unseen instances.
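As a minimal illustration of this idea, independent of Weka, a nearest-neighbour classifier predicts the class of an unseen instance from its closest labelled example. The feature values and labels below are hypothetical (the "done"/"not done" labels merely echo the checkup-status class used later):

```python
import math

def nearest_neighbour_predict(train, query):
    """Predict the class of `query` as the class of the closest
    training instance (1-nearest-neighbour, Euclidean distance)."""
    best_label, best_dist = None, math.inf
    for features, label in train:
        dist = math.dist(features, query)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical labelled data: (features, class label)
train = [((1.0, 1.0), "done"), ((1.2, 0.9), "done"),
         ((5.0, 5.0), "not_done"), ((5.5, 4.8), "not_done")]

print(nearest_neighbour_predict(train, (1.1, 1.0)))  # -> done
print(nearest_neighbour_predict(train, (5.2, 5.1)))  # -> not_done
```

The point is only the supervised pattern: a model learns from labelled instances and assigns a class to unseen ones.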
4.1 Performing Classification
For the classification task I used a dataset of my own, created earlier as a course assignment. It
relates to doctors' appointments and checkups, and the class variable is the checkup status: done or
not done.
To build the data flow diagram for classification, I first opened the Weka software and then the
Knowledge Flow environment.
The components I used and the steps I followed are as follows:
• First of all, I used a CSVLoader to load my dataset.
• Then I added a ClassAssigner, which assigns the class attribute, i.e., the target variable the
model will predict.
• Then I added a CrossValidationFoldMaker, which splits the dataset into folds for
cross-validation, a technique that assesses a model by training and testing it on different
subsets of the data.
• Then I selected the RandomForest algorithm to train the model.
• Then I used a ClassifierPerformanceEvaluator, which evaluates the model using metrics such as
accuracy, precision, recall, and F1 score, allowing us to assess how accurately it classifies
instances.
• Finally, I connected a TextViewer so that the results can be viewed.
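The roles of the fold maker and the performance evaluator can also be sketched outside the GUI. The plain-Python sketch below mirrors the idea with a hypothetical dataset and a trivial majority-class "classifier" as a stand-in; it is not Weka's implementation:

```python
import random

def cross_validation_folds(instances, k, seed=1):
    """Split instances into k folds; each fold serves once as the
    test set while the rest form the training set."""
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test

def majority_class(train):
    """A trivial 'classifier': always predict the most common class."""
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)

# Hypothetical appointment data: (features, checkup status)
data = [((i, i % 3), "done" if i % 4 else "not_done") for i in range(20)]

accuracies = []
for train, test in cross_validation_folds(data, k=5):
    predicted = majority_class(train)
    correct = sum(1 for _, label in test if label == predicted)
    accuracies.append(correct / len(test))

print(sum(accuracies) / len(accuracies))  # average accuracy across folds
```

Weka's CrossValidationFoldMaker performs the equivalent splitting internally; the evaluator then aggregates per-fold results just as the final average does here.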
Here is the final flow in the Knowledge Flow environment after completing the steps above.
Figure 1 Classification - Random forest
Displaying the output in the TextViewer gives the following results.
Figure 2 Results -- Random forest
To see which algorithm performs better on my dataset, I then implemented a second one: linear
regression. The setup was identical; I simply replaced RandomForest with LinearRegression. Applying
linear regression produced the following results.
Figure 3 Results -- Linear regression
4.2 Comparing the algorithms performance
Metric                       RandomForest   LinearRegression
Correlation Coefficient      0.9937         1
Mean Absolute Error          0.0209         0
Root Mean Squared Error      0.0512         0.0004
Relative Absolute Error      5.8906%        0.0084%
Root Relative Squared Error  12.1714%       0.097%
Total Number of Instances    4100           4100
Analysis:
Correlation Coefficient: LinearRegression achieved a perfect correlation coefficient of 1, indicating
a perfect linear relationship between predicted and actual values. RandomForest also achieved a very
high correlation coefficient of 0.9937, indicating a strong correlation.
Mean Absolute Error and Root Mean Squared Error: LinearRegression achieved a mean
absolute error of 0 and a root mean squared error of only 0.0004, suggesting that its predictions
almost exactly match the actual values. RandomForest had slightly higher errors, with a mean
absolute error of 0.0209 and a root mean squared error of 0.0512.
Relative Absolute Error and Root Relative Squared Error: LinearRegression had extremely low
relative absolute error and root relative squared error values, indicating minimal deviation from
actual values. RandomForest had higher error percentages, but still relatively low compared to the
scale of the data.
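These error measures follow standard definitions: the relative errors compare a model against a baseline that always predicts the mean of the actual values. The sketch below computes them directly on made-up numbers (not the project's data):

```python
import math

def error_metrics(actual, predicted):
    """MAE, RMSE, and the relative errors (RAE, RRSE, as percentages)
    for a list of numeric predictions."""
    n = len(actual)
    mean_actual = sum(actual) / n
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    # Relative errors: ratio to the errors of always predicting the mean.
    rae = mae / (sum(abs(a - mean_actual) for a in actual) / n)
    rrse = rmse / math.sqrt(sum((a - mean_actual) ** 2 for a in actual) / n)
    return mae, rmse, rae * 100, rrse * 100

actual = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0]
predicted = [0.9, 0.1, 1.0, 0.0, 0.8, 1.0]
mae, rmse, rae, rrse = error_metrics(actual, predicted)
print(round(mae, 4), round(rmse, 4))  # MAE ≈ 0.0667, RMSE ≈ 0.1
```

A relative error near 0% therefore means the model vastly outperforms the mean baseline, which is what the table above shows for both algorithms.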
Conclusion: Both algorithms performed exceptionally well, but LinearRegression achieved slightly
better results in terms of accuracy and error metrics.
5. Association
Association in data mining refers to the process of discovering interesting relationships or patterns
among variables in large datasets. Unlike classification, which predicts a target variable based on input
features, association analysis focuses on identifying correlations or associations between different
variables without necessarily predicting an outcome. A common application of association analysis is
in market basket analysis, where the goal is to uncover relationships between items purchased together
by customers. The most well-known algorithm for association analysis is Apriori, which identifies
frequent itemsets and generates association rules based on their occurrence patterns. These association
rules provide valuable insights into customer behavior, purchasing patterns, and product
recommendations, helping businesses optimize their marketing strategies and improve customer
satisfaction.
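The quantities such rules are ranked by, support, confidence, and lift, are straightforward to compute. Below is a minimal stdlib-Python sketch on a hypothetical market-basket list (illustrative only, not Weka's Apriori code):

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent ==> consequent.
    `transactions` is a list of sets; antecedent/consequent are sets."""
    n = len(transactions)
    a = sum(1 for t in transactions if antecedent <= t)
    c = sum(1 for t in transactions if consequent <= t)
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = both / n            # how often the rule applies at all
    confidence = both / a         # P(consequent | antecedent)
    lift = confidence / (c / n)   # improvement over baseline frequency
    return support, confidence, lift

# Hypothetical toy basket data
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
print(rule_metrics(baskets, {"bread"}, {"milk"}))
```

A lift above 1 means the antecedent makes the consequent more likely than its baseline frequency; this is why the rules reported later are compared by their Conf and Lift values.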
5.1 Performing Association
The dataset used in this section is weather.nominal, which I downloaded from GitHub; it also ships
with Weka by default so that it can be used for educational purposes. As the name indicates, it is
a nominal dataset suited to association rule mining. I applied two algorithms (FilteredAssociator
and Apriori) to this dataset and then compared their results. Here are the steps I followed to
build the Knowledge Flow diagram for association rule mining.
• First of all, I used an ArffLoader to load the dataset into the Knowledge Flow environment.
• Then I added a ClassAssigner, which specifies the target variable for association rule mining.
• Then I added a CrossValidationFoldMaker, which splits the dataset into folds for
cross-validation, a technique that assesses a model by training and testing it on different
subsets of the data.
• Then I selected the FilteredAssociator, an association rule mining algorithm, to generate
association rules.
• Finally, I connected a TextViewer so that the results can be viewed.
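For intuition, the level-wise itemset search at the heart of Apriori can be sketched independently of Weka. The transactions below are hypothetical attribute-value pairs in the style of the weather data; Weka's Apriori additionally derives rules from these itemsets and iteratively lowers the support threshold:

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise search for frequent itemsets: a size-k candidate is
    only kept if all its (k-1)-subsets were frequent (Apriori pruning)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent, current = {}, [frozenset([i]) for i in items]
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items()
                 if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        # Candidates: unions of this level's sets, pruned by their subsets.
        current = [u for u in
                   {a | b for a in level for b in level if len(a | b) == k}
                   if all(frozenset(s) in level
                          for s in combinations(u, k - 1))]
    return frequent  # maps each frequent itemset to its support

baskets = [{"outlook=sunny", "play=no"}, {"outlook=sunny", "play=no"},
           {"outlook=overcast", "play=yes"}, {"outlook=rainy", "play=yes"}]
freq = apriori_frequent_itemsets(baskets, min_support=0.5)
print(sorted(tuple(sorted(s)) for s in freq))
```

On this toy input, the three frequent single items and the pair {outlook=sunny, play=no} survive the 0.5 support threshold.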
Here is the final flow in the Knowledge Flow environment after completing the steps above.
FilteredAssociator algorithm:
Figure 4 Association -- filtered associator
On execution, the following results are obtained.
Figure 5 Association -- filtered associator results
Apriori algorithm:
Figure 6 Association -- apriori
Figure 7 Apriori - results
5.2 Comparing the performance
Metric               Apriori   FilteredAssociator
Minimum Support      0.15      0.2
Minimum Confidence   0.9       0.9
Number of Cycles     17        16
Number of Rules      10        10

Best rules (identical for both algorithms):
• Temperature=cool ==> Humidity=normal (Conf: 1, Lift: 1.71)
• Humidity=normal ^ Windy=FALSE ==> Play=yes (Conf: 1, Lift: 1.5)
• Outlook=overcast ==> Play=yes (Conf: 1, Lift: 1.5)
• Outlook=rainy ^ Play=yes ==> Windy=FALSE (Conf: 1, Lift: 1.71)
• Outlook=sunny ^ Humidity=high ==> Temperature=hot (Conf: 1, Lift: 3)
Analysis:
• Both algorithms (Apriori and Filtered Associator) produced similar results in terms of the generated
association rules, with identical support, confidence, and lift for the top rules.
• The minimum confidence threshold was the same for both algorithms, while the minimum support
differed slightly (0.15 for Apriori versus 0.2 for the FilteredAssociator).
• The algorithms required a similar number of cycles (17 versus 16) and generated the same number
of rules (10 each).
• The top association rules discovered by both algorithms are identical, indicating consistency in the
patterns identified from the dataset.
Overall, both association rule mining algorithms produced comparable results, suggesting that the
choice between them may depend on other factors such as computational efficiency, ease of use, or
additional customization options provided by the Filtered Associator algorithm.
6. Clustering
Clustering in the realm of machine learning is an unsupervised learning technique aimed at
organizing a dataset into groups or clusters where instances within the same group exhibit similar
characteristics or patterns. Unlike supervised learning, clustering does not involve labeled data;
instead, it seeks to discover intrinsic structures within the data based solely on the attributes of the
instances. The objective of clustering is to partition the dataset in such a way that instances within the
same cluster are more similar to each other than to those in other clusters, while maximizing the
dissimilarity between clusters. Common clustering algorithms include K-means, hierarchical
clustering, and DBSCAN, each with its own approach to defining clusters based on distance metrics,
density, or connectivity. Clustering finds applications in various domains such as customer
segmentation, anomaly detection, image segmentation, and document clustering, providing valuable
insights into the underlying structure and patterns present in the data.
6.1 Performing Clustering
For clustering I used the same weather data as in association rule mining, but in its numeric form:
the weather.numeric dataset, which also ships with Weka in ARFF file format. To draw the data flow
diagram in the Weka Knowledge Flow, I performed the following steps.
• First of all, I used an ArffLoader to load my dataset.
• Then I added a ClassAssigner, which assigns the class attribute, i.e., the target variable.
• Then I added a CrossValidationFoldMaker, which splits the dataset into folds for
cross-validation, a technique that assesses a model by training and testing it on different
subsets of the data.
• Then I selected the HierarchicalClusterer to build the clusters, with Euclidean distance as the
distance function and the number of clusters set to 12.
• Then I used a ClustererPerformanceEvaluator, which assesses the quality of the clustering
results, providing metrics that reflect the compactness and separation of clusters.
• Finally, I connected a TextViewer so that the results can be viewed.
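The core of agglomerative hierarchical clustering with a Euclidean distance function can be sketched in plain Python. This single-linkage version and its data are illustrative only, not Weka's HierarchicalClusterer:

```python
import math

def single_linkage(points, num_clusters):
    """Agglomerative clustering: start with one cluster per point and
    repeatedly merge the two clusters whose closest members are nearest
    (single linkage, Euclidean distance) until num_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

# Hypothetical numeric weather readings: (temperature, humidity)
points = [(64, 65), (65, 70), (68, 80), (83, 86), (85, 85), (80, 90)]
for c in single_linkage(points, num_clusters=2):
    print(sorted(c))
```

On this toy input the two natural groups (cool/dry versus hot/humid readings) emerge as the two remaining clusters.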
Hierarchical clustering:
Figure 8 Clustering - Hierarchical
Results:
Figure 9 Hierarchical - results
Canopy clustering:
Figure 10 Canopy - clustering
Results:
Figure 11 Canopy clustering - results
6.2 Comparing the performance
Metric                            Canopy Clustering   Hierarchical Clustering
Number of Clusters                12                  2
Correctly Clustered Instances     1 (100%)            1 (100%)
Incorrectly Clustered Instances   0 (0%)              0 (0%)
Analysis:
• Number of Clusters: Canopy Clustering identified 12 clusters, whereas Hierarchical Clustering
identified 2 clusters.
• Correctly Clustered Instances: Both algorithms correctly clustered all instances, with a 100%
accuracy rate.
• Incorrectly Clustered Instances: Both algorithms did not have any incorrectly clustered instances.
Overall, both clustering algorithms achieved perfect clustering performance in terms of
correctly clustering instances. However, they differ in the number of clusters they identified and their
underlying clustering methodologies. Canopy Clustering tends to produce a larger number of clusters
based on a pre-defined radius threshold, while Hierarchical Clustering builds a hierarchical structure
of clusters based on the proximity of instances. The choice between the two algorithms may depend
on the specific characteristics of the dataset and the desired level of granularity in clustering.
7. Conclusion
Throughout this project, we explored the capabilities of Weka, a powerful tool for data mining
and machine learning. Through the implementation of various tasks including classification,
association rule mining, and clustering, we gained valuable insights into the underlying patterns and
relationships within our datasets. We utilized a range of algorithms such as Random Forest, Apriori,
and Canopy Clustering to analyze and extract meaningful information from the data. Our meticulous
evaluation and comparison of different algorithms demonstrated their effectiveness in addressing
diverse data mining tasks. Ultimately, this project underscores the significance of leveraging advanced
data mining techniques and tools like Weka to uncover actionable insights, optimize decision-making
processes, and drive innovation in various domains.