An Introduction to WEKA
Contributed by Yizhou Sun
2008
Content
What is WEKA?
The Explorer:
Preprocess data
Classification
Clustering
Association Rules
Attribute Selection
Data Visualization
References and Resources
2 06/16/21
What is WEKA?
Waikato Environment for Knowledge Analysis
It’s a data mining/machine learning tool developed by the
Department of Computer Science, University of
Waikato, New Zealand.
Weka is also a bird found only on the islands of New
Zealand.
Download and Install WEKA
Website:
http://www.cs.waikato.ac.nz/~ml/weka/index.html
Supports multiple platforms (written in Java):
Windows, Mac OS X and Linux
Main Features
49 data preprocessing tools
76 classification/regression algorithms
8 clustering algorithms
3 algorithms for finding association rules
15 attribute/subset evaluators + 10 search
algorithms for feature selection
Main GUI
Three graphical user interfaces
“The Explorer” (exploratory data analysis)
“The Experimenter” (experimental
environment)
“The KnowledgeFlow” (new process-model-inspired
interface)
Explorer: pre-processing the data
Data can be imported from a file in various formats:
ARFF, CSV, C4.5, binary
Data can also be read from a URL or from an SQL
database (using JDBC)
Pre-processing tools in WEKA are called “filters”
WEKA contains filters for:
Discretization, normalization, resampling, attribute
selection, transforming and combining attributes, …
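As a rough illustration of what two of these filter types do, here is a minimal Python sketch of min-max normalization and equal-width discretization. It is not WEKA's implementation: the function names `normalize` and `discretize`, the choice of three bins, and the sample values are all assumptions made for this example.

```python
def normalize(values):
    """Min-max scale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, bins):
    """Equal-width binning: map each value to a bin index 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]


ages = [63, 67, 67, 38]               # illustrative values only
scaled = normalize(ages)
binned = discretize(ages, bins=3)
print(scaled)
print(binned)
```

Both functions derive their range from the data they are given, so a filter fitted on training data would need to store `lo` and `hi` before being applied to new data.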
WEKA only deals with “flat” files
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
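To make the structure of the flat file concrete, the following Python sketch reads the four data rows shown above into dictionaries. It is a simplified reader written for this example, not WEKA's ARFF loader: the helper name `parse_arff` is hypothetical, and it assumes unquoted names, only numeric and nominal attributes, and dense (non-sparse) data.

```python
ARFF_TEXT = """\
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
"""


def parse_arff(text):
    """Parse a simple ARFF string into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):      # skip blanks and comments
            continue
        if in_data:
            cells = [c.strip() for c in line.split(",")]
            row = {}
            for (name, kind), cell in zip(attributes, cells):
                if cell == "?":                   # "?" marks a missing value
                    row[name] = None
                elif kind == "numeric":
                    row[name] = float(cell)
                else:
                    row[name] = cell              # nominal value kept as text
            rows.append(row)
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):              # nominal: list of allowed values
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, len(attributes), len(rows))
```

The header declares one attribute per column, and each data line is one instance, which is what "flat" means here: no nested or relational structure.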
11 University of Waikato 06/16/21
Explorer: building “classifiers”
Classifiers in WEKA are models for predicting
nominal or numeric quantities
Implemented learning schemes include:
Decision trees and lists, instance-based classifiers,
support vector machines, multi-layer perceptrons,
logistic regression, Bayes’ nets, …
Decision Tree Induction: Training Dataset
This follows an example of Quinlan’s ID3 (Playing Tennis)
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
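ID3 chooses the attribute to split on by information gain. The Python sketch below computes the gain of each attribute on the 14-row buys_computer training set to show why "age?" ends up at the root. The function names `entropy` and `information_gain` are chosen for this example; it illustrates the selection criterion only and is not WEKA's tree learner.

```python
from collections import Counter
from math import log2

# Columns: age, income, student, credit_rating, buys_computer (class)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRIBUTES = ["age", "income", "student", "credit_rating"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, index):
    """Entropy of the class minus the weighted entropy after splitting on a column."""
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[index], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("root attribute:", best)
```

On this data, age has the highest gain, which matches the root of the tree. Applying the same computation recursively within each age branch would select student and credit_rating for the two impure branches.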
Explorer: finding associations
WEKA contains an implementation of the Apriori
algorithm for learning association rules
Works only with discrete data
Can identify statistical dependencies between groups
of attributes:
milk, butter ⇒ bread, eggs (with confidence 0.9 and
support 2000)
Apriori can compute all rules that have a given
minimum support and exceed a given confidence
Basic Concepts: Frequent Patterns
Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk
itemset: A set of one or more items
k-itemset X = {x1, …, xk}
(absolute) support, or support count, of X: frequency or
number of occurrences of an itemset X
(relative) support, s, is the fraction of transactions that
contain X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X’s support is no less than a
minsup threshold
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
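The definitions above can be turned into a small level-wise search in the style of Apriori. The Python sketch below mines the five transactions from this slide; the threshold minsup = 50% is an assumption for the example, as are the function names. It is a teaching sketch, not WEKA's Apriori implementation.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Beer", "Nuts", "Diaper"},                      # Tid 10
    {"Beer", "Coffee", "Diaper"},                    # Tid 20
    {"Beer", "Diaper", "Eggs"},                      # Tid 30
    {"Nuts", "Eggs", "Milk"},                        # Tid 40
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},    # Tid 50
]


def support_count(itemset, transactions):
    """Absolute support: number of transactions containing every item."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, minsup):
    """Return {itemset: support count} for itemsets with relative support >= minsup."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    result = {}
    k = 1
    while candidates:
        level = {}
        for c in candidates:
            count = support_count(c, transactions)
            if count / n >= minsup:
                level[c] = count
        result.update(level)
        k += 1
        # Join step: unite frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-subsets are all frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return result


frequent = frequent_itemsets(TRANSACTIONS, minsup=0.5)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step relies on the fact that every subset of a frequent itemset must itself be frequent, so a candidate with an infrequent subset never needs to be counted.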
Basic Concepts: Association Rules
Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk
Find all the rules X ⇒ Y with minimum support and confidence
support, s, probability that a transaction contains X ∪ Y
confidence, c, conditional probability that a transaction
having X also contains Y
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3,
{Beer, Diaper}:3
Association rules: (many more!)
Beer ⇒ Diaper (60%, 100%)
Diaper ⇒ Beer (60%, 75%)
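The two rule figures on this slide can be checked directly from the definitions of support and confidence. The Python sketch below does so for the five transactions shown; the helper names `support` and `rule_metrics` are made up for this example.

```python
TRANSACTIONS = [
    {"Beer", "Nuts", "Diaper"},                      # Tid 10
    {"Beer", "Coffee", "Diaper"},                    # Tid 20
    {"Beer", "Diaper", "Eggs"},                      # Tid 30
    {"Nuts", "Eggs", "Milk"},                        # Tid 40
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},    # Tid 50
]


def support(itemset, transactions):
    """Relative support: fraction of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent => consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(antecedent | consequent, transactions)
    return both, both / support(antecedent, transactions)


for x, y in [({"Beer"}, {"Diaper"}), ({"Diaper"}, {"Beer"})]:
    s, c = rule_metrics(x, y, TRANSACTIONS)
    print(f"{sorted(x)} => {sorted(y)}: support {s:.0%}, confidence {c:.0%}")
```

Both rules share the same support because support depends only on the union X ∪ Y. The confidences differ because each is divided by the support of its own antecedent.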
Explorer: attribute selection
Panel that can be used to investigate which (subsets of)
attributes are the most predictive ones
Attribute selection methods contain two parts:
A search method: best-first, forward selection, random,
exhaustive, genetic algorithm, ranking
An evaluation method: correlation-based, wrapper,
information gain, chi-squared, …
Very flexible: WEKA allows (almost) arbitrary
combinations of these two
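The split into a search method and an evaluation method can be sketched in Python as two independent functions. This is an assumption-laden toy, not WEKA's code: the search is plain greedy forward selection, the evaluator `subset_score` is a hypothetical wrapper-style score (training accuracy of a majority-vote lookup table), and the data is the buys_computer table from the classification slides.

```python
from collections import Counter, defaultdict

# Columns: age, income, student, credit_rating, buys_computer (class)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
NAMES = ["age", "income", "student", "credit_rating"]


def subset_score(rows, subset):
    """Evaluation method (toy): training accuracy of a majority-vote
    lookup table keyed on the chosen columns."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[i] for i in subset)][r[-1]] += 1
    return sum(max(c.values()) for c in groups.values()) / len(rows)


def forward_selection(rows, n_attributes, evaluate):
    """Search method: greedily add the attribute that most improves the score."""
    selected, best = [], evaluate(rows, [])
    while len(selected) < n_attributes:
        scored = [(evaluate(rows, selected + [i]), i)
                  for i in range(n_attributes) if i not in selected]
        score, index = max(scored, key=lambda t: t[0])
        if score <= best:             # stop when nothing improves the score
            break
        selected.append(index)
        best = score
    return selected, best


selected, score = forward_selection(ROWS, len(NAMES), subset_score)
print([NAMES[i] for i in selected], score)
```

Because `forward_selection` takes the evaluator as an argument, a different scoring function could be swapped in without touching the search, which is the flexibility the slide describes. Training accuracy is an optimistic score; a real wrapper would typically use cross-validation.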
Explorer: data visualization
Visualization very useful in practice: e.g. helps to
determine difficulty of the learning problem
WEKA can visualize single attributes (1-d) and pairs
of attributes (2-d)
To do: rotating 3-d visualizations (Xgobi-style)
Color-coded class values
“Jitter” option to deal with nominal attributes (and to
detect “hidden” data points)
“Zoom-in” function
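The idea behind jitter can be shown with a few lines of Python: nominal values plotted as integer codes land on top of each other, and a small random offset separates them. The function name `jitter`, the offset size, and the sample points are assumptions for this sketch; WEKA's own jitter control may work differently.

```python
import random


def jitter(points, amount, seed=0):
    """Shift each 2-d point by a uniform random offset in [-amount, amount]."""
    rng = random.Random(seed)         # fixed seed keeps the plot reproducible
    return [(x + rng.uniform(-amount, amount),
             y + rng.uniform(-amount, amount)) for x, y in points]


# Nominal values coded as integers (assumed coding), so many points coincide.
points = [(0, 1), (0, 1), (0, 1), (1, 0), (1, 0), (1, 1)]
hidden = len(points) - len(set(points))   # points drawn on top of another one
spread = jitter(points, amount=0.1)

print("hidden before jitter:", hidden)
print("distinct after jitter:", len(set(spread)))
```

The offset should stay well below the distance between adjacent codes, so that jittered points remain visibly grouped around their true value.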
References and Resources
References:
WEKA website:
http://www.cs.waikato.ac.nz/~ml/weka/index.html
WEKA Tutorial:
Machine Learning with WEKA: A presentation demonstrating all graphical
user interfaces (GUI) in Weka.
A presentation which explains how to use Weka for exploratory data
mining.
WEKA Data Mining Book:
Ian H. Witten and Eibe Frank, Data Mining: Practical Machine
Learning Tools and Techniques (Second Edition)
WEKA Wiki:
http://weka.sourceforge.net/wiki/index.php/Main_Page
Others:
Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques, 2nd ed.