Assessment, Synthesis and Analysis of Data Mining Tools
ABSTRACT
Data mining is an emerging field in many disciplines, and it is becoming increasingly necessary to find data
mining packages appropriate for a given analysis. This work discusses why MATLAB is a suitable package
for the task and what particular advantages it offers with regard to data mining.
MATLAB already supports various implementations of different stages of the data mining process,
including various toolboxes created by experts in the field. An initial conclusion of this study is that
MATLAB is a powerful and versatile package for fulfilling the requirements of the data mining process.
It is clear, however, that there is a need for the extension and synthesis of the existing tools. Three such
tools have been investigated fully; analysis of each tool is provided, with recommendations for further
extensions.
The synthesis of data mining tools outlined and demonstrated in this paper allows for a far more holistic
approach to data mining in MATLAB than has been available previously. This work ensures that data
mining becomes an increasingly straightforward task, as the appropriate tools for a given analysis
become apparent. As a logical extension of the synthesis provided, a brief discussion is given with
regard to the creation of a data mining toolbox for MATLAB.
The open-endedness of this study provides many areas for further investigation and further synthesis,
both within MATLAB and in the field of data mining as a whole.
KEYWORDS: Data mining, Neural Network, Fuzzy Clustering, Association Rule Miner.
1. PROBLEM STATEMENT
As data repositories grow, there is an increasing need for data mining tools that are able to glean
information from data sets not easily understood by traditional observation or experiment.
Data mining is the means used in extracting hidden knowledge from a data set; this would be knowledge
that is not readily obtained by traditional means such as queries or statistical analysis [Roiger and Geatz
2003]. Hidden knowledge can be used for classification and estimation of new instances and for
prediction of future events [Roiger and Geatz 2003].
MATLAB has been used in the development of data mining tools, but it remains to be established to what
extent the requirements of the data mining process are met by the tools currently available for MATLAB,
and hence by the package as a whole. It also remains to be established whether a toolbox dedicated to
data mining is necessary and feasible.
Hence, the aim of this paper is to provide not only an analysis and synthesis of selected data mining tools
available within MATLAB but, more importantly, a means to analyse and synthesise further data mining
tools, thus providing an increasingly holistic view of the data mining capabilities of MATLAB.
Essentially then, we wish to discover the extent to which each of a number of MATLAB data mining
tools is capable of carrying out the different stages of the data mining process. We wish to synthesise
these tools in order to bring greater clarity to the potential of MATLAB in the data mining arena and to
provide recommendations for further extension to these tools in light of this analysis and synthesis. As
we do this, we also aim to define clearly the methodology used in carrying out this work, so that it can be
applied in future work in this area.
In summary, our aim is to create a means for obtaining a holistic view of the data mining capabilities of
MATLAB. We accomplish this by setting forth the methodology of this process and by demonstrating it
through the investigation and synthesis of several data mining tools available for MATLAB.
2. RESEARCH OVERVIEW
Due to the broad and open-ended nature of this study, it is vital that we focus on a number of specific
tools and case studies. The data mining tools around which this paper revolves are: the Neural Network
Toolbox, a proprietary tool available from The MathWorks, distributors of MATLAB; the Fuzzy Clustering
and Data Analysis Toolbox [Balasko et al. 2005] and the Association Rule Miner and Deduction Analysis
tool [Malone 2003], which are both open source; and lastly an implementation of the C4.5 decision tree
algorithm [Woolf 2005]. A number of specific implementations of our process of synthesis are analysed.
This entails applying the same process in different case studies on separate data sets. The crucial
difference between these data sets is that the dependent attribute is continuous in nature for one of the
data sets and categorical for the others.
3. METHODOLOGY USED
The approach of this study can be broken down into two distinct phases. The first phase is that of
analysis and assessment, in which we aim to validate each tool's claims and to suggest possible
extensions to the tools. The second phase is that of synthesis, in which we use the tools in combination
and then provide a final analysis of the results of the process's implementation.
3.2. Synthesis
Synthesis has been defined as the process of designing or building a new concept for a specific purpose
by putting parts together in a logical way. This comes closest to what we are doing in this study, and the
methodology for synthesising the chosen tools is where the true potential of this work to affect the way
in which data mining is carried out in MATLAB becomes evident. At present no clear means exists for
the synthesis of the data mining tools in MATLAB, which is a central reason for MATLAB's limited
popularity in this field, particularly as a stand-alone tool. The tools currently available were designed
either with a specific application area in mind, to solve a specific problem, or merely out of a desire to
extend the capabilities of MATLAB, with little or no thought given to the impact on the data mining
capabilities of MATLAB as a whole. Data mining is an extremely broad field, and for MATLAB to
become a tool of choice in this field, a means must exist for the synthesis of the available tools.
The first stage in this process is to decide which tools are to be synthesised. As discussed, this paper
will focus on the synthesis of The MathWorks Neural Network Toolbox, Fuzzy Clustering and Data
Analysis Toolbox [Balasko et al. 2005] and ARMADA (Association Rule Miner and Deduction Analysis
tool) [Malone 2003].
The second stage in the process is to determine where and how these tools complement each other.
This stage is, once again, firmly rooted in the data mining process and has thus already been discussed
during the course of the individual tool assessments. For example, where the Neural Network Toolbox is
deficient with regard to the first two stages of the process, the Fuzzy Clustering Toolbox plays a crucial
role: although the clustering tool on its own does not produce a useful model, it is a necessary precursor
to obtaining the best possible results from the eventual neural network.
Essentially, the ways in which the tools complement one another are highlighted and the suggested
synthesis is then implemented.
Fig. 1: Broad and Detailed Methodology for the Synthesis of Data Mining Tools
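To make the workflow of Figure 1 concrete, the following is a minimal MATLAB sketch of the suggested
synthesis. The variable names (X, y) and the network architecture are illustrative assumptions, and the
built-in kmeans function from the Statistics Toolbox stands in here for the Fuzzy Clustering and Data
Analysis Toolbox routines; newff, train and sim are standard Neural Network Toolbox functions of the
era of this study.

    % Sketch of the synthesis workflow (illustrative only).
    % X: n-by-p matrix of financial attributes; y: n-by-1 vector of class
    % labels (1 = solvent, 0 = bankrupt), both assumed to be loaded already.

    % Stage 1: unsupervised structure check -- do two natural groups emerge?
    idx = kmeans(X, 2);                          % Statistics Toolbox

    % Stage 2: supervised modelling with the Neural Network Toolbox.
    net  = newff(minmax(X'), [5 1], {'tansig', 'logsig'});
    net  = train(net, X', y');                   % inputs and targets are column-wise
    pred = sim(net, X');                         % network classifications

    % Stage 3: association rule mining -- ARMADA is GUI-driven, so this
    % stage is carried out interactively rather than from a script.

Treating the clustering purely as a precursor, rather than as a model in its own right, mirrors the
complementarity described above; the case study below fills in the details of each stage.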
Fig. 3: The Results of KMeans Clustering on the Japanese Business Database with Additional
Boundary Lines (---) Included
The first thing to observe from the above result is the choice to partition the data into two clusters. This
choice is a natural one, as we might expect the data set to split roughly into clusters of bankrupt and
solvent businesses. As can be seen, this is indeed what has happened: whilst the above clustering
undoubtedly overlaps in terms of its classification, there are clearly two well-defined clusters, which can
be separated by the boundary lines illustrated. These boundary lines were drawn in by hand and
represent, more or less, what a neural network looks for when creating a model that can classify a given
data set.
The presence of this boundary thus indicates that concept structures are very much present in the data
set and that a supervised learning model is likely to perform very well on it. The first phase of the data
mining process can thus be completed as a result of the above clustering: our decision is to go ahead
with the creation of a neural network.
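The clustering behind Figure 3 can be reproduced along the following lines. This is a sketch only: it
assumes the two plotted financial attributes are held in a matrix named data (one row per business) and
uses the Statistics Toolbox functions kmeans and gscatter in place of the Fuzzy Clustering and Data
Analysis Toolbox's own routines.

    % k-means clustering of the Japanese Business data into two groups,
    % as visualised in Figure 3 (variable names are illustrative).
    k = 2;                                   % expected groups: bankrupt / solvent
    [idx, centres] = kmeans(data, k);        % idx: cluster membership per business

    gscatter(data(:,1), data(:,2), idx);     % one marker style per cluster
    hold on
    plot(centres(:,1), centres(:,2), 'kx', 'MarkerSize', 12, 'LineWidth', 2);
    hold off
    % The boundary lines in Figure 3 were added by hand, not computed here.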
A further observation concerns the spread of the given clusters. In this case, despite the spread of the
cluster which we have defined as representing "bankrupt" businesses, the presence of the well-defined
boundary indicates that this is not likely to be a problem. In fact, those cases which might be regarded as
outliers are likely to be the cases most easily classified by the network, as they lie furthest from the
decision boundary. The outliers we would look to remove are those that might cause the network trouble
in its learning process. For example, if a business classified as bankrupt fell inside the decision boundary
given, it would be advisable to remove that instance.
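One way in which this outlier screen could be automated is sketched below. It flags instances whose
cluster assignment disagrees with their known class, assuming idx comes from the clustering above and
y holds the 0/1 class labels; because cluster numbers from k-means are arbitrary, each cluster is first
mapped to its dominant class.

    % Flag instances lying on the "wrong" side of the clustering (illustrative).
    lbl = zeros(size(idx));
    for c = 1:2
        lbl(idx == c) = mode(y(idx == c));   % dominant class within cluster c
    end
    suspect   = (lbl ~= y);                  % e.g. a bankrupt business in the solvent cluster
    dataClean = data(~suspect, :);           % candidate training data with such cases removed
    yClean    = y(~suspect);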
A last observation is that the spread of the clusters is limited, particularly in the case of those businesses
classified as "solvent". The reason for this is clear when we take into account the fact that the range of
the data attributes is limited, confirming once again that there is no need to normalise this data before
constructing our neural network. In the case of the Japanese Business data we can therefore conclude,
both from an examination of the data set and from the given clusters, that data pre-processing is not
necessary prior to mining this data.
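The range check behind this conclusion takes only a line or two, as sketched here with the same
illustrative data matrix; premnmx (or the later mapminmax) from the Neural Network Toolbox would be a
natural choice if rescaling were required.

    % Inspect the per-attribute spread to decide whether normalisation is needed.
    ranges = max(data) - min(data)           % small, comparable ranges => no rescaling
    % If the ranges differed by orders of magnitude, something like
    %   [pn, minp, maxp] = premnmx(data');   % rescales each attribute to [-1, 1]
    % could be applied before training the network.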
Fig. 4: Neural Network Results for the Japanese Business Database Training (r2 = 1.000000)
Figure 4 above shows both the network targets and their respective classifications. Each Japanese business
in the training set (which consists of 2/3 of the entire database) is represented by both a blue circle and a
corresponding green star. The blue circles represent the actual state of the given businesses, either
"solvent" (1) or "bankrupt" (0), as represented on the vertical axis; these are the target values which the
network attempts to predict. The green stars represent the network's predictions. It is clear that the error
of the network has been reduced significantly, as all of the businesses in the training set are predicted
correctly. For the sake of completeness, it should be noted that the r-squared value for the training set is
exactly 1.0, indicating a perfect fit. These results are not particularly unusual; the real test comes in
examining the network's ability to infer from what it "knows" in order to predict unseen cases. This is
represented by the results on the test set, depicted in Figure 5.
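Before turning to those test results, the following hedged sketch shows how the training run behind
Figure 4 might be set up. The 2/3 split and the 0/1 targets come from the discussion above; the
hidden-layer size, training parameters and variable names are illustrative assumptions, and r-squared is
computed here as the ordinary coefficient of determination, which may differ slightly from how the
toolbox reports it.

    % Train a feedforward network on the first 2/3 of the records (illustrative).
    n      = size(data, 1);
    nTrain = round(2 * n / 3);
    Ptrain = data(1:nTrain, :)';        Ttrain = y(1:nTrain)';
    Ptest  = data(nTrain+1:end, :)';    Ttest  = y(nTrain+1:end)';

    net = newff(minmax(Ptrain), [5 1], {'tansig', 'logsig'}, 'trainlm');
    net.trainParam.epochs = 200;        % assumed stopping criterion
    net = train(net, Ptrain, Ttrain);

    % Training-set fit, as plotted in Figure 4.
    Ytrain  = sim(net, Ptrain);
    r2train = 1 - sum((Ttrain - Ytrain).^2) / sum((Ttrain - mean(Ttrain)).^2)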
Fig. 5: Neural Network Results for the Japanese Business Database Testing (r2 = 0.777778)
The results of testing are of greatest interest to us, as they show the potential of the given model to
predict accurately the financial position of a business based on the given financial statistics. The
remaining 1/3 of the database was used for the purposes of testing. No validation set was used, as the
database was too small to warrant reducing the training set, which needs to be sufficiently large for the
model to be able to generalise well.
Figure 5 shows that one of the businesses that did in fact go bankrupt was predicted as being solvent.
The r-squared value for the test data is 0.78. This is an acceptable rate of error for such a model, and we
can thus conclude that we have been successful in creating a neural network that has mined the given
data set and is able to classify unseen instances within the given domain with a high degree of
confidence.
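Continuing the sketch above, the corresponding evaluation on the held-out third would look as follows;
the 0.5 threshold used to turn the continuous network outputs into class predictions is an assumption.

    % Test-set evaluation, as in Figure 5 (continuation of the previous sketch).
    Ytest  = sim(net, Ptest);
    r2test = 1 - sum((Ttest - Ytest).^2) / sum((Ttest - mean(Ttest)).^2)

    predicted = double(Ytest >= 0.5);    % threshold the logsig outputs
    accuracy  = mean(predicted == Ttest) % fraction of unseen cases classified correctly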
It is important to note the 'minimum support' required, as this has a greater effect on the number of rules
generated than 'minimum confidence'. If 'minimum support' is set to 1, ARMADA cannot mine the entire
file; if only 25% of the file is mined with a support of 1, the number of rules generated is well over 2000.
The last thing to note is that the criteria chosen here are relatively low and will allow for the extraction of
a reasonable number of rules.
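For clarity, the two criteria can be expressed as follows. The sketch uses hypothetical indicator vectors:
hasA marks the records containing a candidate antecedent and hasB those containing the consequent; the
52-record length matches the totals reported in Figure 8, but the particular indices, and the minimum
support of 2, are illustrative assumptions.

    % Support and confidence of a candidate rule A -> B (hypothetical example).
    hasA = false(52, 1);  hasA([3 9 17 40])    = true;   % records containing the antecedent
    hasB = false(52, 1);  hasB([3 9 17 40 41]) = true;   % records containing the consequent

    support    = sum(hasA & hasB);                       % ARMADA reports this as a count (here 4)
    confidence = 100 * support / sum(hasA);              % per cent (here 100)

    minSupport    = 2;                                   % illustrative threshold
    minConfidence = 20;                                  % per cent, as set for this mining run
    keep = (support >= minSupport) && (confidence >= minConfidence);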
Fig. 6: Broad Rule Mining Criteria for the Japanese Business Database
It is important to note from the above goals that any entry within the database, other than those which
have been specified as consequents, can be used as an antecedent of any given rule. It should also be
noted that 1 and -1 are the only consequents which will be generated by the rule miner. This is seen most
clearly in Table 1 below. The results of the given mining run are summarised in Figure 8 and listed in
Table 1, obtained using the "Dump To Cmd Win" button seen in Figure 8.
Rule                     Support (count)   Confidence (%)
0.064        -> 1        4                 100
0.034        -> 1        4                 100
0.096        -> 1        3                 100
0.046        -> 1        3                 100
0.023        -> 1        3                 100
0.019        -> 1        3                 100
0.374        -> 1        2                 100
0.273        -> 1        2                 100
0.212        -> 1        2                 100
0.197        -> 1        2                 100
0.112        -> 1        2                 100
0.106        -> 1        2                 100
0.089        -> 1        2                 100
0.083        -> 1        2                 100
0.07         -> 1        2                 100
0.065        -> 1        2                 100
0.049        -> 1        2                 100
0.047        -> 1        2                 100
0.045        -> 1        2                 100
0.043        -> 1        2                 100
0.041        -> 1        2                 100
0.038        -> 1        2                 100
0.027        -> 1        2                 100
0.02         -> 1        2                 100
0.013        -> 1        2                 100
0.007        -> 1        2                 100
0.003        -> 1        2                 100
0.001        -> 1        2                 100
-0.01        -> 1        2                 100
0.079        -> 1        2                 66.67
0.099        -> 1        2                 50
0.065 0.07   -> 1        2                 100
0.046 0.049  -> 1        2                 100
0.023 0.374  -> 1        2                 100
0.019 0.112  -> 1        2                 100
-0.0539      -> -1       3                 50
0.0785       -> -1       2                 100
Table 1: Association Rules for the Japanese Business Database
(1=Solvent, -1=Bankrupt)
The first rule in this rule set (0.064 -> 1, Sup=4, Conf=100) can be interpreted as follows: the presence of
the numerical attribute value 0.064 implies that a given business is solvent (rule output of 1). This rule is
supported by 4 instances and contradicted by none, giving 100% confidence in its accuracy. Stronger
rules are those with two attributes producing a given outcome; for example, 0.065 0.07 -> 1, Sup=2,
Conf=100.
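As a quick arithmetic check of the first rule, using only the counts that ARMADA reports:

    % Confidence of the rule 0.064 -> 1 from its reported counts.
    nBoth       = 4;                          % records containing both 0.064 and the "solvent" outcome
    nAntecedent = 4;                          % records containing 0.064 (none contradict the rule)
    confidence  = 100 * nBoth / nAntecedent   % = 100, matching Conf=100 in Table 1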
An initial observation from these results is that many more rules are generated for solvent businesses (1)
than for those that went bankrupt (-1). This is a likely reason for the misclassification of one of the
bankrupt businesses by our neural network, as there are fewer rules from which it can "learn" the
characteristics of a bankrupt business. Another reason for this misclassification may be the fact that the
network had more solvent businesses to "learn" from than bankrupt ones; Figure 8 summarises this
information by giving the counts of solvent and bankrupt businesses as 27 and 25 respectively. A solution
to this problem would be to obtain more data on Japanese businesses and to train a new network on this
data. If this were possible, it is likely not only that more association rules would be generated for
classifying bankrupt businesses, but also that the neural network used for classification would perform
even better than it currently does.
Though the interpretation of these rules is difficult because of their numerical nature, it is clear that this
information is likely to be exactly what a neural network is looking for in its process of learning. The
strong relationships within the data are again seen in the fact that most of the rules have been generated
with a confidence of 100%, leaving less room for confusion in the classifications made by our neural
network. The strength of the relationships in the data is made even more significant when we remember
that our 'minimum confidence' was set at 20%, yet the lowest confidence among the association rules
generated is 50%.
In conclusion, these rules represent the kind of generalisations that a neural network needs to "learn" in
order to solve a classification problem and, in particular, to classify new, unseen instances. If few such
rules existed, the best we could hope for would be that the neural network "memorise" the training set.
As it is, this has not happened, and we can conclude that we have been successful in creating a model
which is able to generalise over new cases and which is likely to predict whether a business is bankrupt
or solvent with a very high level of accuracy.
Fig. 9: Broad and Detailed Methodology for the Synthesis of Data Mining Tools in MATLAB
In conclusion, we have been able to see the data mining capabilities of MATLAB in a far more holistic
light than has previously been possible. The process of synthesis outlined will enable MATLAB to be
used far more extensively in this field in future, particularly as the process is extended to other tools and
case studies. Possibilities for extending this work are outlined below.
References
1. Adriaans, P. and Zantige, D., Data Mining, Addison Wesley, Harlow, England, 1997, pp. 39-42, 69-81, 91, 117, 127.
2. Balasko, B., Abonyi, J. and Feil, B., Fuzzy Clustering and Data Analysis Toolbox for Use with Matlab, University of Veszprem, Veszprem, Hungary, 2005.
3. Burton, M., AM 2.2 Mathematical Programming, Rhodes University, Grahamstown, South Africa, 2006, pp. 1-2.
4. Burton, M., M 4.4 Neural Networks, Rhodes University, South Africa, 2006, pp. 102-111.
5. Dwinnell, W., Modeling Methodology 5: Mathematical Programming Languages, 1998.
6. Hand, D., Mannila, H. and Smyth, P., Principles of Data Mining, MIT Press, Cambridge, Massachusetts, 2001, pp. 1-24.
7. KDnuggets, KDnuggets Past Polls, 2006.
8. Malone, J., ARMADA Association Rule Miner and Deduction Analysis, 2003. Accessed: 11 May 2006.
9. Mangasarian, O.L. and Wolberg, W.H., Cancer Diagnosis via Linear Programming, SIAM News, Volume 23, Number 5, September 1990, pp. 1, 18.
10. Murphy, K., Bayes Net Toolbox for MATLAB, 2005.
11. Paola, B. et al., Tool Selection Methodology in Data Mining, 2006.
12. Pyle, D., Data Preparation for Data Mining, Morgan Kaufmann, San Francisco, California, 1999, p. 118.
13. Roiger, R.J. and Geatz, M.W., Data Mining: A Tutorial Based Primer, Addison Wesley, USA, 2003, pp. 7-27, 34-41.
14. Simonoff, J.S., Analyzing Categorical Data, Springer-Verlag, New York, 2003.
15. Vesanto, J., Himberg, J., Alhoniemi, E. and Parhankangas, J., SOM Toolbox for MATLAB 5, Helsinki University of Technology, Helsinki, Finland, 2000.
16. Vlachos, M., A Practical Time-Series Tutorial with MATLAB, Hawthorne, NY, 2005, pp. 7-12.
17. Woolf, R.J., Data Mining Using MATLAB, University of Southern Queensland, Queensland, Australia, 2005, pp. 1-3, 10-12, 59.