Assessment, Synthesis and Analysis of Data Mining Tools

The synthesis of data mining tools outlined and demonstrated in this paper allows for a far more holistic approach to data mining in MATLAB than has been available previously. This work ensures that data mining becomes an increasingly straightforward task, as the appropriate tools for a given analysis become apparent.

ASSESSMENT, SYNTHESIS AND ANALYSIS OF DATA MINING TOOLS

Ritu Valia*, Rajeev Kumar#


*Dept. of Computer Science, Dravidian University, Kuppam
#Dept. of Computer Science, APS University, Rewa

ABSTRACT
Data mining is an emerging field in many disciplines. It is becoming increasingly necessary to find data
mining packages appropriate for a given analysis. This work discusses why MATLAB is a suitable package
for data mining and what particular advantages it offers in this regard.
MATLAB already supports various implementations of different stages of the data mining process,
including various toolboxes created by experts in the field. An initial conclusion of this study is that
MATLAB is a powerful and versatile package for fulfilling the requirements of the data mining process.
It is clear, however, that there is a need for the extension and synthesis of the existing tools. Three such
tools have been investigated fully; analysis of each tool is provided, with recommendations for further
extensions.
The synthesis of data mining tools outlined and demonstrated in this paper allows for a far more holistic
approach to data mining in MATLAB than has been available previously. This work ensures that data
mining becomes an increasingly straightforward task, as the appropriate tools for a given analysis
become apparent. As a logical extension of the synthesis provided, a brief discussion is given with
regard to the creation of a data mining toolbox for MATLAB.
The open-endedness of this study provides many areas for further investigation and further synthesis,
both within MATLAB and in the field of data mining as a whole.

KEYWORDS: Data mining, Neural Network, Fuzzy Clustering, Association Rule Miner.

1. PROBLEM STATEMENT
As data repositories grow, there is an increasing need for data mining tools that are able to glean
information from data sets that are not easily understood by traditional observation or experiment.
Data mining is the means used in extracting hidden knowledge from a data set; this would be knowledge
that is not readily obtained by traditional means such as queries or statistical analysis [Roiger and Geatz
2003]. Hidden knowledge can be used for classification and estimation of new instances and for
prediction of future events [Roiger and Geatz 2003].
MATLAB has been used in the development of data mining tools, but it remains to be established to what
extent the requirements of the data mining process are met by the tools currently available for MATLAB
and hence by the package as a whole. The necessity and feasibility of creating a toolbox dedicated to
data mining also need to be established.
Hence, the aim of this paper is to provide, not only an analysis of selected data mining tools available
within MATLAB and a synthesis of these tools, but more importantly, a means to analyse and synthesise
further data mining tools, thus providing an increasingly holistic view of the data mining capabilities of
MATLAB.
Essentially then, we wish to discover the extent to which each of a number of MATLAB data mining
tools is capable of carrying out the different stages of the data mining process. We wish to synthesise
these tools in order to bring greater clarity to the potential of MATLAB in the data mining arena and to
provide recommendations for further extension to these tools in light of this analysis and synthesis. As
we do this, we also wish to define clearly the methodology used in carrying out this work, so that it
might be used in future work in this area.
In summary, our aim is to create a means for obtaining a holistic view of the data mining capabilities of
MATLAB. We will accomplish this by setting forth the methodology of this process and by
demonstrating this methodology by investigating and synthesising several data mining tools available for
MATLAB.

2. RESEARCH OVERVIEW
Due to the broad and open-ended nature of this study, it is vital that we focus on a number of specific
tools and case studies. The data mining tools around which this paper revolves are: the Neural
Network Toolbox, a proprietary tool available from The MathWorks, distributors of MATLAB; the
Fuzzy Clustering and Data Analysis Toolbox [Balasko et al. 2005] and the Association Rule Miner and
Deduction Analysis tool [Malone 2003], which are both open source; and lastly an implementation of the
C4.5 decision tree algorithm [Woolf 2005]. A number of specific implementations of our process of
synthesis are analysed. This entails the use of different case studies on separate data sets using the same
process. The crucial difference between these data sets is that the dependent data attribute is
continuous in nature for one of the data sets and categorical for the others.

3. METHODOLOGY USED
The approach of this study can be broken down into two distinct phases. The first phase was that of
analysis and assessment of each tool, where we aimed to validate the tool’s claims and to suggest
possible extensions to the tools. The second phase was that of synthesis, where we used the tools in
combination and then provided a final analysis of the results of the process’s implementation.

3.1. Tool Assessment


This phase was essentially an extended documentation of the tool. It should be clear that some
experimentation with the tools was necessary in order to obtain these results. The case studies carried out
were the best examples of such work. Provision of the details of every experiment was not possible
within this work.
The first stage of the assessment process was a critical evaluation of the claims of the tool. These claims
were evaluated in terms of that stage of the data mining process to which they pertain. Once the claims
had been validated, or otherwise, suggestions were made as to how the tool could best be improved so as
to fulfil its purpose more completely.
Second, the applications of the tool were briefly explored, that is, the stage or stages of the data mining
process to which the tool was best suited were highlighted. The case studies would serve as specific
examples of these applications and would be investigated in greater depth so as to attain examples of
implementations of the tools, which obtained meaningful results.
The final step in the assessment process was to further clarify ideas pertaining to the synthesis of the
tool. Naturally each tool was suited to a particular domain and was constrained to fulfill certain stages of
the data mining process more completely than others. This fact highlights both the need for synthesis and
that which makes synthesis possible.
Essentially, this stage of our methodology was aimed at focusing us on the area of greatest potential for
the given tool so as to streamline the synthesis process.

3.2. Synthesis
Synthesis has been defined as the process of designing or building a new concept for a specific purpose,
by putting parts together in a logical way. This comes closest to what we were doing in this study, and
the methodology for synthesising the chosen tools is where the true potential of this work to impact
the way in which data mining is carried out in MATLAB becomes evident. At present no clear means
exists for the synthesis of the data mining tools in MATLAB, which is a central reason for
MATLAB’s limited popularity in this field, particularly as a stand-alone tool. The tools currently
available were designed either with a specific application area in mind, to solve a specific problem, or
merely out of a desire to extend the capabilities of MATLAB, with little or no thought to the impact on
the data mining capabilities of MATLAB as a whole. Data mining is an extremely broad field and, for
MATLAB to become a tool of choice in this field, a means must exist for the synthesis of the available tools.
The first stage in this process was to decide which tools are to be synthesised. As discussed, this paper
will focus on the synthesis of The MathWorks Neural Network Toolbox, Fuzzy Clustering and Data
Analysis Toolbox [Balasko et al. 2005] and ARMADA (Association Rule Miner and Deduction Analysis
tool) [Malone 2003].
The second stage in the process was to determine where and how these tools complement each other.
This stage was, once again, firmly rooted in the data mining process and will thus already have been
discussed during the course of the individual tool assessments. For example, where the Neural Network
Toolbox may be deficient with regard to the first two stages of the process, the Fuzzy Clustering Toolbox
plays a crucial role; although the clustering tool on its own does not produce a useful model, it is a
necessary precursor to obtaining the best possible results from the potential neural network.
Essentially, the ways in which the tools complement one another were highlighted and the suggested
synthesis was then implemented.

3.3. Final Analysis


Lastly, the process as a whole needed to be evaluated. This was crucial in providing a holistic view of
what had been achieved by the synthesis for data mining in MATLAB as a whole. The results of this last
stage were evaluated in order to determine what had been gained, and the new process can then, in many
ways, be viewed as a “tool” in its own right.
The process was thus a circular one, as illustrated in Figure 1 below. This makes it all the more clear
how a complete overview of MATLAB could be attained once all of the currently available data mining
tools had been assessed and synthesised. MATLAB would gain from this in many important ways. Most
crucially perhaps, it would be clear to what extent MATLAB is capable of carrying out the data
mining process as a whole, and if it is lacking in any area, this too would become evident through the
extension of this work.

Fig. 1: Broad and Detailed Methodology for the Synthesis of Data Mining Tools

4. IMPLEMENTATION AND RESULTS


The best possible combination of our three tools was an initial analysis using the Fuzzy Clustering and
Data Analysis Toolbox [Balasko et al. 2005], followed by the creation of a neural network using The
Mathworks Neural Network Toolbox, if this was deemed appropriate, and finally an analysis of results
using ARMADA [Malone 2003]. Essentially then, the decision phase of our data mining process would
be carried out entirely using unsupervised clustering. The data preparation phase would be shared by the
fuzzy clustering tool and the data preparation script of our neural network, which would be used to
normalise the data if necessary. The model building phase was given entirely to the Neural Network
Toolbox and the interpretation of results was handled by the Association Rule Miner, complemented by
the simulation script of our neural network. This process is depicted in Figure 2. The process was
carried out on the Japanese Business database and the results were then compared with those obtained
from the Isomerisation database.

Fig. 2: Synthesis Implemented

4.1. Fuzzy Clustering and Data Analysis Toolbox


The clustering tool was applied to the entire Japanese Business database and the results of that clustering
are shown in Figure 3 below. The validity measures for this clustering are not discussed, as the
results can be seen by inspection to be more than adequate.

Fig. 3: The Results of KMeans Clustering on the Japanese Business Database with Additional
Boundary Lines (---) Included
The first thing to observe from the above result is the choice to partition the data into two clusters.
This choice was a natural one, as we might expect the data set to split roughly into clusters of bankrupt
and solvent businesses. As can be seen, this is what happened, and whilst the two clusters
undoubtedly overlap to some degree, it is clear that there exist two well defined clusters,
which can be separated by the boundary lines illustrated. These boundary lines were drawn in by hand
and represent more or less what a neural network looks for when creating a model which can
classify a given data set.
The boundary present thus indicates that concept structures are very much present in the data
set and that a supervised learning model is likely to perform very well on this data set. The first phase
of the data mining process can thus be completed as a result of the above clustering. Our decision was to
go ahead with the creation of a neural network.
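The clustering step described above can be sketched outside MATLAB. The following is a minimal
illustration in Python of plain k-means with two clusters; it is not the Fuzzy Clustering and Data
Analysis Toolbox, and the function name kmeans, the two-dimensional points and the starting centroids
are all hypothetical stand-ins for the Japanese Business data, which is not reproduced here.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain k-means (Lloyd's algorithm) on a list of equal-length tuples."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Hypothetical data: two loose groups standing in for "solvent" and "bankrupt".
points = [
    (0.06, 0.10), (0.03, 0.12), (0.09, 0.08), (0.05, 0.11),
    (-0.30, -0.20), (-0.25, -0.35), (-0.40, -0.10), (-0.10, -0.30),
]
labels, centroids = kmeans(points, centroids=[points[0], points[4]])
print(labels)
print(centroids)
```

If the two recovered groups line up with the known class labels, as they do for this toy data, that
is the kind of evidence the text uses to justify proceeding to a supervised model.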
A further observation concerns the spread of the given clusters. In this case, despite the
spread of the cluster which we have defined as representing “bankrupt” businesses, the presence of the
well defined boundary indicates that this is not likely to be a problem. In fact, those cases which might
possibly be classified as outliers are likely to be those cases most easily classified by the network, as
they lie furthest from the decision boundary. The outliers we are looking to remove are those that might
cause the network trouble in its learning process. For example, if there were a bankrupt
business that fell within the decision boundary given, it would be advisable to remove that instance.
A last observation is that the spread of the clusters is limited, particularly in the case of those
businesses which have been classified as “solvent”. The reason for this is clear when we take into
account the fact that the range of the data attributes is limited, confirming once again that there is
no need to normalise this data before constructing our neural network. In the case of the Japanese
Business data we can therefore conclude, both from an examination of the data set and the given
clusters, that data pre-processing is not necessary prior to mining this data.
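Where the attribute ranges are not limited in this way, a normalisation step would be needed. The sketch
below, in Python, shows one common choice: inspecting the range of each attribute and rescaling it
linearly to [-1, 1]. The paper does not state which normalisation its data preparation script applies, so
the min-max scheme, the target range and the names column_ranges and minmax_scale are assumptions made
purely for illustration, as is the small data set.

```python
def column_ranges(rows):
    """Return (min, max) for every attribute (column) of the data set."""
    return [(min(col), max(col)) for col in zip(*rows)]


def minmax_scale(rows, lo=-1.0, hi=1.0):
    """Linearly rescale every column to [lo, hi] (assumed scheme, see lead-in)."""
    ranges = column_ranges(rows)
    scaled = []
    for row in rows:
        new_row = []
        for x, (cmin, cmax) in zip(row, ranges):
            if cmax == cmin:
                new_row.append((lo + hi) / 2)  # constant column: map to the midpoint
            else:
                new_row.append(lo + (hi - lo) * (x - cmin) / (cmax - cmin))
        scaled.append(new_row)
    return scaled


# Hypothetical data: the second attribute has a much wider range than the first.
rows = [[0.0, 10.0], [0.5, 20.0], [1.0, 30.0]]
print(column_ranges(rows))
print(minmax_scale(rows))
```

Inspecting column_ranges first mirrors the reasoning in the text: if every attribute already lies in a
similar, narrow range, the rescaling step can be skipped.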

4.2. Neural Network Toolbox


The next phase of our process was to create the neural network which would be used to classify different
businesses as being either “bankrupt” or “solvent”. The network structure chosen used 8 neurons in the
input layer (corresponding to the 8 input attributes), two hidden layers containing 14 and 12 neurons
respectively and a single neuron in the output layer. The number of neurons used in the hidden layers
was obtained by trial and error. A general rule with regard to the choice of network architecture is that
the complexity of the database (that is, the number of attributes being fed into the network, which is 8 in
this case) determines the complexity (number of neurons used) of the neural network being constructed.
The output neuron gives the classification of either “bankrupt” or “solvent”. Lastly, the transfer function
used in the hidden layers is the MATLAB tansig function, which is equivalent to the hyperbolic tangent,
tanh. The reason for this choice was that the attribute values were both negative and positive; tansig
produces outputs in the range (-1, 1) and so preserves the sign of its input, whereas a logarithmic
sigmoid function, such as logsig, produces outputs only in the range (0, 1). The results of the trained
model are shown in Figure 4 below and the testing of this model is shown in Figure 5.
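The difference between the two transfer functions, and the size of the chosen architecture, can be
checked with a few lines of Python. The definitions below follow the standard formulas for tansig and
logsig; the helper parameter_count is a hypothetical name introduced here, and it assumes a fully
connected network with one bias per neuron, which the paper does not state explicitly.

```python
import math


def tansig(n):
    """Hyperbolic tangent sigmoid: output in (-1, 1), same sign as the input."""
    return 2.0 / (1.0 + math.exp(-2.0 * n)) - 1.0


def logsig(n):
    """Logarithmic sigmoid: output in (0, 1), always positive."""
    return 1.0 / (1.0 + math.exp(-n))


def parameter_count(layer_sizes):
    """Weights plus biases of a fully connected feed-forward network (assumed layout)."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


layers = [8, 14, 12, 1]  # input, two hidden layers, output, as described in the text
for n in (-2.0, 0.0, 2.0):
    print(n, tansig(n), logsig(n))
print(parameter_count(layers))
```

Under these assumptions the 8-14-12-1 network has 319 adjustable parameters, which is worth keeping in
mind when interpreting a perfect fit on a small training set.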

Fig. 4: Neural Network Results for the Japanese Business Database Training (r² = 1.000000)

Figure 4 above shows both network targets and their respective classifications. Each Japanese business
in the training set (which consists of 2/3 of the entire database) is represented by both a blue circle and
a corresponding green star. The blue circles represent the actual state of the given businesses, either
“solvent” (1) or “bankrupt” (0), as represented on the vertical axis; these were the target
values which the network was attempting to predict. The green stars represent the network’s predictions.
It is clear that the training error of the network was reduced to a minimum, as all of the businesses in
the training set were predicted correctly. For the sake of completeness it should be noted that the
r-squared value for the training set was exactly 1.0, indicating a perfect fit. These results are not
particularly unusual; the real test comes in examining the network’s ability to infer from what it “knows”
in order to predict unseen cases. This is represented by the results of the test set, depicted in Figure 5.

Fig. 5: Neural Network Results for the Japanese Business Database Testing (r² = 0.777778)
The results of testing were of greatest interest to us as they show the potential of the given model to
accurately predict the financial position of a business based on the given financial statistics. The
remaining 1/3 of the database was used for the purposes of testing. No validation set was used as the
database was too small to warrant reducing the training set, which needs to be sufficiently large in order
for the model to be able to generalize well.
Figure 5 shows that one of the businesses that did in fact go bankrupt was predicted as being solvent.
The r-squared value for the test data is 0.78. This was an acceptable rate of error for such a model
and we can thus conclude that we have been successful in creating a neural network that has mined the
given data set and is able to classify unseen instances within the given domain with a high degree of
confidence.
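To make the r-squared figure concrete, the following Python sketch computes the coefficient of
determination for a set of binary targets and predictions. The paper does not give the composition of
the test set or the exact formula used, so everything here is an assumption: a hypothetical test set of
18 businesses (9 solvent, 9 bankrupt), predictions rounded to 0 or 1, and a single bankrupt business
predicted as solvent. These values were chosen only because they happen to reproduce a value of about
0.78.

```python
def r_squared(targets, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


# Hypothetical test set: 9 solvent (1) and 9 bankrupt (0) businesses.
targets = [1] * 9 + [0] * 9
predictions = list(targets)
predictions[9] = 1  # one bankrupt business wrongly predicted as solvent

print(round(r_squared(targets, predictions), 6))
print(r_squared(targets, targets))  # a perfect fit, as on the training set
```

With these assumed numbers, SS_res is 1 and SS_tot is 4.5, so a single error in 18 cases is already
enough to pull r-squared down to roughly 0.78.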

4.3. Association Rule Miner


The final phase of our process was to use ARMADA in the construction of association rules, in order to
give us a better idea of what the neural network which we had constructed had “learnt”. The central
motivation for this was to aid us in the interpretation of the results attained by our
neural network. This was the most frustrating part of the mining process, largely due to the tool’s poor
documentation. It took many separate mining attempts to obtain the following results, with many
unhelpful error messages having to be circumvented by trial and error. One example of
such a difficulty was the discovery that ARMADA does not recognise 0 as a consequent or even as a
numeric value in the mining of association rules. We were thus forced to change all the consequents for
the bankrupt businesses to -1 and then perform our mine on this data in order to obtain results which
dealt with both bankrupt and solvent cases.
A screenshot of the criteria used in mining the Japanese Business database is provided in Figure 6.
Figure 7 shows the goals which were built in order to perform the required supervised data mining
session. The rules generated are summarised in Figure 8 and listed in Table 1.
The broad criteria used in mining the Japanese Business database were as follows:

• Minimum Confidence = 20% (% of times that LHS=>RHS is true)
• Minimum Support = 2 (no. of times the given rule appears)
• Mined Using Built Goals (required for supervised rule mining; see Figure 7)
• Mined Using Entire File (database is relatively small)

It is important to note the ‘minimum support’ required, as this has a greater effect on the number of rules
generated than ‘minimum confidence’. If ‘minimum support’ is set to 1, ARMADA cannot mine the
entire file. If only 25% of the file is mined with a support of 1, the number of rules generated is well over
2000. Lastly, these criteria are relatively low and allow for the extraction of a
reasonable number of rules.
Fig. 6: Broad Rule Mining Criteria for the Japanese Business Database

Fig. 7: Rule Mining Criteria for the Japanese Business Database - Goals Built (1=Solvent, -1=Bankrupt)

Fig. 8: Results of Mining the Japanese Business Database

It is important to note from the above goals that any entry within the database, other than those which
have been specified as consequents, can be used as antecedents of any given rule. It should also be noted
that 1 and -1 are the only consequents which will be generated by the rule miner. This is seen most
clearly in Table 1 below. The results of the given mine are summarised in Figure 8 and listed in Table 1
using the “Dump To Cmd Win” button, seen in Figure 8.
Rule Support Confidence
0.064 ->1 Sup=4 Conf=100
0.034 ->1 Sup=4 Conf=100
0.096 ->1 Sup=3 Conf=100
0.046 ->1 Sup=3 Conf=100
0.023 ->1 Sup=3 Conf=100
0.019 ->1 Sup=3 Conf=100
0.374 ->1 Sup=2 Conf=100
0.273 ->1 Sup=2 Conf=100
0.212 ->1 Sup=2 Conf=100
0.197 ->1 Sup=2 Conf=100
0.112 ->1 Sup=2 Conf=100
0.106 ->1 Sup=2 Conf=100
0.089 ->1 Sup=2 Conf=100
0.083 ->1 Sup=2 Conf=100
0.07 ->1 Sup=2 Conf=100
0.065 ->1 Sup=2 Conf=100
0.049 ->1 Sup=2 Conf=100
0.047 ->1 Sup=2 Conf=100
0.045 ->1 Sup=2 Conf=100
0.043 ->1 Sup=2 Conf=100
0.041 ->1 Sup=2 Conf=100
0.038 ->1 Sup=2 Conf=100
0.027 ->1 Sup=2 Conf=100
0.02 ->1 Sup=2 Conf=100
0.013 ->1 Sup=2 Conf=100
0.007 ->1 Sup=2 Conf=100
0.003 ->1 Sup=2 Conf=100
0.001 ->1 Sup=2 Conf=100
-0.01 ->1 Sup=2 Conf=100
0.079 ->1 Sup=2 Conf=66.67
0.099 ->1 Sup=2 Conf=50
0.065 0.07 ->1 Sup=2 Conf=100
0.046 0.049 ->1 Sup=2 Conf=100
0.023 0.374 ->1 Sup=2 Conf=100
0.019 0.112 ->1 Sup=2 Conf=100
-0.0539 -> -1 Sup=3 Conf=50
0.0785 -> -1 Sup=2 Conf=100
Table 1: Association Rules for the Japanese Business Database
(1=Solvent, -1=Bankrupt)

The first rule in this rule set (0.064 ->1 Sup=4 Conf=100) can be interpreted as follows. The presence of
the numerical attribute value 0.064 implies that a given business is solvent (rule output of 1). This rule
is supported by 4 instances and contradicted by none, leading to 100% confidence in the accuracy of this rule.
Stronger rules are those with two attributes producing a given outcome; for example, 0.065 0.07 ->1
Sup=2 Conf=100.
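The support and confidence figures in Table 1 can be reproduced in principle with a simple count. The
Python sketch below mines single-antecedent rules of the form "value -> class" using the criteria listed
above (minimum support 2, minimum confidence 20%). It is not ARMADA's implementation; the function name
and the seven records are hypothetical and were constructed only so that the counts resemble some of the
rules in the table.

```python
from collections import Counter


def mine_single_antecedent_rules(records, min_support=2, min_confidence=20.0):
    """records: list of (attribute_values, label) pairs.

    Support is the number of records in which the value occurs together with
    the label; confidence is that count as a percentage of all records that
    contain the value.
    """
    value_counts = Counter()
    rule_counts = Counter()
    for values, label in records:
        for v in set(values):
            value_counts[v] += 1
            rule_counts[(v, label)] += 1
    rules = []
    for (v, label), support in rule_counts.items():
        confidence = 100.0 * support / value_counts[v]
        if support >= min_support and confidence >= min_confidence:
            rules.append((v, label, support, round(confidence, 2)))
    # Strongest rules first: by support, then confidence, then antecedent value.
    return sorted(rules, key=lambda r: (-r[2], -r[3], r[0]))


# Hypothetical records: (attribute values, class), with 1=Solvent and -1=Bankrupt.
records = [
    ((0.064, 0.079), 1),
    ((0.064, 0.079), 1),
    ((0.064, 0.210), 1),
    ((0.064, 0.150), 1),
    ((0.079, -0.300), -1),
    ((0.0785, -0.120), -1),
    ((0.0785, -0.250), -1),
]
rules = mine_single_antecedent_rules(records)
for value, label, support, confidence in rules:
    print(f"{value} -> {label} Sup={support} Conf={confidence}")
```

In this toy data the value 0.079 occurs three times, twice with a solvent business, which gives a
support of 2 and a confidence of 66.67%; the same arithmetic underlies a rule such as
0.079 ->1 Sup=2 Conf=66.67 in Table 1.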
An initial observation from these results is the fact that there are many more rules generated for solvent
businesses (1) than for those that went bankrupt (-1). This would be a likely reason for the
misclassification of one of the bankrupt businesses by our neural network as there are fewer rules from
which it can “learn” the characteristics of a bankrupt business. Another reason for this misclassification
may also be the fact that the network had more solvent businesses to “learn” from than bankrupt ones.
Figure 8 summarises this information by giving the count of solvent and bankrupt businesses as being 27
and 25 respectively. A solution to this problem would be to obtain more data from Japanese businesses
and to train a new network on this data. If this were possible it is likely that more association rules would
be generated for classifying bankrupt businesses and not only that, but the neural network used for the
purposes of classification would likely perform even better than it currently does.
Though the interpretation of these rules is difficult because of their numerical nature, it is clear that this
information is likely to be exactly what a neural network is looking for in its process of learning. The
strong relationships within the data are again seen in the fact that most of the rules have been generated
with a confidence of 100%, leading to less likelihood of confusion in the classification of our neural
network. The strength of relationships in the data is made even more significant when we remember that
our ‘minimum confidence’ was set at 20%, but that the lowest confidence among the association rules
generated is 50%.
In conclusion, these rules represent generalisations which are what a neural network needs to “learn” in
order to be able to solve a classification problem and particularly to classify new, unseen instances. If
few such rules existed the best we could hope for would be that the neural network “memorise” the
training set. As it is, this has not happened and we are able to conclude that we have been successful in
creating a model which is able to generalise over new cases and which is likely to predict whether a
business is bankrupt or solvent with a very high level of accuracy.

5. FINDINGS AND CONCLUSION


Our central aim in this work was to not only provide an analysis and synthesis of data mining tools in
MATLAB, but also a methodology which can be used in continuing this work into the future and
possibly extending it to the creation of a data mining toolbox. We have been successful in creating such
a methodology (see Figure 9 below for a summary) and we have validated these findings by evaluating
and synthesizing three MATLAB data mining tools.

Fig. 9: Broad and Detailed Methodology for the Synthesis of Data Mining Tools in MATLAB

In conclusion we have been able to see the data mining capabilities of MATLAB in a far more holistic
light than has been available previously. The process of synthesis outlined will enable MATLAB to be
used far more extensively in this field in future, particularly as this process is extended to other tools and
case studies. Possibilities for extension to this work are outlined below.

References
1. Adriaans, P. and Zantinge, D., Data Mining, Addison Wesley, Harlow England, 1997, pp 39-42, 69-81, 91, 117, 127.
2. Balasko, B., Abonyi, J. and Feil, B., Fuzzy Clustering and Data Analysis Toolbox for Use with Matlab, University of Veszprem, Veszprem Hungary, 2005.
3. Burton, M., AM 2.2 Mathematical Programming, Rhodes University, Grahamstown South Africa, 2006, pp 1-2.
4. Burton, M., M 4.4 Neural Networks, Rhodes University, South Africa, 2006, pp 102-111.
5. Dwinnell, W., Modeling Methodology 5: Mathematical Programming Languages, 1998.
6. Hand, D., Mannila, H. and Smyth, P., Principles of Data Mining, MIT Press, Cambridge, Massachusetts, 2001, pp 1-24.
7. KDnuggets, KDnuggets Past Polls, 2006.
8. Malone, J., ARMADA Association Rule Miner and Deduction Analysis, 2003, Accessed: 11 May 2006.
9. Mangasarian, O.L. and Wolberg, W.H., Cancer diagnosis via linear programming, SIAM News, Volume 23, Number 5, September 1990, pp 1, 18.
10. Murphy, K., Bayes Net Toolbox for MATLAB, 2005.
11. Paola, B. et al., Tool Selection Methodology in Data Mining, 2006.
12. Pyle, D., Data Preparation for Data Mining, Morgan Kaufmann, San Francisco, California, 1999, pp 118.
13. Roiger, R.J. and Geatz, M.W., Data Mining: A Tutorial Based Primer, Addison Wesley, USA, 2003, pp 7-27, 34-41.
14. Simonoff, J.S., Analysing Categorical Data, Springer-Verlag, New York, 2003.
15. Vesanto, J., Himberg, J., Alhoniemi, E. and Parhankangas, J., SOM Toolbox for MATLAB 5, Helsinki University of Technology, Helsinki Finland, 2000.
16. Vlachos, M., A Practical Time-Series Tutorial with MATLAB, Hawthorne, NY, 2005, pp 7-12.
17. Woolf, R.J., Data Mining Using MATLAB, University of Southern Queensland, Queensland Australia, 2005, pp 1-3, 10-12, 59.
