DataMining_Chapter1
This chapter provides an introduction to the field of data mining. We will explain its purpose,
applications and basic concepts.
1.3 AREAS OF APPLICATION
Data mining is a field of interest for three types of people:
- Scientists: to understand certain phenomena.
- Analysts: to produce reports for decision-makers.
- Decision-makers: to support decision-making.
Data mining applications are emerging in many different areas, such as e-commerce, health
care, finance, telecommunications, etc.:
- E-commerce: sales companies use data mining to improve customer relationship
management. The aim is not just to find out "how many customers bought this product during
this period", but "what their profile is", "what other products will interest them" and "when
they will be interested again".
- Health care: In this field, data mining can be used to analyse risk factors and combinations
of symptoms in order to detect diseases at an early stage and develop treatment strategies.
- Finance: Financial institutions (banks) use data mining to assess the creditworthiness and
default risk of borrowers and to support credit decisions.
- Telecommunications: Analysis of customer behavioural data in order to identify possible
factors of customer dissatisfaction and take targeted measures to build customer loyalty.
- Fraud detection: To tackle the problem of fraud, organisations (customs, tax departments,
etc.) use data mining to exploit massive volumes of data and identify fraud risks. By
implementing a fraud detection system, suspicious behaviour can be quickly identified and
proactive action taken.
1.5.1 Classification:
Classification is the most common task in data mining, and one that seems to be a human
imperative: in order to make sense of our daily lives, we are constantly classifying,
categorising and evaluating [5].
Classification involves studying the characteristics of a new object in order to assign it to
one of a set of predefined classes. The objects to be classified are generally records in a
dataset, and classifying consists of filling in the class field of each record. The classification
task is characterised by a precise definition of the classes and by a set of previously classified
examples. The aim is to build a model that can be applied to unclassified data in order to
classify it.
Some examples of the use of classification tasks in the fields of research and commerce are as
follows [5]:
- Determining whether the use of a credit card is fraudulent.
- Diagnosing whether a certain disease is present [5].
- Determining which telephone numbers correspond to fax machines.
- Determining which telephone lines are used for Internet access.
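To make the task concrete, here is a minimal sketch in Python using scikit-learn (assumed
available); the records, fields and class labels are invented for illustration:

```python
# A toy classification model: learn from previously classified records,
# then assign a class to a new, unclassified record.
from sklearn.tree import DecisionTreeClassifier

# Previously classified examples (fields: [age, income]; class field: risk)
X_train = [[25, 30000], [40, 80000], [35, 60000], [50, 20000]]
y_train = ["high risk", "low risk", "low risk", "high risk"]

model = DecisionTreeClassifier().fit(X_train, y_train)

# Apply the model to an unclassified record to determine its class field
print(model.predict([[30, 50000]]))
```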
1.5.2 Regression:
Regression (or estimation) is similar to classification, except that the output variable is
numeric rather than categorical. Estimation consists of filling in a missing value in a
particular field of a record, based on the values of the other fields. For example, suppose we
want to estimate the systolic blood pressure reading of a hospital patient, based on the
patient's age, gender, body mass index and blood sodium level. The relationship between the
systolic pressure and the other data provides an estimation model, which can then be applied
to other cases [5].
Some examples of the use of estimation tasks in research and commerce are as follows [5]:
- Estimating the number of children in a family.
- Estimating the amount of money that a randomly chosen family of four will spend on going
back to school.
- Estimating the value of a piece of real estate.
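As a sketch of the blood pressure example above, the following snippet fits a linear model
with scikit-learn (assumed available); the patient records and values are invented, and the
gender field is omitted for simplicity:

```python
# Estimating a numeric output (systolic pressure) from other fields
from sklearn.linear_model import LinearRegression

# Fields: [age, body mass index, blood sodium]; target: systolic pressure
X_train = [[50, 24.0, 140], [65, 31.5, 138], [42, 22.1, 142], [70, 28.0, 137]]
y_train = [120, 145, 115, 150]

model = LinearRegression().fit(X_train, y_train)

# Estimate the missing systolic pressure value for a new patient record
print(model.predict([[55, 26.0, 139]]))
```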
1.5.3 Prediction:
Prediction is similar to classification and estimation, except that in prediction the records
are classified according to predicted (estimated) values. The main reason prediction differs
from classification and estimation is that, when the predictive model is created, the temporal
relationship between the input variables and the output variable is taken into account [5].
Some examples of the use of prediction tasks in research and commerce are as follows:
- Predicting the price of shares on the stock market over the next three months.
- Predicting the World Cup football champion based on a comparison of team statistics.
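A minimal sketch of the temporal aspect that distinguishes prediction: past values of a series
are used as inputs to predict a future value. The price series below is invented, and a plain
linear model stands in for a real forecasting method:

```python
# Predicting the next value of a series from its three previous values
from sklearn.linear_model import LinearRegression

prices = [10.0, 10.5, 11.2, 10.8, 11.5, 12.1, 11.9, 12.4]

# Each record's inputs are three consecutive past prices; the output is
# the price that follows them (the temporal relationship).
X = [prices[i:i + 3] for i in range(len(prices) - 3)]
y = [prices[i + 3] for i in range(len(prices) - 3)]

model = LinearRegression().fit(X, y)
print(model.predict([prices[-3:]]))  # predicted next price
```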
1.5.4 Association Rules Analysis:
Association rule analysis consists of determining which attributes "go together". It is the
most common task in the business world, where it is known as affinity analysis or market
basket analysis; the aim is to discover associations that quantify the relationship between
two or more attributes.
Association rules are of the form "If antecedent, then consequent."
Some examples of the use of association rule analysis tasks in the fields of research and
commerce are as follows:
- Finding out which products in a supermarket are bought together, and which are never
bought together.
- Determining the proportion of cases in which a new drug can generate dangerous effects [5].
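The support and confidence of a rule can be computed directly, as in the following sketch
(the transactions are invented; real systems use algorithms such as Apriori):

```python
# Measuring the rule "If bread, then butter" on a few market baskets
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

antecedent, consequent = {"bread"}, {"butter"}
n_antecedent = sum(antecedent <= b for b in baskets)
n_both = sum((antecedent | consequent) <= b for b in baskets)

support = n_both / len(baskets)     # share of baskets with bread and butter
confidence = n_both / n_antecedent  # share of bread baskets that add butter
print(f"support={support:.2f}, confidence={confidence:.2f}")
```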
1.5.5 Description:
Sometimes the aim of data mining is simply to describe what is happening in a complicated
dataset by explaining the relationships that exist in the data, in order to gain the best possible
understanding of the individuals, products and processes present in the dataset. A good
description of behaviour often implies a good explanation of it. For example, in American
society the simple description, "women support the Democratic Party more than men", can
provoke a great deal of interest and promote studies by journalists, sociologists, economists
and political specialists [5].
Data mining is therefore a stage in the process of knowledge discovery from data, the stage
which consists of applying data analysis algorithms [4] (fig 1.2).
- Pattern evaluation and presentation (visualisation, transformation, removal of redundant
patterns, etc.).
- Use of the extracted knowledge.
This definition means that learning is a process, not a product, and that it involves change in
knowledge, beliefs, behaviours or attitudes [6].
In data mining, learning is used to train algorithms on large amounts of data so that they can
make predictions on new data.
There are two types of learning: supervised and unsupervised.
Criterion           | Supervised learning                                                                  | Unsupervised learning
Methodology         | Need for an expert                                                                   | No need for an expert
Data                | Need for training data labelled by the expert                                        | No need for labelled data
Learning process    | Learning is based on the data labelled by the expert to build a prediction model    | Learning is based on unlabelled data, by looking for similarities between these data
Advantages          | The results obtained are generally reliable (high accuracy score)                   | No need for labelled data, so less expensive
Disadvantages       | Preparing the labelled data sets required for learning is very costly (time, money) | The results are generally less reliable than those of supervised learning
Examples of methods | SVM, KNN, Random-Forest, ...                                                         | K-means, DBSCAN, ...
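The contrast can be seen in a few lines of Python with scikit-learn (assumed available); the
points and labels are invented:

```python
# Supervised: a KNN classifier learns from expert-labelled examples.
# Unsupervised: k-means groups the same points with no labels at all.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [8, 8], [9, 8]]

labels = ["A", "A", "B", "B"]  # labels provided by an expert
knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
print(knn.predict([[2, 1]]))   # predicts a class for a new point

km = KMeans(n_clusters=2, n_init=10).fit(X)  # no labels needed
print(km.labels_)              # groups found by similarity alone
```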
#  | Sky   | Temperature (in F) | Humidity | Wind | Play
10 | rainy | 71                 | 80       | Yes  | No
11 | rainy | 65                 | 70       | Yes  | No
12 | rainy | 75                 | 80       | No   | Yes
13 | rainy | 68                 | 80       | No   | Yes
14 | rainy | 70                 | 96       | No   | Yes
Quinlan's table contains the opinions of a group of 14 tennis players (rows 10-14 of which
are shown above). These players were asked whether they would be willing to play under
certain weather conditions (sky condition, temperature, humidity and wind). Their opinion
(Yes or No) appears in the last column of the table. Quinlan introduced this table to present
the decision tree model, but the dataset is now used to illustrate several data mining models.
Fig 1.4 Validation principle
It is important to note that the 70-30 percent split is not an immutable rule: we can also
choose an 80-20 or 75-25 percent split, for example. In all cases, however, the training part
should be the larger share, typically at least 70 percent of the data.
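A minimal sketch of the holdout split with scikit-learn (assumed available); the tiny dataset
is invented:

```python
# Splitting records into a 70% training part and a 30% test part
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

print(len(X_train), len(X_test))  # 7 training records, 3 test records
```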
Cross-validation consists of repeating the operation described above several times, changing
the content of the training set (70%) and the test set (30%) each time. The aim is to obtain the
best performance from the model to be built.
In concrete terms, cross-validation works as follows [7]:
1. Split the data into k folds.
2. Train the model on k-1 of the folds (i.e., train the model on all of the folds, except
one).
3. Evaluate the model on the kth holdout fold by computing performance on a metric.
4. Rotate the folds, and repeat steps 2 and 3 with a new holdout fold, until each of the
k folds has been used as the holdout fold exactly once.
5. Average the model performance across all iterations.
Fig 1.5 gives an illustration of a cross-validation with 5 folds.
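The five steps above can be sketched in Python with scikit-learn (assumed available);
cross_val_score handles the splitting, rotation and scoring, and the dataset here is synthetic:

```python
# 5-fold cross-validation, as illustrated in fig 1.5
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, random_state=0)

# Steps 1-4: split into 5 folds, train on 4, evaluate on the holdout
# fold, and rotate until every fold has been the holdout exactly once.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)

# Step 5: average the performance across all iterations
print(scores, scores.mean())
```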
1.9 EVALUATION MEASURES
In the fields of data mining, pattern recognition, information retrieval and automatic
classification, the performance of a method/system is measured by: precision, recall and F-
score. These measures are defined below.
1.9.1 Precision
Precision is one of the most widely used measures of a system's performance. It measures a
system's ability to produce the right predictions. Let's introduce it with an example.
Example: A database contains 80 medical images corresponding to 80 patients. It is assumed
that 40 of these images concern ‘sick’ patients and the other 40 concern ‘healthy’ patients. A
query is run on an automatic diagnostic system to "search for all images concerning sick
patients". Let's suppose that the system returns 25 images, of which only 20 are relevant
(true positive cases: the patients are really sick) and 5 are not (false positive cases: the
patients are not sick).
We can present the results using what is known as a confusion matrix (which will be defined
below). This is a 2x2 matrix where the rows represent the actual data (exact class data) and
the columns represent the prediction results (fig 1.6).
                          Predicted
                  Positive             Negative
Actual  Positive  TP (True Positive)   FN (False Negative)
        Negative  FP (False Positive)  TN (True Negative)
In the above, the first part, True | False, tells us whether the prediction was right or wrong;
the second part, Positive | Negative, tells us what the prediction was.
Based on these considerations, precision can be defined as follows:
Definition: Precision
Precision is defined as the proportion of correct (relevant) responses returned out of all the
responses returned (correct or false):

Precision = relevant responses returned / all responses returned    (1)
Using the notation introduced above (TP, TN, FP and FN), we can define precision in
another, equivalent way:

Precision = TP / (TP + FP)    (2)
In our example (medical images), the system returned 25 responses, of which 20 are correct
and 5 incorrect, so the precision is:

Precision = 20 / 25 = 0.8 (80%)
1.9.2 Recall
Recall is another important measure: it measures the completeness of a system's responses.
Has the system produced all the expected correct results? Here is its definition:
Definition: Recall
Recall is defined as the proportion of correct (relevant) responses returned out of all the
correct (expected) responses:

Recall = relevant responses returned / all expected relevant responses    (3)
Using the notation introduced above (TP, TN, FP and FN), we can define recall in another,
equivalent way:

Recall = TP / (TP + FN)    (4)
In our previous example (medical images), we have 20 correct responses, but the number of
expected correct responses is 40. So, the recall is equal to:

Recall = 20 / 40 = 0.5 (50%)
1.9.3 F-score
Taken separately, precision and recall are not sufficient to evaluate a system's performance.
That is why another measure has been proposed: the F-score. Here is its definition:
Definition: F-score
The F-score combines the precision and recall measurements. It is equal to:

Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)    (5)
Generally, we take β equal to 1, which is referred to as the F1-score. The F1-score is the
harmonic mean of precision and recall, which gives the same importance to precision as to
recall.
In our example (medical images), the F1-score is equal to:

F1 = 2 × (0.8 × 0.5) / (0.8 + 0.5) ≈ 0.615 (61.5%)
1.9.4 Noise and Silence
Noisy data are data containing a large amount of additional meaningless information, called
noise. It is calculated as follows:

Noise = 1 - Precision

Conversely, the quantity calculated by the following formula is called silence:

Silence = 1 - Recall

In our example (medical images), the noise is equal to 1 - 0.8, i.e. 20%, and the silence is
equal to 1 - 0.5, i.e. 50%.
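The whole medical-image example can be checked in a few lines of Python, directly from the
counts given above (TP = 20, FP = 5, FN = 20):

```python
# Precision, recall, F1, noise and silence for the medical-image example
tp, fp, fn = 20, 5, 20

precision = tp / (tp + fp)                          # 20/25 = 0.80
recall = tp / (tp + fn)                             # 20/40 = 0.50
f1 = 2 * precision * recall / (precision + recall)  # ~0.615

noise = 1 - precision                               # 0.20
silence = 1 - recall                                # 0.50
print(precision, recall, round(f1, 3), noise, silence)
```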
- The precision for a class is equal to the value on the diagonal divided by the column total.
The precision for "Fir tree" is 1/4 (25%), for "Oak tree" 0/2 (0%) and for "Olive tree" 3/6
(50%).
- The recall for a class is equal to the value on the diagonal divided by the row total. The
recall for "Fir tree" is 1/4 (25%), for "Oak tree" 0/4 (0%) and for "Olive tree" 3/4 (75%).
- The overall precision of the system is equal to the average of the class precisions:
(25 + 0 + 50)/3, i.e. 25%.
- The overall recall of the system is equal to the average of the class recalls:
(25 + 0 + 75)/3, i.e. 33.33%.
The presentation of the confusion matrix introduces two other widely used performance
measures: Accuracy and Error rate.
Definition: Accuracy
Accuracy is equal to the number of correct predictions out of the total number of predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (6)

In our example, the accuracy is equal to 4/12, i.e. 33.33%.
Definition: Error rate
The error rate is equal to the number of incorrect predictions out of the total number of
predictions:

Error rate = (FP + FN) / (TP + TN + FP + FN) = 1 - Accuracy    (7)

In our example, the error rate is equal to 8/12, i.e. 66.67%.
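The tree example can be reproduced in Python. The off-diagonal entries of the matrix below
are assumed (chosen to match the row and column totals stated above), since only the
diagonal and the totals are given:

```python
# Per-class precision/recall, accuracy and error rate from a 3-class
# confusion matrix (rows = actual classes, columns = predictions)
import numpy as np

cm = np.array([[1, 2, 1],    # actual Fir tree   (row total 4)
               [2, 0, 2],    # actual Oak tree   (row total 4)
               [1, 0, 3]])   # actual Olive tree (row total 4)

precision = cm.diagonal() / cm.sum(axis=0)  # [0.25, 0.0, 0.5]
recall = cm.diagonal() / cm.sum(axis=1)     # [0.25, 0.0, 0.75]
accuracy = cm.diagonal().sum() / cm.sum()   # 4/12 = 0.3333
print(precision, recall, accuracy, 1 - accuracy)
```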
EXERCISES
…….