
Data Mining - R Assignment

Konstantinos Stavrou (70134)
[email protected]
11/11/2012

Question 1
The matrix m used:

> m
     [,1] [,2] [,3]
[1,]    4   14   34
[2,]    6   22   38
[3,]   10   26   46

a) m[1,] is a row vector and m[,3] is a column vector, holding the 1st row and the 3rd column of the matrix respectively. The operation below is an element-wise multiplication of those two vectors: each element of the row is multiplied by the corresponding element of the column.

> m[1,] * m[,3]
[1]  136  532 1564

[4 14 34] * [34 38 46] = [136 532 1564]

The result is a vector of 3 elements. The code below executes the same multiplication and selects the 3rd element, which is the result of 34*46, a plain number.

> (m[1,]*m[,3])[3]
[1] 1564

b) The expression divides the chosen element's value by the sum of all the elements of the matrix:

> i <- 2
> j <- 3
> m[i,j] / sum(m)
[1] 0.19

Here i and j are assigned the values 2 and 3 respectively, denoting that we chose the element in the 2nd row and 3rd column, which is 38. The sum of all elements is 200, so the result is 38/200 = 0.19.
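The construction of m is not shown in the transcript above; a minimal sketch that reproduces the matrix and the three results (assuming m was entered row by row with matrix()) is:

# Assumed reconstruction of m; the nine values are entered row by row.
m <- matrix(c( 4, 14, 34,
               6, 22, 38,
              10, 26, 46), nrow = 3, byrow = TRUE)

m[1,] * m[,3]        # element-wise product: 136 532 1564
(m[1,] * m[,3])[3]   # third element only: 1564
m[2,3] / sum(m)      # 38 / 200 = 0.19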

Question 2
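The transcript below uses per-species data frames named setosa, versicolor and virginica with prefixed column names (e.g. setosa.Sepal.Width), but does not show how they were built, nor which package supplies skewness() and kurtosis(). A sketch that would reproduce those column names from the built-in iris data set, with e1071 assumed as the source of the two functions, is:

library(e1071)   # assumed source of skewness() and kurtosis() (excess kurtosis)

# Splitting the built-in iris data by species; wrapping each subset in
# data.frame(name = ...) prefixes the column names, e.g. setosa.Sepal.Width.
setosa     <- data.frame(setosa     = subset(iris, Species == "setosa")[, 1:4])
versicolor <- data.frame(versicolor = subset(iris, Species == "versicolor")[, 1:4])
virginica  <- data.frame(virginica  = subset(iris, Species == "virginica")[, 1:4])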
a) We can see that setosa has the widest sepals on average (3.428).

> mean(setosa$setosa.Sepal.Width)
[1] 3.428
> mean(versicolor$versicolor.Sepal.Width)
[1] 2.77
> mean(virginica$virginica.Sepal.Width)
[1] 2.974

b) To decide, I measured the standard deviation of petal width and petal length and averaged the two for each species. By this measure, virginica has the petals that vary the most overall.

> (sd(setosa$setosa.Petal.Width)+sd(setosa$setosa.Petal.Length))/2
[1] 0.1395248
> (sd(versicolor$versicolor.Petal.Width)+sd(versicolor$versicolor.Petal.Length))/2
[1] 0.3338318
> (sd(virginica$virginica.Petal.Width)+sd(virginica$virginica.Petal.Length))/2
[1] 0.4132724

Another approach is to use kurtosis.

> (kurtosis(setosa$setosa.Petal.Width)+kurtosis(setosa$setosa.Petal.Length))/2
[1] 0.9563241
> (kurtosis(versicolor$versicolor.Petal.Width)+kurtosis(versicolor$versicolor.Petal.Length))/2
[1] -0.388785
> (kurtosis(virginica$virginica.Petal.Width)+kurtosis(virginica$virginica.Petal.Length))/2
[1] -0.5595373

A value greater than 0 means that most values are concentrated around the centre (the mean), while a value smaller than 0 means that the values are more widespread, away from the centre. The smaller the value, the larger this spread. This agrees with our previous conclusion based on the standard deviation.

c) First we create two variables: varP stores the petal width values for setosa and varS stores the sepal width values.

> varP <- setosa$setosa.Petal.Width
> varP
 [1] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 0.2 0.2 0.1 0.1 0.2 0.4 0.4 0.3 0.3
[20] 0.3 0.2 0.4 0.2 0.5 0.2 0.2 0.4 0.2 0.2 0.2 0.2 0.4 0.1 0.2 0.2 0.2 0.2 0.1
[39] 0.2 0.2 0.3 0.3 0.2 0.6 0.4 0.3 0.2 0.2 0.2 0.2
> varS <- setosa$setosa.Sepal.Width
> varS
 [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5 3.8
[20] 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2 3.5 3.6
[39] 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3

Then we plot the empirical density of varP and overlay the normal distribution that has the same mean and standard deviation:

> plot(density(varP))
> x <- seq(0, max(varP)+1, length=200)
> y <- dnorm(x, mean=mean(varP), sd=sd(varP))
> lines(x, y, type="l", lwd=2, col="blue")

We do the same for varS:

> plot(density(varS))
> x <- seq(0, max(varS)+1, length=200)
> y <- dnorm(x, mean=mean(varS), sd=sd(varS))
> lines(x, y, type="l", lwd=2, col="blue")

This gives the plots below. Also important for our analysis are the skewness and kurtosis values of each set.

> skewness(varP)
[1] 1.179633
> kurtosis(varP)
[1] 1.258718

and

> skewness(varS)
[1] 0.03872946
> kurtosis(varS)
[1] 0.5959507

We can see that the Petal.Width distribution is far from a normal distribution: the black (empirical) line does not match the blue (normal) one. They are not completely different, but they are still far apart, and the skewness and kurtosis values quantify that deviation. The distribution is shifted towards the lower values and has a very narrow, peaked curve; it is, so to speak, leptokurtic. That means that many values are gathered closely around the mean, and they are gathered mostly on the lower side of the axis.

On the other hand, the distribution of Sepal.Width is quite close to a normal distribution. This is further supported by its skewness and kurtosis values, which are much lower than those of the Petal.Width distribution. Still, this distribution is not perfectly mesokurtic, meaning that it too has many values concentrated around the mean, and it also leans slightly towards the lower values.

So we conclude that the setosa Sepal.Width distribution is closer to the normal distribution.
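As a quick cross-check of this conclusion (not part of the original assignment), one could look at normal Q-Q plots or run the Shapiro-Wilk test on the same two variables, all available in base R:

# Optional cross-check: a Q-Q plot close to the reference line, or a
# Shapiro-Wilk p-value well above 0.05, supports the "close to normal"
# reading for Sepal.Width but not for Petal.Width.
qqnorm(varS); qqline(varS)
qqnorm(varP); qqline(varP)
shapiro.test(varS)
shapiro.test(varP)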

Question 3

As we can see in the scatter plot, the univariate feature that provides the best separation is Petal.Width, since in every pairwise diagram with the other features the green and blue points are well apart and easily discriminated. Conversely, the worst is Sepal.Length, since in every pairwise diagram (except when paired with Petal.Width) the points are mixed together. In that third diagram they are better separated only because they are also driven by the Petal.Width values, which, as described, is the best discriminator between the two species.
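The scatter-plot matrix itself is not reproduced in this transcript. Assuming it compared two of the iris species (for example versicolor and virginica) across all four features, it could have been produced along these lines:

# Hypothetical reconstruction of the pairs plot; the choice of species
# and colours is assumed, not taken from the original code.
two <- subset(iris, Species %in% c("versicolor", "virginica"))
two$Species <- droplevels(two$Species)
pairs(two[, 1:4], col = c("green", "blue")[unclass(two$Species)])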

Question 4
a) As we can see in the figure below we can (manually) draw a line that separates the blue from the red points. So they are linearly separable.

b) The points are now quite close together, so it is not possible to find a straight line that separates them. We can, however, draw a curve that separates them.

c) As we can see in the figure below, it is impossible to draw a line that perfectly separates the red from the blue points. Still, it is quite a good separator, leaving only a few points on the wrong side. It does not reflect the tendency of the data, though, because it was drawn manually for this single sample: if we rerun the code, many more points will fall on the wrong side of the line. To obtain a boundary that better reflects the tendency of the data, we should run many iterations of the code and build an intelligent agent, so to speak, that adjusts the curve in each iteration in the best possible way. When the algorithm ends, the resulting curve will be less suitable than the one in the figure below for the particular sample I ran, but more suitable for real-life cases where the data vary each time.
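The data for this question and the manually drawn curve are not reproduced here. As one illustration of a boundary learned from the data rather than drawn by hand (one possible reading of the "intelligent agent" idea), a logistic regression fitted to simulated two-class points yields a line that follows the overall tendency of the data instead of a single sample:

# Illustrative only: the original data-generating code is not shown,
# so two overlapping Gaussian clouds are simulated here.
set.seed(1)
n  <- 100
x1 <- c(rnorm(n, 0, 1), rnorm(n, 2, 1))
x2 <- c(rnorm(n, 0, 1), rnorm(n, 2, 1))
cl <- factor(rep(c("red", "blue"), each = n))

fit <- glm(cl ~ x1 + x2, family = binomial)   # learned linear boundary
plot(x1, x2, col = as.character(cl))
# Decision boundary: where the predicted log-odds equal 0.
b <- coef(fit)
abline(a = -b[1]/b[3], b = -b[2]/b[3])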

Question 5
Using the code below we create the chart we need; green marks the negative diagnoses and red the positive ones.

> plot(diagnosis$feature_a, diagnosis$feature_b, pch=21,
+      bg=c("darkgreen","red")[unclass(diagnosis$diagnosis)])
> abline(-0.9, 1.05)

From the call abline(-0.9, 1.05) (intercept -0.9, slope 1.05) the line has the equation y = 1.05x - 0.9. Since the slope is positive we can detect that:

a point is above the line when feature_b > 1.05*feature_a - 0.9
a point is below the line when feature_b < 1.05*feature_a - 0.9

So a true positive must have a positive diagnosis and lie below the line; a true negative must have a negative diagnosis and lie above the line; a false positive must have a negative diagnosis and lie below the line; and a false negative must have a positive diagnosis and lie above the line. I store them in objects using this code:

> TP <- subset(diagnosis, diagnosis == 'positive' & feature_b < (1.05*feature_a-0.9))
> TN <- subset(diagnosis, diagnosis == 'negative' & feature_b > (1.05*feature_a-0.9))
> FP <- subset(diagnosis, diagnosis == 'negative' & feature_b < (1.05*feature_a-0.9))
> FN <- subset(diagnosis, diagnosis == 'positive' & feature_b > (1.05*feature_a-0.9))

Using nrow() we can then count how many rows each object has.

> nrow(TP)
[1] 68
> nrow(TN)
[1] 71
> nrow(FP)
[1] 4
> nrow(FN)
[1] 7

And then we are ready to fill the contingency table:

                     Predicted Positive   Predicted Negative
Actually Positive            68                    7
Actually Negative             4                   71
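The same counts can be cross-checked, and basic performance measures derived, with a sketch like the one below. It assumes, as in the transcript, a data frame diagnosis with columns feature_a, feature_b and a diagnosis factor with levels 'negative' and 'positive'.

# Predicted class: below the line y = 1.05*x - 0.9 counts as "positive".
pred <- ifelse(diagnosis$feature_b < 1.05 * diagnosis$feature_a - 0.9,
               "positive", "negative")

# Confusion matrix in one call (rows = actual, columns = predicted).
cm <- table(actual = diagnosis$diagnosis, predicted = pred)
cm

# Derived measures from the counts above.
TP <- cm["positive", "positive"]; TN <- cm["negative", "negative"]
FP <- cm["negative", "positive"]; FN <- cm["positive", "negative"]
c(accuracy    = (TP + TN) / sum(cm),
  sensitivity = TP / (TP + FN),       # 68 / 75
  specificity = TN / (TN + FP))       # 71 / 75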

b)

As we can see from the image, the point lies above the line, which means we would classify it as a negative diagnosis. I would not be very confident in that decision, however, because the point is close to the line. Intuitively, the closer a point is to the line, the larger the possible error, or in other words, the less confident we are about the decision. The line separates the two clusters, so points close to that border share more features with the other cluster than points far from the line. In that sense we could measure confidence as the distance of a point from the line.
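That distance can be made concrete: for the line y = 1.05x - 0.9, the perpendicular distance of a point (a, b) is |1.05a - b - 0.9| / sqrt(1.05^2 + 1). A small helper along these lines (hypothetical, not part of the original submission) would be:

# Perpendicular distance of a point (feature_a, feature_b) from the
# decision line feature_b = 1.05 * feature_a - 0.9; a larger distance
# is read here as higher confidence in the classification.
line_distance <- function(a, b) {
  abs(1.05 * a - b - 0.9) / sqrt(1.05^2 + 1)
}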
