
Data Mining - R Assignment

Konstantinos Stavrou (70134)
[email protected]
11/11/2012

Question 1
The matrix m used:

> m
     [,1] [,2] [,3]
[1,]    4   14   34
[2,]    6   22   38
[3,]   10   26   46

a) m[1,] is a row vector and m[,3] is a column vector, holding the 1st row and the 3rd column of the matrix respectively. The operation below is an element-wise multiplication of those two vectors: each element of the row is multiplied by the corresponding element of the column.

> m[1,] * m[,3]
[1]  136  532 1564

[4 14 34] * [34 38 46] = [136 532 1564]

The result is a vector of 3 elements. The code below executes the same multiplication and selects the 3rd element, which is the result of 34*46, a plain number.

> (m[1,]*m[,3])[3]
[1] 1564

b) The expression divides the chosen element's value by the sum of all the elements of the matrix:

> i <- 2
> j <- 3
> m[i,j] / sum(m)
[1] 0.19

Here i and j are assigned the values 2 and 3 respectively, denoting that we chose the element in the 2nd row and 3rd column, which is 38. The sum of all elements is 200, so the result is 38/200 = 0.19.
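The construction of m is not shown in the transcript above; a minimal sketch that reproduces the matrix and the three results (assuming m was entered row by row with matrix()) is:

# Assumed reconstruction of m; the nine values are entered row by row.
m <- matrix(c( 4, 14, 34,
               6, 22, 38,
              10, 26, 46), nrow = 3, byrow = TRUE)

m[1,] * m[,3]        # element-wise product: 136 532 1564
(m[1,] * m[,3])[3]   # third element only: 1564
m[2,3] / sum(m)      # 38 / 200 = 0.19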

Question 2
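The transcript below uses per-species data frames named setosa, versicolor and virginica with prefixed column names (e.g. setosa.Sepal.Width), but does not show how they were built, nor which package supplies skewness() and kurtosis(). A sketch that would reproduce those column names from the built-in iris data set, with e1071 assumed as the source of the two functions, is:

library(e1071)   # assumed source of skewness() and kurtosis() (excess kurtosis)

# Splitting the built-in iris data by species; wrapping each subset in
# data.frame(name = ...) prefixes the column names, e.g. setosa.Sepal.Width.
setosa     <- data.frame(setosa     = subset(iris, Species == "setosa")[, 1:4])
versicolor <- data.frame(versicolor = subset(iris, Species == "versicolor")[, 1:4])
virginica  <- data.frame(virginica  = subset(iris, Species == "virginica")[, 1:4])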
a) We can see that setosa has the widest sepals on average (3.428).

> mean(setosa$setosa.Sepal.Width)
[1] 3.428
> mean(versicolor$versicolor.Sepal.Width)
[1] 2.77
> mean(virginica$virginica.Sepal.Width)
[1] 2.974

b) To decide, I measured the standard deviation of petal width and petal length and averaged the two for each species. By this measure, virginica has the petals that vary the most overall.

> (sd(setosa$setosa.Petal.Width)+sd(setosa$setosa.Petal.Length))/2
[1] 0.1395248
> (sd(versicolor$versicolor.Petal.Width)+sd(versicolor$versicolor.Petal.Length))/2
[1] 0.3338318
> (sd(virginica$virginica.Petal.Width)+sd(virginica$virginica.Petal.Length))/2
[1] 0.4132724

Another approach is to use kurtosis.

> (kurtosis(setosa$setosa.Petal.Width)+kurtosis(setosa$setosa.Petal.Length))/2
[1] 0.9563241
> (kurtosis(versicolor$versicolor.Petal.Width)+kurtosis(versicolor$versicolor.Petal.Length))/2
[1] -0.388785
> (kurtosis(virginica$virginica.Petal.Width)+kurtosis(virginica$virginica.Petal.Length))/2
[1] -0.5595373

A value greater than 0 means that most values are concentrated around the centre (the mean), while a value smaller than 0 means that the values are more widespread, away from the centre. The smaller the value, the larger this spread. This agrees with our previous conclusion based on the standard deviation.

c) First we create two variables: varP stores the petal width values for setosa and varS stores the sepal width values.

> varP <- setosa$setosa.Petal.Width
> varP
 [1] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 0.2 0.2 0.1 0.1 0.2 0.4 0.4 0.3 0.3
[20] 0.3 0.2 0.4 0.2 0.5 0.2 0.2 0.4 0.2 0.2 0.2 0.2 0.4 0.1 0.2 0.2 0.2 0.2 0.1
[39] 0.2 0.2 0.3 0.3 0.2 0.6 0.4 0.3 0.2 0.2 0.2 0.2
> varS <- setosa$setosa.Sepal.Width
> varS
 [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5 3.8
[20] 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2 3.5 3.6
[39] 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3

Then we plot the empirical density of varP and overlay the normal distribution that has the same mean and standard deviation:

> plot(density(varP))
> x <- seq(0, max(varP)+1, length=200)
> y <- dnorm(x, mean=mean(varP), sd=sd(varP))
> lines(x, y, type="l", lwd=2, col="blue")

We do the same for varS:

> plot(density(varS))
> x <- seq(0, max(varS)+1, length=200)
> y <- dnorm(x, mean=mean(varS), sd=sd(varS))
> lines(x, y, type="l", lwd=2, col="blue")

This gives the plots below. Also important for our analysis are the skewness and kurtosis values of each set.

> skewness(varP)
[1] 1.179633
> kurtosis(varP)
[1] 1.258718

and

> skewness(varS)
[1] 0.03872946
> kurtosis(varS)
[1] 0.5959507

We can see that the Petal.Width distribution is far from a normal distribution: the black (empirical) line does not match the blue (normal) one. They are not completely different, but they are still far apart, and the skewness and kurtosis values quantify that deviation. The distribution is shifted towards the lower values and has a very narrow, peaked curve; it is, so to speak, leptokurtic. That means that many values are gathered closely around the mean, and they are gathered mostly on the lower side of the axis.

On the other hand, the distribution of Sepal.Width is quite close to a normal distribution. This is further supported by its skewness and kurtosis values, which are much lower than those of the Petal.Width distribution. Still, this distribution is not perfectly mesokurtic, meaning that it too has many values concentrated around the mean, and it also leans slightly towards the lower values.

So we conclude that the setosa Sepal.Width distribution is closer to the normal distribution.
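As a quick cross-check of this conclusion (not part of the original assignment), one could look at normal Q-Q plots or run the Shapiro-Wilk test on the same two variables, all available in base R:

# Optional cross-check: a Q-Q plot close to the reference line, or a
# Shapiro-Wilk p-value well above 0.05, supports the "close to normal"
# reading for Sepal.Width but not for Petal.Width.
qqnorm(varS); qqline(varS)
qqnorm(varP); qqline(varP)
shapiro.test(varS)
shapiro.test(varP)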

Question 3

As we can see in the scatter plot, the univariate feature that provides the best separation is Petal.Width, since in every pairwise diagram with the other features the green and blue points are well apart and easily discriminated. Conversely, the worst is Sepal.Length, since in every pairwise diagram (except when paired with Petal.Width) the points are mixed together. In that third diagram they are better separated only because they are also driven by the Petal.Width values, which, as described, is the best discriminator between the two species.
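The scatter-plot matrix itself is not reproduced in this transcript. Assuming it compared two of the iris species (for example versicolor and virginica) across all four features, it could have been produced along these lines:

# Hypothetical reconstruction of the pairs plot; the choice of species
# and colours is assumed, not taken from the original code.
two <- subset(iris, Species %in% c("versicolor", "virginica"))
two$Species <- droplevels(two$Species)
pairs(two[, 1:4], col = c("green", "blue")[unclass(two$Species)])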

Question 4
a) As we can see in the figure below we can (manually) draw a line that separates the blue from the red points. So they are linearly separable.

b) The points are now quite close together, so it is not possible to find a straight line that separates them. We can, however, draw a curve that separates them.

c) As we can see in the figure below, it is impossible to draw a line that perfectly separates the red from the blue points. Still, it is quite a good separator, leaving only a few points on the wrong side. It does not reflect the tendency of the data, though, because it was drawn manually for this single sample: if we rerun the code, many more points will fall on the wrong side of the line. To obtain a boundary that better reflects the tendency of the data, we should run many iterations of the code and build an intelligent agent, so to speak, that adjusts the curve in each iteration in the best possible way. When the algorithm ends, the resulting curve will be less suitable than the one in the figure below for the particular sample I ran, but more suitable for real-life cases where the data vary each time.
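The data for this question and the manually drawn curve are not reproduced here. As one illustration of a boundary learned from the data rather than drawn by hand (one possible reading of the "intelligent agent" idea), a logistic regression fitted to simulated two-class points yields a line that follows the overall tendency of the data instead of a single sample:

# Illustrative only: the original data-generating code is not shown,
# so two overlapping Gaussian clouds are simulated here.
set.seed(1)
n  <- 100
x1 <- c(rnorm(n, 0, 1), rnorm(n, 2, 1))
x2 <- c(rnorm(n, 0, 1), rnorm(n, 2, 1))
cl <- factor(rep(c("red", "blue"), each = n))

fit <- glm(cl ~ x1 + x2, family = binomial)   # learned linear boundary
plot(x1, x2, col = as.character(cl))
# Decision boundary: where the predicted log-odds equal 0.
b <- coef(fit)
abline(a = -b[1]/b[3], b = -b[2]/b[3])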

Question 5
Using the code below we create the chart we need; green marks the negative diagnoses and red the positive ones.

> plot(diagnosis$feature_a, diagnosis$feature_b, pch=21,
+      bg=c("darkgreen","red")[unclass(diagnosis$diagnosis)])
> abline(-0.9, 1.05)

From the call abline(-0.9, 1.05) (intercept -0.9, slope 1.05) the line has the equation y = 1.05x - 0.9. Since the slope is positive we can detect that:

a point is above the line when feature_b > 1.05*feature_a - 0.9
a point is below the line when feature_b < 1.05*feature_a - 0.9

So a true positive must have a positive diagnosis and lie below the line; a true negative must have a negative diagnosis and lie above the line; a false positive must have a negative diagnosis and lie below the line; and a false negative must have a positive diagnosis and lie above the line. I store them in objects using this code:

> TP <- subset(diagnosis, diagnosis == 'positive' & feature_b < (1.05*feature_a-0.9))
> TN <- subset(diagnosis, diagnosis == 'negative' & feature_b > (1.05*feature_a-0.9))
> FP <- subset(diagnosis, diagnosis == 'negative' & feature_b < (1.05*feature_a-0.9))
> FN <- subset(diagnosis, diagnosis == 'positive' & feature_b > (1.05*feature_a-0.9))

Using nrow() we can then count how many rows each object has.

> nrow(TP)
[1] 68
> nrow(TN)
[1] 71
> nrow(FP)
[1] 4
> nrow(FN)
[1] 7

And then we are ready to fill the contingency table:

                     Predicted Positive   Predicted Negative
Actually Positive            68                    7
Actually Negative             4                   71
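The same counts can be cross-checked, and basic performance measures derived, with a sketch like the one below. It assumes, as in the transcript, a data frame diagnosis with columns feature_a, feature_b and a diagnosis factor with levels 'negative' and 'positive'.

# Predicted class: below the line y = 1.05*x - 0.9 counts as "positive".
pred <- ifelse(diagnosis$feature_b < 1.05 * diagnosis$feature_a - 0.9,
               "positive", "negative")

# Confusion matrix in one call (rows = actual, columns = predicted).
cm <- table(actual = diagnosis$diagnosis, predicted = pred)
cm

# Derived measures from the counts above.
TP <- cm["positive", "positive"]; TN <- cm["negative", "negative"]
FP <- cm["negative", "positive"]; FN <- cm["positive", "negative"]
c(accuracy    = (TP + TN) / sum(cm),
  sensitivity = TP / (TP + FN),       # 68 / 75
  specificity = TN / (TN + FP))       # 71 / 75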

b)

As we can see from the image, the point lies above the line, which means we would classify it as a negative diagnosis. I would not be very confident in that decision, however, because the point is close to the line. Intuitively, the closer a point is to the line, the larger the possible error, or in other words, the less confident we are about the decision. The line separates the two clusters, so points close to that border share more features with the other cluster than points far from the line. In that sense we could measure confidence as the distance of a point from the line.
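That distance can be made concrete: for the line y = 1.05x - 0.9, the perpendicular distance of a point (a, b) is |1.05a - b - 0.9| / sqrt(1.05^2 + 1). A small helper along these lines (hypothetical, not part of the original submission) would be:

# Perpendicular distance of a point (feature_a, feature_b) from the
# decision line feature_b = 1.05 * feature_a - 0.9; a larger distance
# is read here as higher confidence in the classification.
line_distance <- function(a, b) {
  abs(1.05 * a - b - 0.9) / sqrt(1.05^2 + 1)
}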
