Unit 4 - R Programming
REGRESSION
Definition : Regression analysis is a data analysis technique that predicts the value of an
unknown variable from the value of another, known variable. It helps in understanding the
relationship between two quantitative variables. One variable, denoted x, is called the predictor
or independent variable; the other, denoted y, is called the response, outcome, or dependent variable.
Two kinds of linear regression models:
a) Simple linear regression
b) Multiple linear regression
A simple linear regression is used to model the relationship between one independent and
one dependent variable. The goal of simple linear regression is to find the line of best fit
through the data points.
The equation for a simple linear regression model is: Y=a+bX+u
where: Y = dependent variable,
X = independent variable used to predict Y,
a = intercept,
b = slope, and
u = residual or the error term.
Example: We can use simple linear regression to model the relationship between two
variables, the height and the weight of a person. We can estimate a person's weight if we know
their height:
Weight = a + b × Height + u
where a = intercept, b = slope, and u = residual (error term).
The slope measures the change in weight with respect to height: for every additional
centimeter of height, the predicted weight changes by "b" units on average. The residual, also
known as the error term, is the part of the weight that the line does not explain.
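The height and weight example can be sketched in R with lm(). This is only an illustration: the eight height/weight values below are made up for the sketch and do not come from these notes.

```r
# Hypothetical data (made up for illustration): heights in cm, weights in kg
height <- c(150, 155, 160, 165, 170, 175, 180, 185)
weight <- c(52, 55, 59, 62, 66, 70, 73, 78)

# Fit Weight = a + b * Height + u by least squares
model <- lm(weight ~ height)

# Extract the estimated intercept (a) and slope (b)
a <- coef(model)[["(Intercept)"]]
b <- coef(model)[["height"]]
print(c(intercept = a, slope = b))

# Predict the weight of a person who is 172 cm tall
new_person <- data.frame(height = 172)
predicted_weight <- predict(model, newdata = new_person)
print(predicted_weight)
```

The prediction is simply a + b × 172, computed from the fitted coefficients.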
Residuals
A good way to test the quality of the fit of the model is to look at the residuals or the
differences between the real values and the predicted values.
Goodness of Fit
The best fit line is the one that minimizes the sum of squared differences between actual and
predicted results. Fig (1) depicts a linear regression model with good fit. Fig (2) depicts a
model with poor goodness of fit, that is, the differences between observed values and
predicted values are large.
Fig (1): Model with good fit
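As a rough sketch of residuals and goodness of fit, the code below computes the residuals of a fitted line and compares its sum of squared residuals with that of an arbitrary alternative line. The data and the alternative line (intercept 0, slope 0.4) are hypothetical values chosen only for this illustration.

```r
# Hypothetical data (made up for illustration): heights in cm, weights in kg
height <- c(150, 155, 160, 165, 170, 175, 180, 185)
weight <- c(52, 55, 59, 62, 66, 70, 73, 78)

model <- lm(weight ~ height)

# Residual = real value - predicted value (same values as residuals(model))
res <- weight - fitted(model)

# Sum of squared residuals for the least-squares line
sse_best <- sum(res^2)

# Sum of squared residuals for an arbitrary alternative line
other_pred <- 0 + 0.4 * height
sse_other <- sum((weight - other_pred)^2)

print(c(best_fit = sse_best, other_line = sse_other))
```

The least-squares line never has a larger sum of squared residuals than any other straight line through the same data.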
a) Plot customization
Plot customization refers to modifying the appearance of a plot so that the information it
conveys is easier to understand. There are many ways to customize a plot:
Changing colors: You can change the color of lines, markers, and other elements in a
plot to make them more visually appealing.
Adding labels and titles: You can add labels to the x-axis, y-axis, and other parts of
the plot to provide context for the data.
Adjusting axes limits: You can adjust the limits of the x-axis and y-axis to zoom in on
specific parts of the data or to change the scale of the plot.
Changing line styles: You can change the style of lines in a plot, such as making them
dashed or dotted, to differentiate between multiple lines in the same plot.
Adding annotations: You can add text or arrows to a plot to highlight specific
features or to provide additional information about the data.
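The customizations listed above can be combined in a single base R plot. The sketch below is one possible arrangement; the data (x, y1, y2) and the label positions are assumptions made for the example.

```r
# Hypothetical data for two curves
x  <- 1:10
y1 <- x^2
y2 <- 10 * x

# Colors, title, axis labels and axis limits are set in plot()
plot(x, y1, type = "l", col = "steelblue", lty = 1,
     main = "Customized Plot", xlab = "x", ylab = "value",
     xlim = c(0, 10), ylim = c(0, 100))

# A second line with a different color and a dashed line style
lines(x, y2, col = "coral", lty = 2)

# Annotations: a text label and an arrow pointing at the point (6, 36) on y = x^2
text(8, 25, "y = x^2")
arrows(x0 = 7.5, y0 = 28, x1 = 6.1, y1 = 35.5, length = 0.1)

# Legend to tell the two lines apart
legend("topleft", legend = c("x^2", "10x"),
       col = c("steelblue", "coral"), lty = c(1, 2))
```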
b) Plotting regions and margins
"Point and click coordinate interaction" refers to the ability to interact with a plot by
clicking on specific points or regions. This type of interaction can be used to highlight
specific features of the data.
For example, in R, the plotly package provides an interactive plotting interface that
allows users to zoom in and out of plots, hover over data points to see additional
information, and click on data points to highlight them.
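A minimal sketch of an interactive plot with plotly is shown below. It assumes the plotly package is installed (the code skips the plot if it is not), and the choice of the iris dataset and its columns is only for illustration.

```r
# Build an interactive scatter plot only if plotly is available
p <- NULL
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(
    data = iris,
    x = ~Sepal.Length, y = ~Petal.Length,
    color = ~Species,
    type = "scatter", mode = "markers"
  )
  # In an interactive session, typing p displays the widget,
  # which supports zooming and hovering over points.
}
```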
2) Using Color Names to Change Plot Color in R
R has names for 657 colors. The colors() function returns all of them.
# display all color names using colors()
colors()
Output
[1] "white" "aliceblue" "antiquewhite"
[4] "antiquewhite1" "antiquewhite2" "antiquewhite3"
……..
[655] "yellow3" "yellow4" "yellowgreen"
Here, the colors() function returns a vector of all the color names in alphabetical order
with the first element being "white".
The rgb() function in R allows us to specify red, green and blue components with a
number between 0 and 1.
This function returns the corresponding hex code discussed above. For example,
rgb(0, 1, 0) # prints "#00FF00"
Eg :
temp <- c(5, 7, 6, 4, 8)
barplot(temp, col = rgb(0.3, 0.7, 0.9), main = "Using RGB Values")
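A few more rgb() calls, as a small sketch. The maxColorValue argument and the col2rgb() function are part of base R but are not mentioned in these notes; they are included here only to show the 0 to 255 scale.

```r
# Components between 0 and 1 (the default scale)
green <- rgb(0, 1, 0)
print(green)

# The same idea on a 0 to 255 scale, using maxColorValue
red255 <- rgb(255, 0, 0, maxColorValue = 255)
print(red255)

# col2rgb() goes the other way: from a color name to its red, green, blue values
steel <- col2rgb("steelblue")
print(steel)

# The color used in the barplot example
sky <- rgb(0.3, 0.7, 0.9)
print(sky)
```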
5) Color Cycling in R
We can color each bar of the barplot with a different color by providing a vector of
colors.
If the number of colors provided is less than the number of bars, the color vector is
recycled.
Eg :
temp <- c(5, 7, 6, 4, 8)
barplot(temp, col = c("red", "coral", "blue", "yellow", "pink"), main = "With 5 Colors")
barplot(temp, col = c("red", "coral", "blue"), main = "With 3 Colors")
In the above example, at first we colored each bar of the barplot by providing a vector
with 5 colors for 5 different bars.
For the second barplot, we have provided a vector with 3 different colors, so the color
is recycled for the last 2 bars.
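The recycling can be made explicit with rep_len(), which repeats a vector until it reaches a given length. This is only a sketch of what barplot() does with a short color vector; the variable names are chosen for the example.

```r
temp <- c(5, 7, 6, 4, 8)
three_colors <- c("red", "coral", "blue")

# Repeat the 3 colors until there is one color per bar
bar_colors <- rep_len(three_colors, length(temp))
print(bar_colors)

# Same coloring as passing the 3-color vector directly
barplot(temp, col = bar_colors, main = "Recycled Colors Made Explicit")
```

The fourth and fifth bars get "red" and "coral", the first two colors again.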
3D SCATTERPLOTS
Definition : 3D scatter plots are a great way to visualize data in three dimensions. They can
be created in R using the scatterplot3d package, which provides functions to draw static 3D
scatter plots.
1) Install and Load the Required Libraries
install.packages("scatterplot3d")
library(scatterplot3d)
2) Prepare the dataset : The iris dataset will be used.
data(iris)
head(iris)
Note : The iris data set gives the measurements (in cm) of four variables, sepal length, sepal
width, petal length and petal width, for 50 flowers from each of 3 species.
3) The function scatterplot3d()
scatterplot3d(x, y=NULL, z=NULL) where x, y, z are the coordinates of points to be
plotted.
4) Change the angle of point view
scatterplot3d(iris[,1:3], angle = 55)
5) Change the main title and axis labels
scatterplot3d(iris[,1:3],
main="3D Scatter Plot",
xlab = "Sepal Length (cm)",
ylab = "Sepal Width (cm)",
zlab = "Petal Length (cm)")
6) Change the shape and the color of points
The arguments pch and color can be used:
scatterplot3d(iris[,1:3], pch = 16, color="steelblue")
7) Change point shapes by groups
shapes = c(16, 17, 18)
shapes <- shapes[as.numeric(iris$Species)]
scatterplot3d(iris[,1:3], pch = shapes)
pch determines the point shape. Here the values 16, 17 and 18 draw a filled circle, a filled
triangle and a filled diamond, one shape per species; other pch values give other shapes.
8) Change point colors by groups
colors <- c("#999999", "#E69F00", "#56B4E9")
colors <- colors[as.numeric(iris$Species)]
scatterplot3d(iris[,1:3], pch = 16, color=colors)
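The group-wise shapes and colors rely on indexing a small vector by the species number. The sketch below shows that mapping step by step; the names group_index, point_shapes and point_colors are chosen for this example, and the final plot is drawn only if scatterplot3d is installed.

```r
data(iris)

# One shape and one color per species
shapes <- c(16, 17, 18)
colors <- c("#999999", "#E69F00", "#56B4E9")

# as.numeric() turns the Species factor into 1, 2 or 3 for each row
group_index <- as.numeric(iris$Species)

# Indexing by group_index gives one shape and one color per row
point_shapes <- shapes[group_index]
point_colors <- colors[group_index]

# Check the mapping: each species uses exactly one shape
print(table(iris$Species, point_shapes))

# Draw the plot with both shapes and colors by group
if (requireNamespace("scatterplot3d", quietly = TRUE)) {
  scatterplot3d::scatterplot3d(iris[, 1:3], pch = point_shapes, color = point_colors)
}
```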
1. Probability Sampling
Definition : Every individual or item in the population has a known, non-zero chance of being
selected.
2. Non-Probability Sampling
Individuals are selected based on specific characteristics or convenience rather than random
selection.
Convenience sampling
Quota sampling
Snowball sampling
Purposive sampling
1. Convenience Sampling
Technique: Participants are selected based on availability or ease of access,
making it a fast and easy sampling method.
Example: A psychology student surveys classmates because they are easily
accessible and available for quick data collection.
2. Quota Sampling
Technique: The population is divided into categories (e.g., age, gender), and a
specified number of participants from each category is chosen non-randomly.
Example: A researcher studying consumer preferences might set a quota to
survey 50 men and 50 women in a shopping mall.
3. Snowball Sampling
Technique: Participants recruit other participants, making it useful for
studying hard-to-reach populations.
Example: In a study on experiences of ex-convicts, initial participants refer
other ex-convicts they know, expanding the sample.
4. Purposive Sampling
Technique: Participants are selected based on specific criteria or
characteristics relevant to the study’s purpose.
Example: In a study on the effects of leadership training, a researcher selects
participants who hold managerial positions to gain insights specific to leaders.
When to Use Each Sampling Method
1. Simple Random Sampling: Use when you need a fully representative sample,
especially if the population is homogeneous and a sampling frame is available.
2. Stratified Sampling: Best when studying specific subgroups within a population, as it
ensures representation across key characteristics.
3. Systematic Sampling: Suitable when you have a large population list and need a simple
yet systematic approach, especially if the list has no inherent order.
4. Cluster Sampling: Useful for large, geographically dispersed populations; ideal when
it’s impractical to survey individuals directly.
5. Convenience Sampling: Ideal for exploratory studies, pilot tests, or when time and
resources are limited.
6. Quota Sampling: Use when studying demographic or categorical diversity, especially
when you need specific representation within the sample.
7. Snowball Sampling: Ideal for reaching hidden, hard-to-reach, or marginalized
populations.
8. Purposive Sampling: Best when studying a specific, well-defined population or a
unique group that directly relates to the research question.
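Three of the probability sampling methods named above can be sketched in base R with sample(). The use of the iris dataset, the sample sizes and the seed are assumptions made for this illustration.

```r
set.seed(42)  # fixed seed so the sketch is reproducible
n <- nrow(iris)

# Simple random sampling: 15 rows, each row equally likely
srs <- iris[sample(n, 15), ]

# Stratified sampling: 5 rows drawn at random from each species
strata <- split(iris, iris$Species)
stratified <- do.call(rbind, lapply(strata, function(g) g[sample(nrow(g), 5), ]))

# Systematic sampling: random start, then every k-th row
k <- 10
start <- sample(k, 1)
systematic <- iris[seq(start, n, by = k), ]

print(table(stratified$Species))
print(nrow(systematic))
```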
Defining Hypotheses
Null Hypothesis (H₀): The starting assumption. For example, "The average visits are 50."
Alternative Hypothesis (H₁): The opposite, saying there is a difference. For example, "The
average visits are not 50."
Types of Hypothesis Testing
1. One-Tailed Test
Used when we expect a change in only one direction, either up or down, but not both. For example,
if testing whether a new algorithm improves accuracy, we only check if accuracy increases.
There are two types of one-tailed test:
Left-Tailed (Left-Sided) Test: Checks if the value is less than expected.
Right-Tailed (Right-Sided) Test: Checks if the value is greater than expected.
2. Two-Tailed Test
Used when we want to see if there is a difference in either direction, higher or lower. For example,
testing if a marketing strategy affects sales, whether they go up or down.
What are Type 1 and Type 2 errors in Hypothesis Testing?
Type I error: When we reject the null hypothesis although it is true. The probability of a
Type I error is denoted by alpha (α).
Type II error: When we fail to reject the null hypothesis although it is false. The probability
of a Type II error is denoted by beta (β).
                      Null Hypothesis is True    Null Hypothesis is False
Reject H₀             Type I error (α)           Correct decision
Fail to reject H₀     Correct decision           Type II error (β)
            Tea    Coffee    Row Total
Male        20     30        50
Female      30     20        50
Step-by-Step Implementation
Compare the test statistic to the critical value from the chi-squared distribution table at α = 0.05 and
df = 1:
Step 6: Conclusion
There is statistically significant evidence to suggest that gender and drink preference are not
independent.
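The test on the gender and drink preference table can be sketched with chisq.test(). Note that for a 2 x 2 table R applies a continuity correction by default; it is switched off here (correct = FALSE) on the assumption that the hand calculation uses the plain chi-squared statistic.

```r
# Observed counts from the table: rows are gender, columns are drink
observed <- matrix(c(20, 30,
                     30, 20),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Drink = c("Tea", "Coffee")))

# Chi-squared test of independence without continuity correction
result <- chisq.test(observed, correct = FALSE)
print(result)

# Critical value at alpha = 0.05 and df = 1
critical <- qchisq(0.95, df = 1)
print(critical)

# Reject the null hypothesis of independence if the statistic exceeds the critical value
print(unname(result$statistic) > critical)
```

All expected counts are 25, so the statistic is 4 × (5² / 25) = 4, which is larger than the critical value of about 3.84.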
ANOVA, also known as analysis of variance, is used to investigate the relationship between a
categorical variable and a continuous variable: it tests whether the mean of the continuous
variable differs across the groups defined by the categorical variable.
R - ANOVA Test
Null Hypothesis: The null hypothesis states that the means of two or more groups are equal, or
equivalently that the effect size is zero. The null hypothesis is commonly written
as H0.
Alternate Hypothesis: The opposite of the null hypothesis is the alternative hypothesis.
Alternative hypotheses are sometimes referred to as H1 or HA.
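A one-way ANOVA can be sketched with aov(). The choice of the iris dataset (sepal length compared across the three species) is an assumption made for this illustration.

```r
# H0: the mean sepal length is the same for all three species
fit <- aov(Sepal.Length ~ Species, data = iris)

# The ANOVA table: degrees of freedom, sums of squares, F value and p-value
anova_table <- summary(fit)[[1]]
print(anova_table)

# p-value for Species: first row, fifth column (Pr(>F)) of the table
p_value <- anova_table[1, 5]

# Reject H0 at the 5% level if the p-value is below 0.05
print(p_value < 0.05)
```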
UNIT - 1
TIMINGS IN R
It is the amount of time it takes for a particular operation, function or set of operations
to execute in R code.
TIMING FUNCTIONS
1) Sys.time() – returns the current system time and can be called multiple
times to calculate elapsed time.
2) system.time() – evaluates an expression and returns the user CPU, system CPU and
elapsed time it took. (R is case-sensitive: the name starts with a lowercase s.)
3) proc.time() – returns the user CPU, system CPU and elapsed time used so far by the
R process. Calling it before and after a sequence of expressions or function calls and
subtracting gives the time they took.
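The three timing functions can be compared in a short sketch. R is case-sensitive, so they are spelled Sys.time(), system.time() and proc.time(). The workloads being timed are arbitrary examples.

```r
# 1) Sys.time(): take the clock time before and after, then subtract
start <- Sys.time()
total <- sum(sqrt(1:1e6))
elapsed_wall <- Sys.time() - start
print(elapsed_wall)

# 2) system.time(): time a single expression
timing <- system.time(for (i in 1:1e5) sqrt(i))
print(timing)

# 3) proc.time(): take process time before and after a sequence of calls
t0 <- proc.time()
squares <- vapply(1:1e4, function(i) i^2, numeric(1))
elapsed_proc <- proc.time() - t0
print(elapsed_proc[["elapsed"]])
```

The exact numbers differ from run to run and from machine to machine.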
OPTIMIZING TIMING
VISIBILITY
It refers to the accessibility of objects (variables, functions and data) within different
scopes or environments.
a) Global scope : Objects defined in the global environment (top level) are
accessible from any part of your R script.
b) Local scope : Objects defined within a specific function are only accessible
within that function.
Lexical Scoping in R
R uses lexical scoping (also called static scoping), meaning that the scope of a variable is determined by
where it is defined in the code, not where it is called.
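A small sketch of lexical scoping follows. The function names f, g, h and caller are made up for the example.

```r
x <- "global"

# g is defined inside f, so it sees f's local x
f <- function() {
  x <- "local to f"
  g <- function() x
  g()
}

# h is defined at the top level, so it sees the top-level x,
# even when it is called from a function that has its own x
h <- function() x
caller <- function() {
  x <- "local to caller"
  h()
}

print(f())       # "local to f"
print(caller())  # "global"
```

Under dynamic scoping caller() would have returned "local to caller"; with lexical scoping it is where h was defined that matters.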
VISIBILITY RULES
SCOPING FUNCTIONS
R provides functions like ls(), ls.str() and objects() that help you list and examine
objects within a specific environment.
Use get() to access global objects by name, even from within a function’s local
scope.
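The sketch below uses ls() and get() from inside a function. The object name threshold and the function check_value are hypothetical; assign() is used only to make sure the object really lives in the global environment.

```r
# Create a global object named "threshold"
assign("threshold", 10, envir = globalenv())

check_value <- function(value) {
  threshold <- 3  # a local object that hides the global one

  # ls() with no arguments lists the objects in the current (local) scope
  local_names <- ls()

  # get() with envir = globalenv() looks the name up in the global environment
  global_threshold <- get("threshold", envir = globalenv())

  list(local_names = local_names,
       local = threshold,
       global = global_threshold)
}

result <- check_value(5)
print(result)
```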
2. HISTOGRAM
A histogram displays the distribution of a numeric variable as a series of adjacent rectangles.
Each bar spans one of the successive numerical intervals (bins), and its height is proportional
to the frequency of the values that fall in that interval.
Syntax: hist(v, main, xlab, xlim, ylim, breaks, col, border)
Parameters:
v: This parameter contains numerical values used in histogram.
main: title of the chart.
col: Used to set color of the bars.
xlab: label for horizontal axis.
border: Used to set border color of each bar.
xlim: range of values shown on the x-axis.
ylim: range of values shown on the y-axis.
breaks: controls the bins, given either as a suggested number of bins or as a vector of break points.
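A short sketch that uses each of the parameters above. The data vector v is made up for the example.

```r
# Hypothetical data
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# Bins of width 10 from 0 to 50; hist() also returns the computed bins
h <- hist(v,
          main = "Histogram of v",
          xlab = "Value",
          col = "lightblue",
          border = "black",
          xlim = c(0, 50),
          ylim = c(0, 5),
          breaks = seq(0, 50, by = 10))

# Number of values in each bin
print(h$counts)
```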
3. BOX PLOTS
A boxplot (also known as a box-and-whisker plot) is used to visualize the distribution of data
based on five key statistics (minimum, first quartile (Q1), median, third quartile (Q3), and
maximum).
Syntax: boxplot(x, data, notch, varwidth, names, main)
Parameters:
x: This parameter is a vector or a formula.
data: This parameter sets the data frame.
notch: This parameter is a logical value. Set as true to draw a notch on each side of the box
around the median.
varwidth: This parameter is a logical value. Set as true to draw the width of each box
proportionate to the sample size.
main: This parameter is the title of the chart.
names: This parameter gives the group labels that will be shown under each boxplot.
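A sketch of a grouped boxplot using the parameters above. The built-in mtcars dataset and the label text are choices made for this example.

```r
# Mileage (mpg) grouped by number of cylinders (cyl)
b <- boxplot(mpg ~ cyl, data = mtcars,
             notch = FALSE,     # no notches around the median
             varwidth = TRUE,   # box width reflects group size
             names = c("4 cyl", "6 cyl", "8 cyl"),
             main = "Mileage by Cylinder Count",
             xlab = "Cylinders", ylab = "Miles per gallon")

# Each column holds the five statistics for one group:
# lower whisker, Q1, median, Q3, upper whisker
print(b$stats)
```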
4. SCATTER PLOTS
A scatter plot is a set of points, each representing one observation, placed according to its
values on the horizontal and vertical axes.
Syntax: plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Parameters:
x: Sets the horizontal coordinates.
y: Sets the vertical coordinates.
xlab: label for horizontal axis.
ylab: label for vertical axis.
main: title of the chart.
xlim: range of x values shown on the plot.
ylim: range of y values shown on the plot.
axes: logical value that indicates whether both axes should be drawn on the plot.
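A sketch of a scatter plot using the parameters above. The built-in mtcars dataset and the axis ranges are choices made for this example.

```r
# Car weight against mileage
plot(x = mtcars$wt, y = mtcars$mpg,
     main = "Weight vs Mileage",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     xlim = c(1, 6),
     ylim = c(10, 35),
     axes = TRUE)
```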
5. LINE PLOT
A line graph is a chart that displays information as a series of data points connected by line segments.
R – Line Graphs
The plot() function in R is used to create the line graph.
Syntax: plot(v, type, col, xlab, ylab)
Parameters:
v: This parameter is a vector that contains only numeric values.
type: This parameter has the following value: