
Unit 4 - R Programming

The document provides an overview of linear regression, including definitions, types (simple and multiple), and the importance of residuals and goodness of fit. It also discusses advanced graphics for data visualization, model selection and diagnostics, and various methods for customizing plots in R. Additionally, it covers statistical tests and sampling methods, highlighting the objectives and types of probability sampling.

Uploaded by

rekha.prabhu

UNIT 4

LINEAR REGRESSION AND PLOTTINGS

REGRESSION
Definition : Regression analysis is a data analysis technique that predicts the value of an
unknown variable from the value of another variable. It helps in understanding the relationship
between two quantitative variables. One variable, denoted x, is called the predictor or
independent variable; the other, denoted y, is called the response, outcome, or dependent variable.
Two kinds of linear regression models:
a) Simple linear regression
b) Multiple linear regression
A simple linear regression is used to model the relationship between one independent and
one dependent variable. The goal of simple linear regression is to find the line of best fit
through the data points.
The equation for a simple linear regression model is: Y=a+bX+u
where: Y = dependent variable,
X = independent variable used to predict Y,
a = intercept,
b = slope, and
u = residual or the error term.
Example: We can use simple linear regression to model the relationship between two
variables, the height and weight of a person. We can estimate a person's weight if we know their height:
Weight = a + b * height
where a = intercept and b = slope.
The slope measures the change in weight with respect to height: on average, for every
centimeter a person grows, his or her weight increases by "b". The residual, also
known as the error term, is the part of the weight that the line does not explain.
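The height and weight example can be tried out with lm(). The sketch below is only an illustration: it assumes the built-in women dataset (which records height in inches and weight in pounds rather than centimeters), and the names fit, a, b and new_person are chosen here, not taken from the notes.

```r
# 'women' ships with base R: average heights (in) and weights (lb) of 15 women
fit <- lm(weight ~ height, data = women)

a <- coef(fit)[["(Intercept)"]]  # intercept a
b <- coef(fit)[["height"]]       # slope b: change in weight per unit of height

# Predict the weight for a height that is not in the data
new_person <- data.frame(height = 66.5)
predict(fit, newdata = new_person)

# Scatter plot with the fitted line of best fit
plot(women$height, women$weight, xlab = "Height (in)", ylab = "Weight (lb)",
     main = "Simple linear regression")
abline(fit, col = "red")
```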

Residuals
A good way to test the quality of the fit of the model is to look at the residuals or the
differences between the real values and the predicted values.
Goodness of Fit
The best fit line is the one that minimizes the sum of squared differences between actual and
predicted results. Fig (1) depicts a linear regression model with good fit. Fig (2) depicts a
model with poor goodness of fit, that is, the difference between observed values and
predicted values is large.
Fig (1): Model with good fit
Fig (2): Model with poor goodness of fit
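Residuals and goodness of fit can also be computed directly. The following is a minimal sketch, again assuming the built-in women dataset; the variable names rss, tss and r_squared are introduced here for illustration.

```r
fit <- lm(weight ~ height, data = women)  # 'women' is a built-in dataset

res  <- residuals(fit)                 # observed minus predicted values
same <- women$weight - fitted(fit)     # the same quantity computed by hand

rss <- sum(res^2)                                   # what least squares minimizes
tss <- sum((women$weight - mean(women$weight))^2)   # total variation in weight
r_squared <- 1 - rss / tss                          # closer to 1 means a better fit

# Residuals scattered evenly around zero suggest a good fit
plot(fitted(fit), res, xlab = "Predicted weight", ylab = "Residual")
abline(h = 0, lty = 2)
```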


MULTIPLE LINEAR REGRESSION
Definition : It is used to model the relationship between more than one independent variable
and one dependent variable.
The goal of multiple linear regression is to find the best-fitting linear function (a plane or
hyperplane rather than a single line) through the data points.
The equation for a multiple linear regression model is :
Y = a + b1X1 + b2X2 + b3X3 + ... + bnXn + u
Where Y = dependent variable
X1, X2, ..., Xn = independent variables used to predict Y
a = intercept
b1, b2, ..., bn = slopes (one coefficient for each independent variable)
u = residual or error term
Example : We can use multiple linear regression to model the relationship between the
variables height, age, and weight of a person:
Weight = a + b1 * height + b2 * age
For a 6-mark question, draw an approximate table of height and weight at different ages
and draw the corresponding line graph for the table.
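The model Weight = a + b1 * height + b2 * age can be fitted with lm() as well. The data below are simulated purely for illustration (the "true" values a = -60, b1 = 0.7 and b2 = 0.2 are assumptions made here, not figures from the notes).

```r
set.seed(1)                                   # make the simulated data reproducible
n      <- 100
height <- rnorm(n, mean = 165, sd = 10)       # cm (made-up values)
age    <- sample(18:60, n, replace = TRUE)    # years (made-up values)
# Relationship chosen for the illustration: a = -60, b1 = 0.7, b2 = 0.2
weight <- -60 + 0.7 * height + 0.2 * age + rnorm(n, sd = 4)
people <- data.frame(height, age, weight)

fit <- lm(weight ~ height + age, data = people)
coef(fit)      # estimates of a, b1 and b2; they should land near the chosen values

# Predicted weight for a 170 cm, 30-year-old person
predict(fit, newdata = data.frame(height = 170, age = 30))
```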

Linear model selection and diagnostics


Linear model selection and diagnostics is a process of selecting the best linear model for a
given set of data and checking the assumptions of the model. The process involves several
steps:
1) Model specification: Choosing the independent variables to include in the model and
deciding on the functional form of the model.
2) Model estimation: Estimating the parameters of the model using a suitable method
such as least squares.
3) Model evaluation: Evaluating the goodness of fit of the model using various
measures such as R-squared, adjusted R-squared, and residual plots.
4) Model selection: Selecting the best model from a set of candidate models using
various criteria such as AIC, BIC, and cross-validation.
5) Model diagnostics: Checking the assumptions of the model such as linearity,
normality, homoscedasticity, and independence of errors.
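The five steps can be sketched in a few lines of R. This is only an outline under assumptions: it uses the built-in mtcars dataset and two candidate models (m1, m2) picked here for illustration.

```r
# Specification and estimation: two candidate models fitted by least squares
m1 <- lm(mpg ~ wt, data = mtcars)              # one predictor
m2 <- lm(mpg ~ wt + hp, data = mtcars)         # two predictors

# Evaluation: adjusted R-squared penalizes extra predictors
summary(m1)$adj.r.squared
summary(m2)$adj.r.squared

# Selection: a lower AIC / BIC indicates the preferred model
aic_table <- AIC(m1, m2)
bic_table <- BIC(m1, m2)
aic_table

# Diagnostics: residuals vs fitted, normal Q-Q, scale-location, leverage
op <- par(mfrow = c(2, 2))
plot(m2)
par(op)
```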
Advanced graphics
Advanced graphics in statistical models refer to the use of graphical techniques to visualize
and analyse data. These techniques can help identify patterns, trends, and relationships in the
data that may not be apparent from numerical summaries.
Example:
 Residual plots: These plots are used to check the assumptions of a linear regression
model. They show the difference between the observed values of the dependent
variable and the predicted values from the model.
 Diagnostic plots: These plots are used to diagnose problems with a statistical model, such as
non-normality or heteroscedasticity of residuals.
 Heatmaps: These plots are used to visualize large datasets with many variables. They
show the correlation between pairs of variables using a color scale.
 Tree diagrams: These plots are used to visualize decision trees, which are models that
predict the value of a dependent variable based on several independent variables.
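Three of these plot types can be drawn with base R alone. The sketch below assumes the built-in mtcars dataset and an arbitrary choice of variables; tree diagrams need an add-on package and are left out.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # built-in dataset

# Residual plot: residuals against fitted values
plot(fitted(fit), residuals(fit), xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Diagnostic plot: normal Q-Q plot to check normality of the residuals
qqnorm(residuals(fit))
qqline(residuals(fit))

# Heatmap of pairwise correlations
corr <- cor(mtcars[, c("mpg", "wt", "hp", "disp")])
heatmap(corr, symm = TRUE, main = "Correlation heatmap")
```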

a) Plot customization
Plot customization refers to the process of modifying the appearance of a plot so that the
information it conveys is easier to understand. There are many ways to customize a plot:
 Changing colors: You can change the color of lines, markers, and other elements in a
plot to make them more visually appealing.
 Adding labels and titles: You can add labels to the x-axis, y-axis, and other parts of
the plot to provide context for the data.
 Adjusting axes limits: You can adjust the limits of the x-axis and y-axis to zoom in on
specific parts of the data or to change the scale of the plot.
 Changing line styles: You can change the style of lines in a plot, such as making them
dashed or dotted, to differentiate between multiple lines in the same plot.
 Adding annotations: You can add text or arrows to a plot to highlight specific
features or to provide additional information about the data.
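The customizations listed above map onto arguments of base R plotting functions. A small sketch with made-up data (x, y1 and y2 are names chosen here):

```r
x  <- 1:10
y1 <- x^2
y2 <- 10 * x

plot(x, y1, type = "l",
     col = "steelblue", lwd = 2,            # changing colors
     main = "Customized plot",              # title
     xlab = "x value", ylab = "y value",    # axis labels
     xlim = c(0, 12), ylim = c(0, 120))     # axis limits
lines(x, y2, col = "coral", lty = 2)        # dashed line style for the second series
text(8, 64, labels = "y = x^2", pos = 4)    # annotation text next to the curve
arrows(6, 100, 9.8, 99, length = 0.1)       # arrow pointing towards the crossing at x = 10
legend("topleft", legend = c("x^2", "10x"),
       col = c("steelblue", "coral"), lty = c(1, 2), bty = "n")
```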
b) Plotting regions and margins

i) R circlize package : This package provides a flexible function called
chordDiagram() that can be used to create circular plots with links inside.
ii) Stata : You can customize the margins of your plot using the margin() option.
iii) Python Matplotlib : You can customize the margins of your plot using
subplots_adjust().
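In base R graphics, which the list above does not mention, margins are set through par(). A minimal sketch, with margin sizes chosen arbitrarily for illustration:

```r
op <- par(mar = c(4, 4, 2, 1),   # inner margins in lines: bottom, left, top, right
          oma = c(0, 0, 2, 0))   # outer margins, here two lines at the top
plot(1:10, main = "Plot region with custom margins")
mtext("Outer margin title", outer = TRUE)   # text drawn in the outer margin
box("figure", lty = 2)                      # outline the figure region
current_mar <- par("mar")                   # margins currently in effect
par(op)                                     # restore the previous settings
```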

c) Point and click coordinate interaction

 "Point and click coordinate interaction" refers to the ability to interact with a plot by
clicking on specific points or regions. This type of interaction can be used to highlight
specific features of the data.
 For example, in R, the plotly package provides an interactive plotting interface that
allows users to zoom in and out of plots, hover over data points to see additional
information, and click on data points to highlight them.

d) Specialized text and label notation


 Greek letters: Greek letters such as α, β, and γ can be used to represent mathematical
symbols or parameters in a plot.
 Subscripts and superscripts: Subscripts and superscripts can be used to represent
indices or exponents in a plot. For example, x_i represents the i-th element of a vector
x, and y^2 represents the square of a variable y.
 Mathematical expressions: Mathematical expressions such as fractions, integrals,
and summations can be used to represent complex mathematical relationships in a plot.
 Units of measurement: Units of measurement such as meters, seconds, and degrees
can be added to labels to provide context for the data.
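In base R these notations are written with expression(), which uses R's plotmath syntax. The labels below are invented for illustration only.

```r
x <- seq(0, 2, length.out = 50)
y <- 1 + 2 * x^2

main_lab <- expression(y == beta[0] + beta[1] * x^2)              # Greek letters, subscripts, superscript
x_lab    <- expression(x[i] ~ "(m)")                              # subscript plus a unit of measurement
y_lab    <- expression(frac(alpha, gamma) ~ "(" * degree * "C)")  # fraction and degree sign

plot(x, y, type = "l", main = main_lab, xlab = x_lab, ylab = y_lab)
text(0.5, 7, expression(sum(x[i], i == 1, n)))                    # summation as an annotation
```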

STEPS TO ANNOTATE TEXT AND LABELS IN PLOTS


1. Load Required Libraries
2. Create a Plot
3. Identify Annotation Points
4. Select Appropriate Annotation Function
5. Specify Annotation Parameters
 label: The text to be displayed.
 x and y: The coordinates where the annotation should be placed.
 hjust and vjust: Horizontal and vertical justification for text alignment.
 angle: Rotation angle of the text.
 color and size: Text color and size.
 linetype, color, size: For lines or arrows.
6. Integrate Annotations into Plot
7. Fine-Tune Annotations
8. Combine Multiple Annotations
9. Label Dynamic Elements
10. Render and Export the Plot
11. Iterate and Refine
Example 1: Annotating a Histogram with Mean and Standard Deviation
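The original figure for this example is not available here. A possible version in base R graphics, using simulated data (the values and colors are assumptions made for illustration), might look like this:

```r
set.seed(42)
x <- rnorm(200, mean = 50, sd = 10)   # simulated measurements
m <- mean(x)
s <- sd(x)

hist(x, col = "lightblue", main = "Histogram with mean and SD", xlab = "Value")
abline(v = m, col = "red", lwd = 2)                       # line at the mean
abline(v = c(m - s, m + s), col = "darkgreen", lty = 2)   # one SD either side
legend("topright", bty = "n",
       legend = c(sprintf("Mean = %.1f", m), sprintf("SD = %.1f", s)),
       col = c("red", "darkgreen"), lty = c(1, 2))
```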

Example 2: Annotating a Pie Chart with Percentage Labels
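The original figure is missing here as well. One way to build percentage labels for a pie chart in base R, with made-up drink counts, is sketched below.

```r
counts <- c(Tea = 30, Coffee = 50, Juice = 20)        # made-up survey counts
pct    <- round(100 * counts / sum(counts))           # share of each slice in percent
slice_labels <- paste0(names(counts), " (", pct, "%)")

pie(counts, labels = slice_labels,
    col = c("coral", "steelblue", "gold"),
    main = "Drink preference")
```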

Example 3: Create an attractive text label visualization with labels


Example 4 : Box Plot with Outlier Annotations
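A sketch of this example in base R follows; the data are invented so that exactly one value lies beyond the whiskers. boxplot() returns the outliers in its out component, which can then be labelled with text().

```r
values <- c(12, 14, 15, 15, 16, 17, 18, 19, 20, 45)   # made-up data with one extreme value

bp <- boxplot(values, col = "lightgray", main = "Box plot with outlier annotation")

# Label every point that boxplot() flagged as an outlier
if (length(bp$out) > 0) {
  text(x = 1.1, y = bp$out, labels = bp$out, pos = 4, col = "red")
}
```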

DEFINING COLORS AND PLOTTING IN HIGHER DIMENSIONS

REPRESENTING AND USING COLORS
We can visually improve our plots by coloring them. This is generally done with
the col graphical parameter.
1) Add Color to Plot in R
We use the following temp vector to create a barplot.
Eg : temp <- c(5,7,6,4,8)
barplot(temp, main="By default")
barplot(temp, col="coral", main="With coloring")

2) Using Color Names to Change Plot Color in R
 R has names for 657 colors.
 The colors() function returns all of these names.
# display all color names using colors()
colors()
Output
[1] "white" "aliceblue" "antiquewhite"
[4] "antiquewhite1" "antiquewhite2" "antiquewhite3"
……..
[655] "yellow3" "yellow4" "yellowgreen"
 Here, the colors() function returns a vector of all the color names in alphabetical order
with the first element being "white".

3) Using Hex Values as Colors in R

 In R, color can also be defined with a hexadecimal value.
 We define a color as a # followed by 6 hexadecimal digits: two each for red, green and
blue, with every pair ranging from 00 to FF.
 For example, #FF0000 would be red, #FFFFFF would
be white and #000000 would be black.
Eg : temp <- c(5,7,6,4,8)
barplot(temp, col="#c00000", main="#c00000")
barplot(temp, col="#AE4371", main="#AE4371")

4) Using RGB Values to Color Plot in R

 The rgb() function in R allows us to specify red, green and blue components with a
number between 0 and 1.
 This function returns the corresponding hex code discussed above. For example,
rgb(0, 1, 0) # prints "#00FF00"
Eg : temp <- c(5,7,6,4,8)
barplot(temp, col = rgb(0.3, 0.7, 0.9), main="Using RGB Values")

5) Color Cycling in R
 We can color each bar of the barplot with a different color by providing a vector of
colors.
 If the number of colors provided is less than the number of bars, the color vector is
recycled.
Eg : temp <- c(5,7,6,4,8)
barplot(temp, col=c("red", "coral", "blue", "yellow", "pink"), main="With 5 Colors")
barplot(temp, col=c("red", "coral", "blue"), main="With 3 Color")

 In the above example, at first we colored each bar of the barplot by providing a vector
with 5 colors for 5 different bars.
 For the second barplot, we have provided a vector with 3 different colors, so the color
is recycled for the last 2 bars.
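The recycling that barplot() performs can be written out by hand with rep(), which makes the resulting color order visible. The variable names three and recycled are chosen here for illustration.

```r
temp  <- c(5, 7, 6, 4, 8)
three <- c("red", "coral", "blue")

# Recycle the 3 colors to cover all 5 bars
recycled <- rep(three, length.out = length(temp))
recycled
# "red" "coral" "blue" "red" "coral"

barplot(temp, col = recycled, main = "Explicitly recycled colors")
```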

6) Using Color Palette in R


 R offers several built-in color palettes which can be used to quickly generate
color vectors of desired length.
 Four commonly used ones are: rainbow(), heat.colors(), terrain.colors(), and topo.colors().
# use rainbow() to generate color palette
rainbow(5)
# Output: "#FF0000FF" "#CCFF00FF" "#00FF66FF" "#0066FFFF" "#CC00FFFF"
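Here, notice that the hexadecimal numbers are 8 digits long. The last two digits are the
transparency (alpha) level, with FF being opaque and 00 being fully transparent. Depending on
the R version, these two digits may be omitted when no alpha value is supplied.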
Example: Using Color Palette in R
temp <- c(5,7,6,4,8)
barplot(temp, col=rainbow(5), main="rainbow")
barplot(temp, col=heat.colors(5), main="heat.colors")
barplot(temp, col=terrain.colors(5), main="terrain.colors")
barplot(temp, col=topo.colors(5), main="topo.colors")
Here, we have used 4 built in color palettes which can be used to quickly generate color
vectors of desired length.

3D SCATTERPLOTS
Definition : 3D scatter plots are a great way to visualize data in three dimensions. They can
be created in R using the scatterplot3d package, which provides the scatterplot3d() function
for drawing static 3D scatter plots.
1) Install and Load the Required Libraries
install.packages("scatterplot3d")
library(scatterplot3d)
2) Prepare the dataset : The iris dataset will be used.
data(iris)
head(iris)
Note : The iris data set gives the measurements, in centimeters, of sepal length and width and
petal length and width for 150 flowers from 3 species.
3) The function scatterplot3d()
scatterplot3d(x, y=NULL, z=NULL) where x, y, z are the coordinates of points to be
plotted.
4) Change the angle of point view
scatterplot3d(iris[,1:3], angle = 55)
5) Change the main title and axis labels
scatterplot3d(iris[,1:3],
main="3D Scatter Plot",
xlab = "Sepal Length (cm)",
ylab = "Sepal Width (cm)",
zlab = "Petal Length (cm)")
6) Change the shape and the color of points
The argument pch and color can be used:
scatterplot3d(iris[,1:3], pch = 16, color="steelblue")
7) Change point shapes by groups
shapes = c(16, 17, 18)
shapes <- shapes[as.numeric(iris$Species)]
scatterplot3d(iris[,1:3], pch = shapes)
 pch determines the point shape: 16 is a filled circle, 17 a filled triangle and 18 a
filled diamond, so each species gets its own shape.
8) Change point colors by groups
colors <- c("#999999", "#E69F00", "#56B4E9")
colors <- colors[as.numeric(iris$Species)]
scatterplot3d(iris[,1:3], pch = 16, color=colors)

9) Remove the box around the plot
scatterplot3d(iris[,1:3], pch = 16, color = colors, grid=TRUE, box=FALSE)

STATISTICAL TEST DEFINITION


A statistical test is a procedure for deciding whether an assertion about a quantitative feature
of a population is true or false.
SAMPLING DISTRIBUTION DEFINITION
It is the probability distribution of a statistic (such as the sample mean) obtained from
repeated samples of the same size drawn from a much larger population.
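A sampling distribution can be approximated by simulation. The sketch below uses a made-up skewed population and an arbitrary sample size of 30; none of these numbers come from the notes.

```r
set.seed(7)
population <- rexp(10000, rate = 1 / 50)   # a skewed, made-up population (mean about 50)

# Draw 2000 samples of size 30 and record the mean of each one
sample_means <- replicate(2000, mean(sample(population, size = 30)))

mean(population)      # population mean
mean(sample_means)    # centre of the sampling distribution, close to the value above
sd(sample_means)      # standard error: roughly sd(population) / sqrt(30)

hist(sample_means, main = "Sampling distribution of the mean", xlab = "Sample mean")
```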
SAMPLING DEFINITION
Sampling is the process of selecting a subset of individuals, items, or data points from a
larger population to analyse and draw conclusions about the entire population.

Key Objectives of Sampling:


1. Reduce Costs and Time: Sampling allows for efficient data collection by focusing on a
representative subset.
2. Improve Accuracy: Smaller, well-designed samples can lead to more accurate, focused
data collection.
3. Ensure Representativeness: By carefully selecting a sample, researchers can ensure
that the findings are relevant to the larger population.

Types of Sampling Methods

Sampling methods can be broadly classified into two categories:


1) Probability sampling
2) Non-probability sampling.

1. Probability Sampling
Definition : Every individual or item in the population has a known, non-zero chance of being
selected.

Types of Probability Sampling:


 Simple random sampling
 Stratified sampling
 Systematic sampling
 Cluster sampling

1. Simple Random Sampling


 Technique: Each individual in the population has an equal chance of being
selected. Researchers use random number generators or random selection tools
to choose participants.
 Example: A school administrator randomly selects 50 students from a list of
all students to survey about cafeteria satisfaction.
2. Stratified Sampling
 Technique: The population is divided into subgroups (strata) based on a
characteristic (e.g., age, gender), and random samples are taken from each
subgroup.
 Example: In a study on employee satisfaction, researchers divide employees
into departments (e.g., sales, HR, finance) and randomly select employees
from each department.
3. Systematic Sampling
 Technique: A starting point is randomly selected, and then every kth
individual is chosen from a list. This method works best when the list has no
periodic pattern that could line up with the sampling interval.
 Example: A researcher wants to survey a population of 1,000 people and
decides to select every 10th person on a sorted list after a random start.
4. Cluster Sampling
 Technique: The population is divided into clusters (groups) that are randomly
selected. All individuals within selected clusters are then included in the
sample.
 Example: In a national health study, a researcher randomly selects specific
cities (clusters) and surveys all residents within those cities.
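The four probability sampling methods can be sketched with sample(). The population below is hypothetical: 1,000 ids with an invented dept column acting as strata and an invented city column acting as clusters.

```r
set.seed(123)
N <- 1000
population <- data.frame(
  id   = 1:N,
  dept = rep(c("sales", "HR", "finance", "IT"), each = N / 4),   # strata
  city = rep(paste0("city", 1:20), times = N / 20)               # clusters
)

# 1. Simple random sampling: 50 ids, each equally likely
srs <- sample(population$id, size = 50)

# 2. Stratified sampling: 10 ids drawn at random from every department
strat <- unlist(lapply(split(population$id, population$dept), sample, size = 10))

# 3. Systematic sampling: random start, then every k-th id
k     <- 10
start <- sample(1:k, 1)
syst  <- population$id[seq(from = start, to = N, by = k)]

# 4. Cluster sampling: pick 3 cities at random and keep everyone in them
chosen <- sample(unique(population$city), 3)
clus   <- population$id[population$city %in% chosen]
```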

2. Non-Probability Sampling
Individuals are selected based on specific characteristics or convenience rather than random
selection.

Types of Non-Probability Sampling:

 Convenience sampling
 Quota sampling
 Snowball sampling
 Purposive sampling

1. Convenience Sampling
 Technique: Participants are selected based on availability or ease of access,
making it a fast and easy sampling method.
 Example: A psychology student surveys classmates because they are easily
accessible and available for quick data collection.
2. Quota Sampling
 Technique: The population is divided into categories (e.g., age, gender), and a
specified number of participants from each category is chosen non-randomly.
 Example: A researcher studying consumer preferences might set a quota to
survey 50 men and 50 women in a shopping mall.
3. Snowball Sampling
 Technique: Participants recruit other participants, making it useful for
studying hard-to-reach populations.
 Example: In a study on experiences of ex-convicts, initial participants refer
other ex-convicts they know, expanding the sample.
4. Purposive Sampling
 Technique: Participants are selected based on specific criteria or
characteristics relevant to the study’s purpose.
 Example: In a study on the effects of leadership training, a researcher selects
participants who hold managerial positions to gain insights specific to leaders.
When to Use Each Sampling Method
1. Simple Random Sampling: Use when you need a fully representative sample,
especially if the population is homogeneous and a sampling frame is available.
2. Stratified Sampling: Best when studying specific subgroups within a population, as it
ensures representation across key characteristics.
3. Systematic Sampling: Suitable when you have a large population list and need a simple
yet systematic approach, especially if the list has no inherent order.
4. Cluster Sampling: Useful for large, geographically dispersed populations; ideal when
it’s impractical to survey individuals directly.
5. Convenience Sampling: Ideal for exploratory studies, pilot tests, or when time and
resources are limited.
6. Quota Sampling: Use when studying demographic or categorical diversity, especially
when you need specific representation within the sample.
7. Snowball Sampling: Ideal for reaching hidden, hard-to-reach, or marginalized
populations.
8. Purposive Sampling: Best when studying a specific, well-defined population or a
unique group that directly relates to the research question.

Examples of Sampling in Research Studies


1. Education Study
 Objective: Investigate student study habits across grade levels.
 Sampling Method: Stratified sampling, where students are divided into
grades (strata) and randomly sampled from each grade.
2. Healthcare Study
 Objective: Examine patient satisfaction in a hospital network.
 Sampling Method: Cluster sampling, where hospitals (clusters) are selected,
and all patients within selected hospitals are surveyed.
3. Consumer Research
 Objective: Understand shopping preferences among young adults.
 Sampling Method: Convenience sampling, where young adults at a popular
mall are surveyed.
4. Social Science Study
 Objective: Study the experiences of refugees in a new country.
 Sampling Method: Snowball sampling, where initial participants (refugees)
refer others in their community.
Advantages and Disadvantages of Each Method
Method | Advantages | Disadvantages
Simple Random | Representative, unbiased, straightforward. | Can be time-consuming, requires a complete list of the population.
Stratified | Ensures all subgroups are represented, good for diverse populations. | More complex, requires accurate subgroup identification.
Systematic | Simple to implement, evenly spaced selection. | Risk of hidden bias if population has patterns.
Cluster | Cost-effective for large, dispersed populations. | Less precision, higher margin of error due to grouped sampling.
Convenience | Quick, easy, low-cost. | Limited generalizability, high risk of sampling bias.
Quota | Ensures representation across groups, efficient. | Non-random, may introduce selection bias.
Snowball | Effective for hard-to-reach populations, participant-driven expansion. | Risk of network bias, limited generalizability.
Purposive | Targeted, suitable for specific populations with unique characteristics. | Non-random, subject to researcher bias, not generalizable.
HYPOTHESIS TESTING
Definition :- Hypothesis testing compares two opposite ideas about a group of people or things and
uses data from a small part of that group (a sample) to decide which idea is more likely true.

Defining Hypotheses
 Null Hypothesis (H₀): The starting assumption. For example, "The average visits are 50."
 Alternative Hypothesis (H₁): The opposite, saying there is a difference. For example, "The
average visits are not 50."
Types of Hypothesis Testing
1. One-Tailed Test
Used when we expect a change in only one direction, either up or down, but not both. For example,
if testing whether a new algorithm improves accuracy, we only check if accuracy increases.
There are two types of one-tailed test:
 Left-Tailed (Left-Sided) Test: Checks if the value is less than expected.
 Right-Tailed (Right-Sided) Test: Checks if the value is greater than expected.
2. Two-Tailed Test
Used when we want to see if there is a difference in either direction, higher or lower. For example,
testing if a marketing strategy affects sales, whether they go up or down.
What are Type 1 and Type 2 errors in Hypothesis Testing?
 Type I error: When we reject the null hypothesis although it is true. The probability of a
Type I error is denoted by alpha (α).
 Type II error: When we fail to reject (accept) the null hypothesis although it is false. The
probability of a Type II error is denoted by beta (β).
Decision | Null Hypothesis is True | Null Hypothesis is False
Accept (fail to reject) H₀ | Correct Decision | Type II Error (False Negative)
Reject H₀ | Type I Error (False Positive) | Correct Decision

Steps in Hypothesis testing / How does Hypothesis testing work?


Step 1: Define Hypotheses:
 Null hypothesis (H₀): Assumes no effect or difference.
 Alternative hypothesis (H₁): Assumes there is an effect or difference.
Step 2: Choose significance level
We select a significance level (usually 0.05). This is the maximum chance we accept of wrongly
rejecting the null hypothesis (Type I error). It also sets the confidence level (1 − α, so 95%
for α = 0.05) required of the results.
Step 3: Collect and Analyse data.
 Gather the data. Once collected, we analyse the data using appropriate statistical methods to
calculate the test statistic.
Step 4: Calculate Test Statistic
The test statistic measures how much the sample data deviates from what we would expect if the null
hypothesis were true.
 Z-test: Used when population variance is known and sample size is large.
 T-test: Used when sample size is small or population variance unknown.
 Chi-square test: Used for categorical data to compare observed vs. expected counts.
Step 5: Make a Decision
 Using Critical Value (for a right-tailed test; a two-tailed test compares the absolute value
of the statistic):
o If test statistic > critical value → reject H0.
o If test statistic ≤ critical value → fail to reject H0.
 Using P-value:
o If p-value ≤ α → reject H0.
o If p-value > α → fail to reject H0.
Example: If p-value is 0.03 and α is 0.05, we reject the null hypothesis because 0.03 < 0.05.
Step 6: Interpret the Results
Based on the decision, we conclude whether there is enough evidence to support the alternative
hypothesis or whether we fail to reject the null hypothesis.
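The steps can be followed in R with a one-sample t-test of the earlier hypothesis "The average visits are 50". The visit counts below are simulated for illustration, and the variable names are chosen here.

```r
set.seed(10)
visits <- rnorm(30, mean = 53, sd = 6)    # made-up sample of 30 daily visit counts

alpha  <- 0.05                             # Step 2: significance level
result <- t.test(visits, mu = 50)          # Steps 3-4: two-tailed test of H0: mean = 50

t_stat   <- unname(result$statistic)                      # test statistic
crit_val <- qt(1 - alpha / 2, df = length(visits) - 1)    # two-tailed critical value
p_value  <- result$p.value

# Step 5: the critical-value rule and the p-value rule agree
reject_by_critical <- abs(t_stat) > crit_val
reject_by_p        <- p_value <= alpha
decision <- if (reject_by_p) "reject H0" else "fail to reject H0"
decision
```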

Chi-Squared Test of Independence


The Chi-Squared Test of Independence is used to determine whether there is a significant
association between two categorical variables.
Example
Suppose you want to test whether gender (Male, Female) is independent of preference (Tea,
Coffee).
Here is the observed data (contingency table):

Gender | Tea | Coffee | Row Total
Male | 20 | 30 | 50
Female | 30 | 20 | 50
Column Total | 50 | 50 | 100

Step-by-Step Implementation

Step 1: Set Hypotheses

 Null hypothesis (H₀): Gender and preference are independent.


 Alternative hypothesis (H₁): Gender and preference are not independent.

Step 2: Compute Expected Frequencies

$E_{ij} = \frac{(\text{row total})_i \times (\text{column total})_j}{\text{grand total}}$

Gender | Tea (Expected) | Coffee (Expected)
Male | (50×50)/100 = 25 | (50×50)/100 = 25
Female | (50×50)/100 = 25 | (50×50)/100 = 25

Step 3: Compute Chi-Squared Statistic


$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
Where $O_{ij}$ is the observed frequency and $E_{ij}$ is the expected frequency.

$\chi^2 = \frac{(20-25)^2}{25} + \frac{(30-25)^2}{25} + \frac{(30-25)^2}{25} + \frac{(20-25)^2}{25} = 1 + 1 + 1 + 1 = 4.0$

Step 4: Determine Degrees of Freedom

Df = (Number of rows −1) × (Number of columns −1) = (2−1) × (2−1) = 1

Step 5: Find Critical Value / P-Value

Compare the test statistic to the critical value from the chi-squared distribution table at α = 0.05 and
df = 1:

 Critical value: 3.841


 Calculated χ²: 4.0

Since 4.0 > 3.841, we reject the null hypothesis.

Step 6: Conclusion

There is statistically significant evidence to suggest that gender and drink preference are not
independent.
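The hand calculation above can be checked with chisq.test(). The sketch passes correct = FALSE because, for 2 × 2 tables, R applies the Yates continuity correction by default, which the worked example does not use.

```r
observed <- matrix(c(20, 30,
                     30, 20),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Drink  = c("Tea", "Coffee")))

# correct = FALSE switches off the continuity correction so that the
# statistic matches the hand calculation
res <- chisq.test(observed, correct = FALSE)

res$expected                 # expected frequencies (Step 2)
res$statistic                # chi-squared statistic (Step 3)
res$parameter                # degrees of freedom (Step 4)
qchisq(0.95, df = 1)         # critical value at alpha = 0.05 (Step 5)
res$p.value                  # p-value for the decision
```

With the default correct = TRUE the statistic shrinks to 3.24, which is below the critical value of 3.841, so the conclusion for this small table depends on whether the correction is applied.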

ANOVA (ANALYSIS OF VARIANCE)

ANOVA, also known as analysis of variance, is used to test whether the mean of a continuous
variable differs across the groups defined by a categorical variable. In R it can be run with
the aov() function.
R - ANOVA Test
 Null Hypothesis: The means of all groups are equal, that is, the effect size is
zero. The null hypothesis is commonly written as H0.
 Alternate Hypothesis: The opposite of the null hypothesis: at least one group mean differs
from the others. Alternative hypotheses are sometimes referred to as H1 or HA.
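A one-way ANOVA can be sketched with aov(). The example assumes the built-in PlantGrowth dataset (plant weights for a control group and two treatment groups); the 0.05 threshold is the usual convention, not a value given in the notes.

```r
# PlantGrowth ships with R: plant weights under a control and two treatments
fit <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(fit)[[1]]          # ANOVA table as a data frame
tab

p_value <- tab[["Pr(>F)"]][1]     # p-value for the group effect
if (p_value <= 0.05) "reject H0: at least one group mean differs" else "fail to reject H0"

boxplot(weight ~ group, data = PlantGrowth, ylab = "Weight")
```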
UNIT - 1
TIMINGS IN R
It is the amount of time it takes for a particular operation, function or set of operations
to execute in R code.

TIMING FUNCTIONS

1) Sys.time() – returns the current system time and can be called multiple
times to calculate elapsed time.
2) system.time() – evaluates an expression and returns the user CPU, system CPU and
elapsed time it took.
3) proc.time() – returns the CPU and elapsed time used so far by the R process. It is useful
for measuring the time taken by a sequence of expressions or function calls.
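The three timing functions can be compared on the same piece of work. slow_sum() below is a hypothetical workload written only for this demonstration.

```r
slow_sum <- function(n) {        # hypothetical workload for the demonstration
  total <- 0
  for (i in seq_len(n)) total <- total + sqrt(i)
  total
}

# 1) Sys.time(): difference between two wall-clock readings
t0 <- Sys.time()
invisible(slow_sum(1e6))
elapsed_wall <- Sys.time() - t0          # a 'difftime' object

# 2) system.time(): user, system and elapsed seconds for one expression
timing <- system.time(slow_sum(1e6))

# 3) proc.time(): cumulative process time, differenced around a block of code
p0 <- proc.time()
invisible(slow_sum(1e6))
block_time <- proc.time() - p0

timing["elapsed"]
```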

OPTIMIZING TIMING

1) Use vectorized operations : R is designed to work efficiently with vectors and
matrices.
2) Memory management : Use memory efficient data structures and remove
unnecessary objects from memory using rm() and gc().
3) Parallel processing : Consider using multi-core processors along with parallel
backends.
4) Profiling : Identify the performance bottlenecks using tools like Rprof, profvis etc.
5) Use efficient sorting algorithms : Choose an appropriate sorting algorithm,
depending on size and requirements.
6) Avoid global variables: Minimize the use of global variables, which can lead to
slower code execution.

CODE OPTIMIZATION TECHNIQUES

1) Vectorization : Instead of loops, use vectorized operations, which can improve
the speed of R scripts.
2) Caching : Caching frequently used data or results can reduce the amount of
time spent recalculating them.
3) Parallelization : Splitting up tasks and running them in parallel can improve the
speed.
4) Memory management : Removing unused objects and avoiding excessive
copying can reduce the execution time.
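The effect of vectorization can be seen by timing a loop against the equivalent vectorized expression. The function names squares_loop and squares_vec are made up for this sketch, and actual timings will vary from machine to machine.

```r
n <- 1e5
x <- runif(n)

# Loop version: fills the result one element at a time
squares_loop <- function(v) {
  out <- numeric(length(v))          # pre-allocate to avoid repeated copying
  for (i in seq_along(v)) out[i] <- v[i]^2
  out
}

# Vectorized version: one operation on the whole vector
squares_vec <- function(v) v^2

all.equal(squares_loop(x), squares_vec(x))   # both approaches give the same values
system.time(squares_loop(x))
system.time(squares_vec(x))                  # usually noticeably faster
```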

VISIBILITY
It refers to the accessibility of objects (variables, functions and data) within different
scopes or environments.

a) Global scope : Objects defined in the global environment (top level) are
accessible from any part of your R script.
b) Local scope : Objects defined within a specific function are only accessible
within that function.
Lexical Scoping in R

R uses lexical scoping (also called static scoping), meaning that the scope of a variable is determined by
where it is defined in the code, not where it is called.

VISIBILITY RULES

1) By default, objects in the global environment are visible to functions unless an
object with the same name is defined within the function’s local scope.
2) If an object with the same name is defined within a local scope, it takes
precedence over the global object.

SCOPING FUNCTIONS

R provides functions like ls(), ls.str() and objects() that help you list and examine
objects within a specific environment.

ACCESSING NON VISIBLE OBJECTS

 Use get() to access global objects by name, even from within a function’s local
scope.
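The visibility rules can be checked with a few small functions. All names below are invented for the illustration, and the code assumes it is run at the top level of a script so that x lives in the global environment.

```r
x <- "global"                 # defined at the top level (global environment)

use_global <- function() x    # no local x, so R finds the global one

use_local <- function() {
  x <- "local"                # masks the global x inside this function only
  x
}

reach_global <- function() {
  x <- "local"
  get("x", envir = globalenv())   # bypass the local x and fetch the global one
}

make_adder <- function(n) {
  function(v) v + n           # lexical scoping: n is looked up where the function was defined
}
add5 <- make_adder(5)

use_global()
use_local()
reach_global()
add5(10)
ls()                          # lists the objects in the current environment
```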

What is Data Visualization?


Data visualization is a technique of representing data as a graph or in another pictorial format. This helps
management to take decisions accurately without having to go through
the entire table.
The basic plots used in R Programming are:
1. Barplots
2. Histogram
3. Box Plots
4. Scatter Plots
5. Line Plots
1. BAR PLOT
 Bar plots provide an easy method of representing categorical data in the form of bars.
 The length or height of each bar represents the value of the category it represents.
 In R, bar charts are created using the function barplot(), which can draw both
vertical and horizontal charts.
Syntax: barplot(H, xlab, ylab, main, names.arg, col)
Parameters:
 H: This is a vector or matrix containing numeric values which are used in bar chart.
 xlab: label for x axis in bar chart.
 ylab: label for y axis in bar chart.
 main: title of the bar chart.
 names.arg: vector of names appearing under each bar in bar chart.
 col: Used to give colours to the bars in the graph.
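A short usage sketch of these parameters, with made-up sales figures; horiz = TRUE is the barplot() argument that switches to horizontal bars.

```r
H <- c(12, 7, 15, 9)                          # made-up sales figures
quarters <- c("Q1", "Q2", "Q3", "Q4")

# barplot() returns the x positions of the bar midpoints
mids <- barplot(H, names.arg = quarters, xlab = "Quarter", ylab = "Units sold",
                main = "Sales per quarter", col = "steelblue")
text(mids, H, labels = H, pos = 1, col = "white")   # value labels just below the top of each bar

# horiz = TRUE draws the same chart with horizontal bars
barplot(H, names.arg = quarters, horiz = TRUE, col = "coral", main = "Horizontal bars")
```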

2. HISTOGRAM
A histogram displays the distribution of a numeric variable as adjacent rectangles: the width of each
rectangle is a numerical interval (bin) and its height is proportional to the frequency of values in that interval.
Syntax: hist(v, main, xlab, xlim, ylim, breaks, col, border)

Parameters:
 v: This parameter contains numerical values used in histogram.
 main: title of the chart.
 col: Used to set color of the bars.
 xlab: label for horizontal axis.
 border: Used to set border color of each bar.
 xlim: Used to set the range of values shown on the x-axis.
 ylim: Used to set the range of values shown on the y-axis.
 breaks: Used to control the bins, given either as a suggested number of bins or as a vector of break points.
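A usage sketch with simulated exam marks (the data and bin edges are assumptions made for illustration). hist() also returns the bins it used, which helps when checking the breaks argument.

```r
set.seed(3)
v <- round(rnorm(100, mean = 60, sd = 12))   # made-up exam marks

h <- hist(v, main = "Distribution of marks", xlab = "Marks",
          xlim = c(20, 100), breaks = seq(0, 120, by = 10),
          col = "lightgreen", border = "darkgreen")

h$breaks   # bin edges actually used
h$counts   # number of values in each bin
```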

3. BOX PLOTS
A boxplot (also known as a box-and-whisker plot) is used to visualize the distribution of data
based on five key statistics (minimum, first quartile (Q1), median, third quartile (Q3), and
maximum).
Syntax: boxplot(x, data, notch, varwidth, names, main)
Parameters:
 x: This parameter is a vector or a formula.
 data: This parameter sets the data frame.
 notch: This parameter is a logical value. Set as TRUE to draw a notch on each side of
the box around the median.
 varwidth: This parameter is a logical value. Set as TRUE to draw the width of the box
proportionate to the sample size.
 main: This parameter is the title of the chart.
 names: This parameter gives the group labels that will be shown under each boxplot.
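A usage sketch with the built-in mtcars dataset, grouping mileage by number of cylinders; the labels are chosen here for illustration.

```r
bp <- boxplot(mpg ~ cyl, data = mtcars,       # formula: mpg grouped by cylinder count
              notch = FALSE,                   # set TRUE to draw notches around the medians
              varwidth = TRUE,                 # box widths reflect group sizes
              names = c("4 cyl", "6 cyl", "8 cyl"),
              main = "Mileage by number of cylinders",
              xlab = "Cylinders", ylab = "Miles per gallon")

bp$stats   # one column per group: lower whisker, Q1, median, Q3, upper whisker
bp$n       # number of observations in each group
```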

4. SCATTER PLOTS
A scatter plot displays individual data points as dots positioned by their values on the horizontal and
vertical axes.
Syntax: plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Parameters:
 x: Sets the horizontal coordinates.
 y: Sets the vertical coordinates.
 xlab: label for horizontal axis.
 ylab: label for vertical axis.
 main: title of the chart.
 xlim: Sets the range of x values shown.
 ylim: Sets the range of y values shown.
 axes: A logical value indicating whether both axes should be drawn on the plot.

5. LINE PLOT
A line graph is a chart that displays information as a series of data points connected by line segments.
R – Line Graphs
The plot() function in R is used to create the line graph.
Syntax: plot(v, type, col, xlab, ylab, main)
Parameters:
 v: This parameter is a vector containing only numeric values.
 type: This parameter takes one of the following values:
1. "p" : This value is used to draw only the points.
2. "l" : This value is used to draw only the lines.
3. "o" : This value is used to draw both points and lines.
 xlab: This parameter is the label for x axis in the chart.
 ylab: This parameter is the label for y axis in the chart.
 main: This parameter is the title of the chart.
 col: This parameter is used to give colors to both the points and lines.
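A usage sketch with two made-up series; lines() is the base R function for adding a further line to an existing plot.

```r
v <- c(7, 12, 28, 3, 41)          # made-up monthly rainfall values
w <- c(14, 7, 6, 19, 3)           # a second made-up series

plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rainfall",
     main = "Rainfall chart")      # "o" draws both points and lines
lines(w, type = "o", col = "blue") # add the second series to the same plot
```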
