R Homework
Q1 - Q4
Experiment with the code below to answer the following questions:
1. What is k the first time the loop is run?
2. What is i the second time the loop is run?
3. Which line of code does nothing when run inside the for-loop and prints values
when run by itself?
4. Which line of code would be the most reasonable to change to only instantiate df on
the second round of the loop?
vec <- 5:2
for (i in vec) {
  j <- i ^ 2
  j
  k <- j - 1
  print(k)
  if (i == vec[1]) {
    df <- data.frame(stringsAsFactors = F)
  }
  if (exists("df")) {
    df <- rbind(df, data.frame(i, j, k))
  }
}
## [1] 24
## [1] 15
## [1] 8
## [1] 3
#Generate data
y <- rnorm(n, mean = u, sd = stdev)
if (roundToggle) {
y <- round(y)
}
if (naLtMinDepthToggle) {
y[y < minDepth] <- NA
}
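The fragment above appears to be the body of the rdSim function that the rest of this homework calls. As a minimal sketch, the complete function might look like the following. The argument order is a guess inferred from the calls in this document, the set.seed() call is an assumption, and the default values are placeholders. This reconstruction is not guaranteed to reproduce the summary statistics shown in this document.

```r
# Hypothetical reconstruction of rdSim(). The argument order, the defaults,
# and the use of set.seed() are assumptions, not the original definition.
rdSim <- function(n, u, stdev, theSeed, roundToggle = TRUE,
                  naLtMinDepthToggle = FALSE, minDepth = 0) {
  # Assumed: seed the generator so that repeated calls are reproducible
  set.seed(theSeed)

  # Generate data
  y <- rnorm(n, mean = u, sd = stdev)
  if (roundToggle) {
    y <- round(y)
  }
  if (naLtMinDepthToggle) {
    # Depths below the minimum are treated as missing
    y[y < minDepth] <- NA
  }
  y
}

# Example call with named arguments, so the guessed order does not matter
demo <- rdSim(n = 1000, u = 10, stdev = 3, theSeed = 1,
              roundToggle = TRUE, naLtMinDepthToggle = TRUE, minDepth = 4)
length(demo)             # one value per simulated locus
min(demo, na.rm = TRUE)  # no remaining value is below minDepth
```

Calling the function with named arguments, as above, avoids depending on the guessed positional order.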
Q5
In the simulated read depth data below, what is the chance that any given locus was not
sequenced?
You will need to analyze this dataset further than I already have to determine this.
Because I set the seed for pseudorandom number generation, you can recreate this exact
dataset by using the same seed. Check your summary statistics against mine to validate that
you have done so successfully.
# Simulate the RD
simOut1 <- rdSim(
  n = 100000,
  u = 5.542377,
  theSeed = 2,
  stdev = 6,
  roundToggle = T,
  naLtMinDepthToggle = F
)
# Print summary statistics
summary(simOut1)
sd(simOut1, na.rm = T)
## [1] 6.012657
sd(simOut2, na.rm = T)
## [1] 5.029079
Q7
Upload a script that turns my plotting script from above into a stand-alone function that
saves the histogram to a file. Nothing needs to be returned, because its output is directed
to the external file.
Guidelines
DO NOT REFERENCE ANY OBJECTS FROM OUTSIDE THE FUNCTION!
• All caps because people will do it otherwise and it won’t work when your TAs grade
it …
• The point of a function is to isolate an analysis, so don’t point to objects outside of
the function!
• The easiest way to enforce this is to use only new object names.
Name the function simRdPlot
Take the following arguments in this order:
• rdVector: A numeric vector of the RD. May contain missing values.
• histFilename: This is a length 1 character vector that gives the path of the output
file.
• width
– This length 1 numeric vector is passed to png() to determine the width of the
file in mm. The default unit for png() is “px”, so you need to set units to “mm”.
– Should default to 168
• height
– This length 1 numeric vector is passed to png() to determine the height of the
file in mm. The default unit for png() is “px”, so you need to set units to “mm”.
– Should default to 84
• res
– This length 1 numeric vector is passed to png() to determine the resolution in
pixels per inch (ppi). This is the unit png() expects for res.
– Should default to 300
How it will be graded
• I will run rm(list = ls()) to clear the environment,
• source the rdSim function above,
• source the function script you upload,
• create an rdSim output called vectorUsed,
• and then call the following script.
• Your script should produce the images at the indicated filepaths.
#Create simulated read depths to be plotted
vectorUsed <- rdSim(10000, 10, 3, 1,T, T, 4)
## png
## 2
simRdPlot(rdVector = vectorUsed,
height = 168, res = 150,
histFilename = studentFilepath_tall)
## png
## 2
simRdPlot(rdVector = vectorUsed,
histFilename = studentFilepath_lowRes,
84,84,75)
## png
## 2
Transform samtools.depth
In order to compare samples, it is often convenient to standardize their distributions. For
normal distributions, you can convert the original distribution to a standard normal
distribution by subtracting the mean from every value and then dividing by the standard
deviation. This results in a mean of 0 and a standard deviation of 1 and facilitates further
comparisons. See here for more details:
https://www.scribbr.com/statistics/standard-normal-distribution/.
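The mean and standard deviation version of this transformation can be sketched on made-up data as follows. The vector x and the seed are arbitrary choices for illustration and have nothing to do with the samtools.depth data.

```r
# Illustration only: made-up, roughly normal data
set.seed(1)  # arbitrary seed so the example is repeatable
x <- rnorm(1000, mean = 50, sd = 8)

# Subtract the mean from every value, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)

mean(z)  # effectively 0, up to floating-point error
sd(z)    # 1, up to floating-point error
```

If x contained missing values, mean() and sd() would need na.rm = TRUE to avoid returning NA.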
Often, when I have multiple samples, I like to use the median and MAD (Median
Absolute Deviation) instead of the mean and standard deviation to standardize the data.
I prefer the median and MAD because they are less sensitive to
issues like the heavy tail observed in chr4_group2 of the samtools.depth data.
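To see why the median and MAD are less affected by a heavy tail, here is a small sketch on made-up data. The vectors clean and heavy are hypothetical and are not derived from samtools.depth. Note that R's mad() multiplies by a constant of 1.4826 by default, which makes it comparable to the standard deviation for normally distributed data.

```r
# Illustration only: compare summary statistics with and without a heavy tail
set.seed(7)  # arbitrary seed so the example is repeatable
clean <- rnorm(1000, mean = 20, sd = 4)
heavy <- c(clean, rep(500, 20))  # same data plus 20 extreme values

summarize <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), mad = mad(v))
}

summarize(clean)
summarize(heavy)  # mean and sd shift a lot; median and mad barely move
```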
Write a function that transforms the following data based on the data’s median and MAD.
These calculations should remove NA values.
Use the results and the samtools.depth data to answer the following questions.
url <-
"https://utexas.box.com/shared/static/rrtbkan08hl7vgmffip87iv96splwq15.zip"
fileName <- "chr4.depth.out"
if (!file.exists(fileName)) {
  zipName <- paste0(fileName, ".zip")
  download.file(url, destfile = zipName)
  unzip(zipName, files = fileName)
}
samtools.depth <- read.table(fileName,stringsAsFactors = F)
Q8
What are the median and MAD of the read depth of chr4_group2 before transforming?
Example hist of chr4_group2
Q9
Use the standardizing function you created to transform the read depths of chr4_group2
using its median and MAD.
What are the new median and MAD of the transformed values?
Example hist of transformed chr4_group2
rdLoop Function
Use the following script to create a highly distorted simulated distribution:
rdLoop <- function(N, U, Sd, MinDepth, Sims, verbose = T) {
  for (i in 1:Sims) {
    # Instantiate df on the first iteration
    if (i == 1) {
      df <- data.frame(stringsAsFactors = F)
    }
    # Simulate reads
    ## Set theSeed to i so that the seed differs between
    ## iterations but is reproducible across runs
    vectorSim <- rdSim(
      N, U, Sd, theSeed = i, minDepth = MinDepth,
      roundToggle = T, naLtMinDepthToggle = T
    )
    # Save summary
    if (any(!is.na(vectorSim))) {
      # Note: the name sd is used twice below, so data.frame()
      # renames the second column (the sample sd) to sd.1
      iDf <- data.frame(
        i = i, n = N, u = U,
        sd = Sd, minRd = MinDepth,
        mean = mean(vectorSim, na.rm = T),
        sd = sd(vectorSim, na.rm = T),
        median = median(vectorSim, na.rm = T),
        cntNotMissing = sum(!is.na(vectorSim))
      )
      df <- rbind(df, iDf)
    } else {
      warning(i, " had no non-missing obs. Continuing...")
    }
    # Monitor progress
    if (i %% 1000 == 1 && verbose) {
      print(i)
    }
  }
  return(df)
}
loopedDf <- rdLoop(N = 100, U = 10, Sd = 6,
MinDepth = 3, Sims = 10000)
## [1] 1
## [1] 1001
## [1] 2001
## [1] 3001
## [1] 4001
## [1] 5001
## [1] 6001
## [1] 7001
## [1] 8001
## [1] 9001
Q10
Currently, the script above simulates sampling read depth across 100 loci from a population
(e.g. a solution of DNA) with a mean of 10 and an sd of 6. However, sometimes the mean RD
of the sample was quite a bit more than 10!
This is expected to occur more often than not since the distribution is skewed to the right
due to being truncated at the “minimum read depth”.
In this experiment, what was the proportion of samples where the mean RD was greater
than or equal to 10?
Q11
What proportion of the simulated samples had a mean RD greater than or equal to 12.5?
Since this proportion is less than 0.01, if you tested a single 100-locus sample and found
that its mean RD was 12.5, you could reject the null hypothesis that it came from a
population with a mean RD of 10 and a standard deviation of 6 at a significance level of 0.01.
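The reasoning above, comparing an observed value against a distribution of simulated sample means, can be sketched with a toy example. Everything here is hypothetical: the null distribution is a plain Normal(0, 1) with 20 observations per sample, not the rdSim setup, and the observed value of 0.6 is made up.

```r
# Illustration only: an empirical (simulation-based) tail proportion
set.seed(42)  # arbitrary seed so the example is repeatable
nSims <- 5000

# Toy null model: each simulated sample is 20 draws from Normal(0, 1)
toyMeans <- replicate(nSims, mean(rnorm(20, mean = 0, sd = 1)))

observed <- 0.6  # hypothetical observed sample mean

# Proportion of simulated means at least as large as the observed value
pEmp <- mean(toyMeans >= observed)
pEmp

# Reject the toy null at a given level if the proportion falls below it
pEmp < 0.01
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is why mean(toyMeans >= observed) works as a tail proportion.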
Example histogram of the sample mean:
Q12
Modify the script above to simulate only 50 loci, with all other parameters held
constant.
What proportion of the simulated samples with only 50 loci had a mean RD
greater than or equal to 12.5?
Would you still reject the null hypothesis at a 0.01 significance level?
Would you still reject the null hypothesis at a 0.05 significance level?
Example histogram of the sample mean:
hist(loopedDf50$mean,breaks = 100,xlim = xLims)
abline(v = 12.5,col="red")
Q13
Going back to 100 loci measured per simulation, change the minimum read depth to 5.
What is the proportion of the simulated samples having a mean RD greater than or equal to
12.5?
Example histogram of the sample mean:
hist(loopedDf5MinRd$mean,breaks = 100,xlim = xLims)
abline(v = 12.5,col="red")
Q14
TRUE or FALSE: Increasing the minimum read depth of included observations increases
the total number of non-missing observations (cntNotMissing).
Q15
TRUE or FALSE: If I want to print a base R object from inside a loop or function, then I have
to manually use the print function.