
R Homework

The document describes a read depth simulation function (rdSim) and provides sample code to generate simulated read depth data. It then asks a series of questions about analyzing the sample read depth data and modifying the rdSim function. Key details include calculating the probability of a locus not being sequenced from the sample data, identifying arguments needed to recreate a provided histogram, and writing a function to save histograms of read depth data to file.

Uploaded by

Testa Mesta

R assessment

Q1 - Q4
Experiment with the following code in order to answer the following questions:
1. What is k the first time the loop is run?
2. What is i the second time the loop is run?
3. Which line of code does nothing when run inside the for-loop and prints values
when run by itself?
4. Which line of code would be the most reasonable to change to only instantiate df on
the second round of the loop?
vec <- 5:2
for (i in vec) {
  j <- i ^ 2
  j
  k <- j - 1
  print(k)
  if(i == vec[1]){
    df <- data.frame(stringsAsFactors = F)
  }
  if(exists("df")){
    df <- rbind(df,data.frame(i,j,k))
  }
}

## [1] 24
## [1] 15
## [1] 8
## [1] 3
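As background for experimenting with the loop, R auto-prints a bare expression typed at the console but discards its value inside a for-loop unless print() is called. The sketch below shows that behaviour in isolation; the object names are illustrative and not part of the assignment code.

```r
# Capture whatever each loop writes to the console.
silentOutput <- capture.output(
  for (x in 1:3) {
    x ^ 2          # bare expression: evaluated, then discarded
  }
)
printedOutput <- capture.output(
  for (x in 1:3) {
    print(x ^ 2)   # explicit print: written to the console
  }
)

length(silentOutput)   # 0 lines captured
printedOutput          # "[1] 1" "[1] 4" "[1] 9"
```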

Read depth simulator


Read through the following function meant to simulate read depth for a sample across n
unrelated loci.
rdSim <- function(
  n, #length 1 positive integer
  u, #length 1 number
  stdev, #length 1 positive num
  theSeed, #length 1 integer
  roundToggle, #length 1 logical
  naLtMinDepthToggle, #length 1 logical
  minDepth #length 1 number
){
  #Check input values
  if (n %% 1 != 0) {
    warning("coercing n of ",n, " to ", round(n))
    n <- round(n)
  }
  if (n < 1 | stdev <= 0) {
    stop("n and sd must be above 0")
  }

  #Makes pseudorandom generation predictable
  set.seed(theSeed)

  #Generate data
  y <- rnorm(n, mean = u, sd = stdev)
  if (roundToggle) {
    y <- round(y)
  }
  if (naLtMinDepthToggle) {
    y[y < minDepth] <- NA
  }

  #Resets the pseudorandom seed
  set.seed(NULL) # Not needed from inside function

  #Return data and exit function
  return(y)
}
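The two mechanisms rdSim relies on, seeding the generator and masking values below a cutoff, can be tried on their own. This is a minimal sketch with made-up values (the seed, the toy vector and the cutoff of 4 are assumptions for illustration), not a replacement for the function above.

```r
# Same seed, same call: the draws repeat exactly.
set.seed(2)
firstDraw <- round(rnorm(5, mean = 10, sd = 3))
set.seed(2)
secondDraw <- round(rnorm(5, mean = 10, sd = 3))
identical(firstDraw, secondDraw)   # TRUE

# Masking values below a cutoff, as the naLtMinDepthToggle branch does.
toyDepth <- c(2, 7, 3, 10, 4)      # made-up depths
toyDepth[toyDepth < 4] <- NA
toyDepth                           # NA  7 NA 10  4
sum(is.na(toyDepth))               # 2

set.seed(NULL)   # re-randomise the generator after the demonstration
```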

Q5
In the simulated read depth data below, what is the chance that any given locus was not
sequenced?
You will need to analyze this dataset further than I already have to determine this.
Because I set the seed for pseudorandom number generation, you can recreate this exact
dataset by using the same seed. Check your summary statistics against mine to validate that
you have done so successfully.
# Simulate the RD
simOut1 <- rdSim(
n = 100000,
u = 5.542377,
theSeed = 2,
stdev = 6,
roundToggle = T,
naLtMinDepthToggle = F
)
# Print summary statistics
summary(simOut1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -19.000   2.000   6.000   5.561  10.000  32.000

sd(simOut1, na.rm = T)

## [1] 6.012657

# This math centers each bin around integers
bottomBreak <- min(c(floor(min(simOut1,na.rm = T))-1/6,-1/6))
topBreak <- ceiling(max(simOut1,na.rm = T)) + 1/6
breaksForHist <- seq(
  from = bottomBreak,
  to = topBreak,
  by = 1/3
)

# Create an x-axis label for plot
lociCnt <- format(length(simOut1),big.mark = ",")

# Plot the histogram
hist(x = simOut1,
     breaks = breaksForHist,
     xlim = c(min(breaksForHist),
              max(breaksForHist)),
     xlab = paste0("Read depth (",lociCnt," loci)"),
     main = NULL)
abline(v = mean(simOut1,na.rm = T), col="red")

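For Q5, one possible reading of "not sequenced" is a simulated depth of zero or less; that interpretation is an assumption here, not something the assignment states. Under it, the proportion can be computed as the mean of a logical vector. The sketch uses a made-up toy vector rather than simOut1.

```r
toyDepth <- c(-2, 0, 1, 3, 5, 8, 0, 4, 6, 7)   # made-up depths
notSequenced <- toyDepth <= 0                  # TRUE where depth is zero or less
propNotSequenced <- mean(notSequenced)         # mean of TRUE/FALSE is a proportion
propNotSequenced                               # 0.3

# For an unrounded normal depth, the same quantity is a lower-tail probability.
# mean = 5 and sd = 2 are made-up parameters.
tailProb <- pnorm(0, mean = 5, sd = 2)
tailProb
```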
Q6
What are the values of the rdSim arguments needed to produce the following histogram?
Hints:
• Use theSeed = 2 to imitate my results precisely. FYI, set.seed is consistent across
versions of R.
• I used integer values for all number-based arguments.
• The quality-cutoffs distort the summary statistics from those used to create the
original data.
• Base your arguments off of the image primarily!
# Print summary statistics
summary(simOut2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    5.00    9.00   13.00   13.23   17.00   42.00  105882

sd(simOut2, na.rm = T)

## [1] 5.029079

# This math centers each bin around integers
bottomBreak <- min(c(
  floor(min(simOut2,na.rm = T))-1/6,
  -1/6
))
topBreak <- ceiling(max(simOut2,na.rm = T)) + 1/6
breaksForHist <- seq(
  from = bottomBreak,to = topBreak,by = 1/3
)
# Create an x-axis label for plot
lociCnt <- format(length(simOut2),big.mark = ",")
# Plot the histogram
hist(x = simOut2,
     breaks = breaksForHist,
     xlim = c(min(breaksForHist),max(breaksForHist)),
     xlab = paste0("Read depth (",lociCnt," loci)"),
     main = NULL)
abline(v = mean(simOut2,na.rm = T),col="red")
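The break arithmetic in the plotting script can be checked on a small example. With breaks every 1/3 starting at an integer minus 1/6, every integer sits in the middle of its own narrow bin, with two empty bins between neighbouring integers. The sketch below uses a made-up vector and plot = FALSE so nothing is drawn.

```r
toyDepth <- c(0, 1, 1, 2, 5)   # made-up integer depths
bottomBreak <- min(c(floor(min(toyDepth)) - 1/6, -1/6))
topBreak <- ceiling(max(toyDepth)) + 1/6
toyBreaks <- seq(from = bottomBreak, to = topBreak, by = 1/3)

# plot = FALSE returns the bin summary without drawing anything.
toyHist <- hist(toyDepth, breaks = toyBreaks, plot = FALSE)
occupied <- toyHist$counts > 0

round(toyHist$mids[occupied], 10)   # 0 1 2 5: occupied bins are centred on integers
toyHist$counts[occupied]            # 1 2 1 1
```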

Q7
Upload a script that turns my plotting script from above into a stand-alone function that
saves the histogram to a file. Nothing needs to be returned, because its output is directed
to the external file.
Guidelines
DO NOT REFERENCE ANY OBJECTS FROM OUTSIDE THE FUNCTION!
• All caps because people will do it otherwise and it won’t work when your TAs grade
it …
• The point of a function is to isolate an analysis, so don’t point to objects outside of
the function!
• The easiest way to enforce this is to use only new object names.
Name the function simRdPlot
Take the following arguments in this order:
• rdVector: A numeric vector of the RD. May contain missing values.

• histFilename: This is a length 1 character vector that gives the path of the output
object.

• width

– This length 1 numeric vector is passed to png() to determine the width of the
file in mm. The default unit for png() is “px”, so you need to set units to “mm”.
– Should default to 168
• height
– This length 1 numeric vector is passed to png() to determine the height of the
file in mm. The default unit for png() is “px”, so you need to set units to “mm”.
– Should default to 84
• res
– This length 1 numeric vector is passed to png() to determine the resolution in
pixels per inch (ppi), which is the unit res expects.
– Should default to 300
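The general pattern for writing a base R plot to a PNG file is to open the device, draw, then close the device. The sketch below shows that pattern only; savePngSketch, its arguments and the toy data are hypothetical names chosen for illustration, and it leaves out the break calculation and axis label from the plotting script above.

```r
# Hypothetical helper, named to avoid clashing with the assignment's function.
savePngSketch <- function(values, outFile, width = 168, height = 84, res = 300) {
  # units = "mm" makes width and height millimetres instead of pixels.
  png(filename = outFile, width = width, height = height,
      units = "mm", res = res)
  # Close the device even if plotting fails, so the file is always finalised.
  on.exit(dev.off())
  hist(values, main = NULL, xlab = "Value")
  invisible(NULL)
}

demoFile <- tempfile(fileext = ".png")   # throwaway path for the demonstration
savePngSketch(values = c(1, 2, 2, 3, 3, 3), outFile = demoFile)
file.exists(demoFile)
```

This assumes the R build has PNG support, which capabilities("png") reports.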
How it will be graded
• I will run rm(list = ls()) to clear the environment,
• source the rdSim function above,
• source the function script you upload,
• create an rdSim output called vectorUsed,
• and then call the following script.
• Your script should produce the images at the indicated filepath.
#Create simulated read depths to be plotted
vectorUsed <- rdSim(10000, 10, 3, 1,T, T, 4)

#Run the student's function (must be named simRdPlot).
simRdPlot(rdVector = vectorUsed,
          histFilename = studentFilepath)

## png
## 2

simRdPlot(rdVector = vectorUsed,
          height = 168, res = 150,
          histFilename = studentFilepath_tall)

## png
## 2

simRdPlot(rdVector = vectorUsed,
          histFilename = studentFilepath_lowRes,
          84,84,75)

## png
## 2
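The three calls above mix named and unnamed arguments, which is why the argument order matters. R matches named arguments first and then fills the remaining formal arguments with the unnamed values in order. The sketch below uses a hypothetical function, showArgs, with the same argument order to show which values end up where.

```r
# Hypothetical function with the same argument order as the assignment asks for.
showArgs <- function(rdVector, histFilename, width = 168, height = 84, res = 300) {
  c(width = width, height = height, res = res)
}

# Named arguments are matched first; the unnamed 84, 84, 75 then fill
# width, height and res in order.
lowResArgs <- showArgs(rdVector = 1:3, histFilename = "demo.png", 84, 84, 75)
lowResArgs   # width 84, height 84, res 75

# Naming only some optional arguments leaves the rest at their defaults.
tallArgs <- showArgs(rdVector = 1:3, height = 168, res = 150,
                     histFilename = "demo.png")
tallArgs     # width 168, height 168, res 150
```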

#Check for files
imageFiles <- list.files(
  path = dirname(studentFilepath),
  pattern = gsub(pattern = "\\.png$","",
                 x = basename(studentFilepath))
)

# File sizes may vary slightly depending on OS.
file.info(imageFiles)

##                             size isdir mode               mtime
## studentIdQ7Plot.png        13140 FALSE  666 2022-02-14 19:47:41
## studentIdQ7Plot_lowRes.png  1894 FALSE  666 2022-02-14 19:47:41
## studentIdQ7Plot_tall.png   10968 FALSE  666 2022-02-14 19:47:41
##                                          ctime               atime exe
## studentIdQ7Plot.png        2022-02-14 10:42:49 2022-02-14 19:47:41  no
## studentIdQ7Plot_lowRes.png 2022-02-14 10:42:49 2022-02-14 19:47:41  no
## studentIdQ7Plot_tall.png   2022-02-14 10:42:49 2022-02-14 19:47:41  no

Transform samtools.depth
In order to compare samples, it is often convenient to standardize their distributions. For
normal distributions, you can convert the original distribution to a standard normal
distribution by subtracting the mean from every value and dividing by the standard
deviation. This results in a mean of 0 and a standard deviation of 1 and facilitates further
comparisons. See here for more details:
https://www.scribbr.com/statistics/standard-normal-distribution/.
Often, when I have multiple samples, I like to use the median and MAD (Median
Absolute Deviation) instead of the mean and standard deviation to transform the data into a
standard normal distribution. I prefer the median and MAD because they help account for
issues like the heavy tail observed in chr4_group2 of the samtools.depth data.
Write a function that transforms the following data based on the data’s median and MAD.
These calculations should remove NA values.
Use the results and the samtools.depth data to answer the following questions.
url <-
  "https://utexas.box.com/shared/static/rrtbkan08hl7vgmffip87iv96splwq15.zip"
fileName <- "chr4.depth.out"
if (!file.exists(fileName)) {
  zipName <- paste0(fileName,".zip")
  download.file(url,destfile = zipName)
  unzip(zipName,files = fileName)
}
samtools.depth <- read.table(fileName,stringsAsFactors = F)
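The median/MAD transformation described above can be sketched on a toy vector. The helper name standardizeByMad and the data are hypothetical; the sketch also assumes R's default mad(), which multiplies the raw median absolute deviation by 1.4826 so that it estimates the standard deviation of normal data.

```r
# Hypothetical helper; the name and the toy data are not from the assignment.
standardizeByMad <- function(x) {
  centre <- median(x, na.rm = TRUE)
  spread <- mad(x, na.rm = TRUE)   # scaled by 1.4826 by default
  (x - centre) / spread            # NA values stay NA in the result
}

toyDepth <- c(1, 2, 3, 4, 100, NA)   # one extreme value and one NA
toyZ <- standardizeByMad(toyDepth)

median(toyZ, na.rm = TRUE)   # 0
mad(toyZ, na.rm = TRUE)      # 1, up to floating-point error
```

A vector whose MAD is 0 (for example, mostly identical values) would divide by zero, so a real function may want to check for that case.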

Q8
What are the median and MAD of the read depth of chr4_group2 before transforming?
Example hist of chr4_group2

Q9
Use the standardizing function you created to transform the read depths of chr4_group2
using its median and MAD.
What are the new median and MAD of the transformed values?
Example hist of transformed chr4_group2
rdLoop Function
Use the following script to create a highly distorted simulated distribution:
rdLoop <- function(N,U,Sd,MinDepth,Sims,verbose=T){
  for(i in 1:Sims){
    #Instantiate df
    if(i == 1){
      df <- data.frame(stringsAsFactors = F)
    }
    #Simulate reads
    ##Set theSeed to i so that it is random between
    ### iterations, but is predictable across simulations
    vectorSim <- rdSim(
      N, U, Sd, theSeed = i, minDepth = MinDepth,
      roundToggle = T, naLtMinDepthToggle = T
    )
    #Save summary
    if(any(!is.na(vectorSim))){
      iDf <- data.frame(
        i = i, n = N, u = U,
        sd = Sd, minRd = MinDepth,
        mean = mean (vectorSim,na.rm=T),
        sd = sd (vectorSim,na.rm=T),
        median = median(vectorSim,na.rm=T),
        cntNotMissing = sum(!is.na(vectorSim))
      )
      df<-rbind(df,iDf)
    }else{
      warning(i," had no non-missing obs. Continuing...")
    }
    #Monitor progress
    if(i%%1000==1 & verbose){
      print(i)
    }
  }
  return(df)
}
loopedDf <- rdLoop(N = 100, U = 10, Sd = 6,
                   MinDepth = 3, Sims = 10000)

## [1] 1
## [1] 1001
## [1] 2001
## [1] 3001
## [1] 4001
## [1] 5001
## [1] 6001
## [1] 7001
## [1] 8001
## [1] 9001
Q10
Currently, the above simulates sampling read depth across 100 loci from a population
(e.g. a solution of DNA) with a mean of 10 and a sd of 6. However, sometimes the mean RD
of the sample was quite a bit more than 10!
This is expected to occur more often than not since the distribution is skewed to the right
due to being truncated at the “minimum read depth”.
In this experiment, what was the proportion of samples where the mean RD was greater
than or equal to 10?
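The effect of the cutoff on the mean can be seen in a single simulated sample. This is a minimal sketch that draws depths directly with rnorm() instead of calling rdSim; the seed and the sample size of 1000 are arbitrary choices for illustration.

```r
set.seed(1)                                   # arbitrary seed for the sketch
fullDepth <- round(rnorm(1000, mean = 10, sd = 6))
keptDepth <- fullDepth[fullDepth >= 3]        # drop loci below the cutoff

mean(fullDepth)                               # near the population mean of 10
mean(keptDepth)                               # larger: only low values were removed
length(keptDepth) < length(fullDepth)         # TRUE, some loci were dropped
set.seed(NULL)                                # re-randomise afterwards
```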

Q11
What proportion of the simulated samples had a mean RD greater than or equal to 12.5?
Since this proportion is less than 0.01, if you tested a single 100-locus sample and found
that its mean RD was 12.5, you could reject the null hypothesis that it came from a
population with a mean RD of 10 and a standard deviation of 6 at a significance level of 0.01.
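The reasoning above amounts to an empirical p-value: simulate many sample means under the null hypothesis and take the proportion that are at least as large as the observed mean. The sketch below is a reduced version with only 500 simulations, using a hypothetical helper simMeanOnce instead of rdSim and rdLoop, so its result is coarser than the 10,000-simulation run and will not match it exactly.

```r
# Hypothetical helper: mean depth of one simulated sample after truncation.
simMeanOnce <- function(nLoci, mu, sigma, minDepth) {
  y <- round(rnorm(nLoci, mean = mu, sd = sigma))
  mean(y[y >= minDepth])   # loci below the cutoff are excluded
}

set.seed(1)   # arbitrary seed for the sketch
nullMeans <- replicate(500, simMeanOnce(100, 10, 6, 3))

observedMean <- 12.5
pEmpirical <- mean(nullMeans >= observedMean)   # proportion at least as extreme
pEmpirical
pEmpirical < 0.01   # TRUE would mean rejecting at the 0.01 level
set.seed(NULL)      # re-randomise afterwards
```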
Example histogram of the sample mean:

Q12
Modify the script above to simulate only 50 loci measured with all other parameters held
constant.
What is the probability of the simulated samples with only 50 loci having a mean RD
greater than or equal to 12.5?
Would you still reject the null hypothesis at a 0.01 significance level?
Would you still reject the null hypothesis at a 0.05 significance level?
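The number of loci matters because the spread of the sample mean shrinks with the square root of the sample size. The sketch below illustrates this with plain rnorm() draws and no minimum-depth cutoff, which is a simplification of the rdLoop setup; the seed and the 2000 replicates are arbitrary.

```r
set.seed(1)   # arbitrary seed for the sketch
means50  <- replicate(2000, mean(rnorm(50,  mean = 10, sd = 6)))
means100 <- replicate(2000, mean(rnorm(100, mean = 10, sd = 6)))

sd(means50)                   # close to 6 / sqrt(50)
sd(means100)                  # close to 6 / sqrt(100)
sd(means50) / sd(means100)    # close to sqrt(2)
set.seed(NULL)                # re-randomise afterwards
```

A wider spread with 50 loci means a mean RD of 12.5 is less unusual under the null hypothesis than it is with 100 loci.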
Example histogram of the sample mean:
hist(loopedDf50$mean,breaks = 100,xlim = xLims)
abline(v = 12.5,col="red")

Q13
Going back to 100 loci measured per simulation, change the minimum read depth to call to
5.
What is the proportion of the simulated samples having a mean RD greater than or equal to
12.5?
Example histogram of the sample mean:
hist(loopedDf5MinRd$mean,breaks = 100,xlim = xLims)
abline(v = 12.5,col="red")
Q14
TRUE or FALSE: Increasing the minimum read depth of included observations increases
the total number of non-missing observations (cntNotMissing).

Q15
TRUE or FALSE: If I want to print a base R object from inside a loop or function, then I have
to manually use the print function.
