Overview

In this assignment you will use genome-wide association (GWA) mapping methods to identify genetic loci associated with variation in body weight seed beetle. You will compare GWA mapping results based on standard linear regression and LASSO regression. You will upload your R script with your code. Your script must include comments (i.e., annotations) that demonstrate that you know what your code is doing. You must also include brief answers to the question posed for each task as comments in your code. You can find the rubric I will use for assessment of this assignment on the assignment page (i.e., where you downloaded this document).

The data for this assignment come from two files. compbioX.txt is a text file with genetic data. This is for a seed beetle, Callosobruchus maculatus. There are 27931 rows; each row contains the data for a genetic locus (specifically for a single nucleotide polymorphism or SNP). The rows denote each beetle, while the columns denote genotypes of these beetles at each locus, one column per individual. In other words, the data set includes 27931 genetic loci scored in 249 individuals. The genotypes are estimates of the number of non-reference alleles each individual has; they range from 0 to 2, but are not constrained to be integer values (not too important for what you are doing). The second file provides the trait data and is called compbioY.txt. It has 249 rows, one per seed beetle. There is just one column. The column contains the values in mg for the body weight of each beetle. Your goal is to identify genetic loci associated with body weight.

The genetic data file is big. You can read it in with read.table() but it will be slow. There is a function, fread, available from the data.table package. Consider installing this package and using fread instead of read.table (this is not required).

library(data.table)
#Read genotype data
G <- fread("compbioX.txt", header = FALSE)
#Additionally, you can use the T function to orient this data however is useful for your analysis
G <- t(G)

Task 1

Your first task is to fit a standard linear model for weight as a function of each genetic locus (SNP). Only include one genetic locus in the model at a time. Thus, you will fit 27,931 linear models. In your comments, report the number of covariates with P values less than 0.05. To strictly control for false positives, you would need to only consider associations as being statistically significant if the P value is less than 0.05 / 27,931. This is called a strict Bonferroni correction. How many associations meet this criterion?

Also include code that plots the regression coefficients (x-axis) against the -log10 P values (y-axis). You do not need to comment on the plots but do make sure you understand what you see when you make them. To complete this task you will need to automate extracting the regression coefficients and P values from the regression output. The cartoon example below should help you with this (this is not the exact code you need, but is conceptually similar to what you need to do).

#Load some data
data(trees)
#Fit a model
o <- lm(trees$Girth ~ trees$Height)
#Convert to summary output with p-values
sout <- summary(o)
#Get beta and p value, row 2 columns 1 and 4
sout
## 
## Call:
## lm(formula = trees$Girth ~ trees$Height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2386 -1.9205 -0.0714  2.7450  4.5384 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -6.18839    5.96020  -1.038  0.30772   
## trees$Height  0.25575    0.07816   3.272  0.00276 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.728 on 29 degrees of freedom
## Multiple R-squared:  0.2697, Adjusted R-squared:  0.2445 
## F-statistic: 10.71 on 1 and 29 DF,  p-value: 0.002758

Task 2

A key limitation of the analysis above is that we had too many covariates to fit a single model and we would ideally need to control for all of the tests we did by reducing our significance threshold and thereby reducing power. We will try an alternative approach to model phenotype as a function of genotype, the LASSO regression, that partially overcomes these limitations. Specifically, fit a LASSO model for weight (as a numeric value) as a function of the full set of genetic data. Make sure to chose a value for λ first for using 10-fold cross-validation. Next, compute and report (in the comments) the model r2 . Last, make two plots: (i) a plot of the LASSO regression coefficients (betas) along the genome (similar to your plot above), and (ii) a scatter plot of the LASSO regression coefficients versus regression coefficients for each SNP from Task 1. Use the absolute value of the regression coefficients (betas) for these plot. Comment (as comments in the code) on the similarity or lack there of between the results from the standard linear model analyses and the LASSO regression.