Class 12 Pt. 2

Kyle Canturia (A17502778)

Section 1: Identify genetic variants of interest

Downloaded data from Ensembl as a csv file

mxl <- read.csv("373531-SampleGenotypes-Homo_sapiens_Variation_Sample_rs8067378.csv")
head(mxl)

  Sample..Male.Female.Unknown. Genotype..forward.strand. Population.s. Father
                NA19648 (F)                       A|A ALL, AMR, MXL      -
                NA19649 (M)                       G|G ALL, AMR, MXL      -
                NA19651 (F)                       A|A ALL, AMR, MXL      -
                NA19652 (M)                       G|G ALL, AMR, MXL      -
                NA19654 (F)                       G|G ALL, AMR, MXL      -
                NA19655 (M)                       A|G ALL, AMR, MXL      -
  Mother
    -
    -
    -
    -
    -
    -

table(mxl$Genotype..forward.strand.)

A|A A|G G|A G|G 
 22  21  12   9 

GBR dataset

gbr <- read.csv("373522-SampleGenotypes-Homo_sapiens_Variation_Sample_rs8067378.csv")
head(gbr)

  Sample..Male.Female.Unknown. Genotype..forward.strand. Population.s. Father
                HG00096 (M)                       A|A ALL, EUR, GBR      -
                HG00097 (F)                       G|A ALL, EUR, GBR      -
                HG00099 (F)                       G|G ALL, EUR, GBR      -
                HG00100 (F)                       A|A ALL, EUR, GBR      -
                HG00101 (M)                       A|A ALL, EUR, GBR      -
                HG00102 (F)                       A|A ALL, EUR, GBR      -
  Mother
    -
    -
    -
    -
    -
    -

Section 4: Population Scale Analysis

pop <- read.table("rs8067378_ENSG00000172057.6.txt")
head(pop)

   sample geno      exp
HG00367  A/G 28.96038
NA20768  A/G 20.24449
HG00361  A/A 31.32628
HG00135  A/A 34.11169
NA18870  G/G 18.25141
NA11993  A/A 32.89721

#reads dataset and assigns it to a variable

Q13: Read this file into R and determine the sample size for each genotype and their corresponding median expression levels for each of these genotypes.

table(pop$geno)

A/A A/G G/G 
108 233 121 

##only looks at the column "geno" and returns count of how many of each genotype is present

Theres 108 A|A, 233 A|G, and 121 G|G.

library(dplyr) 

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

  pop %>%
  group_by(geno) %>%
  summarize(geno_median = median(exp))

# A tibble: 3 × 2
  geno  geno_median
  <chr>       <dbl>
1 A/A          31.2
2 A/G          25.1
3 G/G          20.1

##uses dplyr tools to first group the samples by genotype, then summarizes the median of each genotype

The median expression levels for A|A, A|G, and G|G are 31.24, 25.06, and 20.07 respectively.

Q14: Generate a boxplot with a box per genotype, what could you infer from the relative expression value between A/A and G/G displayed in this plot? Does the SNP effect the expression of ORMDL3?

library(ggplot2)
ggplot(pop) + 
  aes(geno, exp) +
  geom_boxplot()

##makes boxplot with geno on the x axis and exp on the y

A|A has the highest expression value while G|G has the lowest, and A|G is in between. From the plot, it’s evident that the SNP is associated with differing expressions of ORMDL3.