
Analysis of Metagenomic Data 2016

Integrated Assignment 2


Answers to Integrated Assignment Day 2

Q1) How many total sample files do we have?

A1: There are two ways to do this:

• List and count the number of *.fasta files we have in our folder

ls *.fasta | wc -l

• Count the lines in the metadata file. Each row in the metadata file corresponds to one sample. Remember to subtract 1 from the result, as the first row contains the headers.

wc -l metadata-file-for-osd-subset-210615.txt

There are 25 sample files.
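The subtraction can also be left to the shell. The following is a minimal sketch, assuming the metadata file has exactly one header row; `count_samples` is a helper name introduced here for illustration, and the small demo file is made up.

```shell
# Hypothetical helper: count the data rows of a tab-separated file,
# skipping the first (header) row.
count_samples() {
    # tail -n +2 starts output at line 2; tr strips the padding some
    # wc implementations add in front of the number.
    tail -n +2 "$1" | wc -l | tr -d ' '
}

# Made-up demo file: one header row plus two sample rows.
demo=$(mktemp)
printf 'label\tdepth\nOSD10\t1\nOSD152\t5\n' > "$demo"
count_samples "$demo"   # prints 2
rm -f "$demo"
```

On the workshop data the call would be `count_samples metadata-file-for-osd-subset-210615.txt`.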

Q2) How many sequences does the sample from OSD station 10 contain?

A2: Since we are dealing with FASTA files, every sequence has exactly one header line starting with “>”, so we can count the lines containing that character using the grep command

grep -c ">" OSD10.comb.qc.masked.dedup.subsample.fasta

There are 100257 sequences.
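A slightly stricter variant anchors the pattern to the start of the line, so only real header lines are counted. The quotes matter here: an unquoted > is read by the shell as output redirection and would overwrite the file. This is a sketch only; `count_seqs` is a hypothetical helper name and the demo FASTA file is made up.

```shell
# Hypothetical helper: count FASTA records by counting header lines,
# i.e. lines that start with ">".
count_seqs() {
    grep -c '^>' "$1"
}

# Made-up demo file with two records; the second header contains an
# extra ">" that does not affect the count.
demo=$(mktemp)
printf '>seq1\nACGT\n>seq2 note>extra\nGGCC\nTTAA\n' > "$demo"
count_seqs "$demo"   # prints 2
rm -f "$demo"
```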

Q3) How many samples of each type are there in each of the different Province code categories?

A3: Again we can use the metadata file for this. The prov_code is in column 12 of the file, and we can use the sort and uniq commands to get a count of each type. The header row shows up as one extra line in the output.

cut -f 12 metadata-file-for-osd-subset-210615.txt | sort | uniq -c

There are 9 ARCT samples and 16 NWCS samples.
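To keep the header out of the tally, the first row can be dropped before cutting the column. A minimal sketch, assuming a single header row and tab-separated columns; `count_by_column` is a hypothetical helper and the demo file is made up.

```shell
# Hypothetical helper: tally the values in one column of a tab-separated
# file, ignoring the header row. $1 = file, $2 = column number.
count_by_column() {
    tail -n +2 "$1" | cut -f "$2" | sort | uniq -c
}

# Made-up demo file: prov_code is column 2 here (column 12 in the
# workshop metadata file).
demo=$(mktemp)
printf 'label\tprov_code\nA\tARCT\nB\tNWCS\nC\tNWCS\n' > "$demo"
count_by_column "$demo" 2   # one line per province code, with its count
rm -f "$demo"
```

On the workshop data the call would be `count_by_column metadata-file-for-osd-subset-210615.txt 12`.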

Q4) In the STAMP analysis of the Metaphlan results, do you see any separation in the samples when the PCA is coloured by Depth?

A4: No.

Q5) Do you see any separation in the samples when the PCA is coloured by the province codes? If so, describe which PC axis differentiates these samples.

A5: Yes, there is a slight separation between the Arctic samples and the Northwest Atlantic samples. The separation is seen in the PC1 vs PC2 plot and occurs partly along PC2.

Q6) In a “multiple group test” using ANOVA with no multiple test correction how many species are statistically significant?

A6: 4

Q7) How many are still significant in the “two group test” using White’s non-parametric t-test without and with Benjamini-Hochberg FDR for multiple test correction?

A7: Without Benjamini-Hochberg FDR: 27. With: 23.

Q8) What are the top 3 Modules present in the 1m sample from the Bedford basin (station 152)?

A8: You can find this out from the kos.spf file that we generated. You can view the kos.spf file using less and manually count the column number where the OSD152 1m sample occurs. In this case it is 13.

less kos.spf
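Counting columns by eye is easy to get wrong, so the column number can also be looked up from the header row. This is a sketch under the assumption that the first line of the file is a tab-separated header holding the sample names; `find_column` is a hypothetical helper, and the sample labels in the demo file are made up and may not match the labels in kos.spf.

```shell
# Hypothetical helper: print "N:name" for each header field matching a
# pattern, where N is the column number. $1 = file, $2 = pattern.
find_column() {
    # Put each header field on its own line, then let grep -n number them.
    head -n 1 "$1" | tr '\t' '\n' | grep -n "$2"
}

# Made-up demo file with three columns.
demo=$(mktemp)
printf 'KO\tOSD10_1m\tOSD152_1m\nK00001\t0.1\t0.2\n' > "$demo"
find_column "$demo" 'OSD152'   # prints 3:OSD152_1m
rm -f "$demo"
```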

Now we will cut out that column (which contains the abundances of the modules for OSD152 1m), sort it by abundance value, and list the 10 most abundant values

cut -f 13 kos.spf | sort -gr | head

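Next we take the most abundant value (0.00419118 in this case) and grep for it in columns 1 and 13 to find the matching entry in the 1st column (which contains the name). The -F option makes grep treat the value as a literal string, so the "." is not read as a wildcard

cut -f 1,13 kos.spf | grep -F 0.00419118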

You can do the same for the top 3 abundance values.
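The lookup can also be done in one pass by keeping the name column next to the abundance column while sorting. A minimal sketch, assuming a single header row, tab-separated columns, and a sort that supports -g (as the command above already does); `top_n` is a hypothetical helper and the demo file is made up.

```shell
# Hypothetical helper: print the N rows with the highest values in one
# column, as "name<TAB>value". $1 = file, $2 = column number, $3 = N.
top_n() {
    tab=$(printf '\t')
    # Drop the header, keep the name and value columns, then sort on the
    # value (field 2 after cut) in descending general-numeric order.
    # -g also handles values in scientific notation such as 1e-05.
    tail -n +2 "$1" | cut -f "1,$2" | sort -t "$tab" -k 2,2gr | head -n "$3"
}

# Made-up demo file: names in column 1, abundances in columns 2 and 3.
demo=$(mktemp)
printf 'KO\tS1\tS2\nK1\t0.1\t0.003\nK2\t0.2\t0.004\nK3\t0.3\t1e-05\nK4\t0.4\t0.002\n' > "$demo"
top_n "$demo" 3 2   # prints the K2 row, then the K1 row
rm -f "$demo"
```

On the workshop data the call would be `top_n kos.spf 13 3`.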

The corresponding top 3 module names are as follows:

K07497: putative transposase

K00540

K00525: ribonucleoside-diphosphate reductase alpha chain

Q9) In the STAMP analysis of the Humann results using a two group test with no multiple test correction applied how many significant differences are seen between the Arctic and Northwest Atlantic samples?

A9: 645

Q10) What happens when the p-value cut-off is lowered to 0.01 for Q9?

A10: The number of significant differences decreases to 154.

Q11) What is the most significantly different KEGG orthology group? What is the p-value for this KO?

A11: K01130: arylsulfatase. P-value: 1.10e-4

Q12) Change the p-value to 0.001 and create an “Extended error bar” plot and save the image as a .png using the File->Save Plot option.

A12: [Image: extended error bar plot with the p-value cut-off set to 0.001]
