Lab Module 6 - Somatic Mutations
First login into the server.
Now enter the ~/workspace directory
Create a directory for this module and enter this directory:
In this section, we will set some environment variables to help facilitate the execution of commands. These variables will set to the location of some important files we need for the commands in this module. One important thing to remember is that:
These variables that you set will only persist in this current session you are in. If you log out and log back into the server, you will have to set these variables again.
Set the directory of where MutationSeq is installed:
Set the directory of where SnpEff is installed:
Summary of all the above environment commands (for copy and pasting convenience):
Retrieving the Data
We will be using the same exome data on the HCC1395 breast cancer cell line that was used in Module 5 (Copy Number Alterations). The tumour and normal bam files have been placed on the server. Create a link to the location of the bam files to allow us to access them quickly.
ln -s /home/ubuntu/CourseData/CG_data/HCC1395
Additionally, we will need the reference genome for the mutation calling. This has also been placed on the server. We shall create a link to this file for easy access:
ln -s /home/ubuntu/CourseData/CG_data/ref_data
We will restrict our analysis to a 1 Mb region (7Mb and 8Mb) within chromosome 17. These bam files along with their indices have been generated for you already (please see the data preparation page for details on how to generate these smaller bams):
You should see:
If you are unfamiliar with the sam/bam format, take some time to look at the alignments and metadata.
The header information contains information about the reference genome and read groups (per read information about sample/flow cell/lane etc).
samtools view -H HCC1395/exome/HCC1395_exome_tumour.17.7MB-8MB.bam | less -S
Press “q” to escape the less command when you are done viewing. The main contents of the file contain read alignments, tab separated, one line per alignment.
samtools view HCC1395/exome/HCC1395_exome_tumour.17.7MB-8MB.bam | less -S
Samtools will also calculate statistics about the reads and alignments. Unfortunately this information is not cached, and thus this command will take considerable time on a regular sized bam.
samtools flagstat HCC1395/exome/HCC1395_exome_tumour.17.7MB-8MB.bam
We will first call mutations using Strelka. Create a local copy of the Strelka config file. Strelka provides aligner specific config files for bwa, eland, and isaac. Each file contains default configuration parameters that work well with the aligner. The bam files we are working with were created using bwa, so we select that config file and make a local copy to make changes.
cp /usr/local/etc/strelka_config_bwa_default.ini config/strelka_config_bwa.ini
Since we will be using exome data for this, we need to change the
isSkipDepthFilters parameter in the strelka_config_bwa.ini file. Let’s create a new config file for exome analysis:
cp config/strelka_config_bwa.ini config/strelka_config_bwa_exome.ini
Now let’s edit the
config/strelka_config_bwa_exome.ini and change the
1. We will use the text editor nano for this:
Please let us know if you have any issues editing the file. The reason why we do this is described on the Strelka FAQ page:
The depth filter is designed to filter out all variants which are called above a multiple of the mean chromosome depth, the default configuration is set to filter variants with a depth greater than 3x the chromosomal mean. If you are using exome/targeted sequencing data, the depth filter should be turned off…
However in whole exome sequencing data, the mean chromosomal depth will be extremely small, resulting in nearly all variants being (improperly) filtered out.
If you were doing this for whole genome sequencing data, then you should leave this parameter set to 0 as the depth of coverage won’t be as high.
A Strelka analysis is performed in 2 steps. In the first step we provide Strelka with all the information it requires to run the analysis, including the tumour and normal bam filenames, the config, and the reference genome. Strelka will create an output directory with the setup required to run the analysis.
--tumor HCC1395/exome/HCC1395_exome_tumour.17.7MB-8MB.bam \
--normal HCC1395/exome/HCC1395_exome_normal.17.7MB-8MB.bam \
--ref ref_data/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
--config config/strelka_config_bwa_exome.ini \
The output directory will contain a makefile that can be used with the tool make. One benefit of the makefile style workflow is that it can be easily parallelized using qmake on a grid engine cluster.
To run the Strelka analysis, use make and specify the directory constructed by
configureStrelkaWorkflow.pl with make’s
make -C results/strelka/ -j 2
2 parameter specifies that we want to use 2 cores to run Strelka. Change this number to increase or decrease the parallelization of the job. The more cores the faster the job will be, but the higher the computational load on the machine that is running Strelka.
If you have access to a grid engine cluster, you can replace the command
qmaketo launch Strelka on the cluster. The
-jparameter also works in qmake.
Strelka has the benefit of calling SNVs and small indels. Additionally, Strelka calculates variant quality and filters the data in two tiers. The filenames starting with
passed contain high quality candidates, and filenames starting with
all contain high quality and marginal quality candidates.
The Strelka results are in VCF format, with additional fields explained on the Strelka website.
less -S results/strelka/results/passed.somatic.snvs.vcf
less -S results/strelka/results/passed.somatic.indels.vcf
For this lab, we will only be focusing on the SNV file. Explanation of the fields in the VCF are explained in the header of the VCF file.
Now we will try another mutation caller called MutationSeq. MutationSeq uses supervised learning (random forest) to classify each mutation as true or artifact. The training set consists of over 1000 known true and positive mutations validated by deep SNV sequencing. The trained model is provided with the MutationSeq package, but must be provided on the command line using the
Running MutationSeq is a one step process.
mkdir -p results/mutationseq
python $MUTATIONSEQ_DIR/classify.py \
-i 17:7000000-8000000 \
-c $MUTATIONSEQ_DIR/metadata.config -q 1 -o results/mutationseq/HCC1395.museq.vcf \
-l results/mutationseq/HCC1395.log --all &
Some parameters to take note of:
-ito specify the region
1to remove ambiguously mapping reads
--allto print variants that are classified as low probability
This job will run in the background. You can check its progress by going:
less -S results/mutationseq/HCC1395.log
When MutationSeq has finished, the results will be provided in VCF format:
less -S results/mutationseq/HCC1395.museq.vcf
Filtering on these results are left to the end-user. Explanation of the fields in the VCF are explained in the header of the VCF file.
Converting the VCF format into a tabular format
The VCF format, while being the de facto standard for SNV results, is sometimes not useful for visualization and data exploration purposes which often requires the data to be in tabular format. We can convert from VCF format to tabular format using the extractField() function from SnpEff/SnpSift program. Since each mutation caller has a different set of output values in the VCF file, the command needs be adjusted for the mutation caller.
For example, to convert the Strelka VCF file into a tabular format:
java -jar $SNPEFF_DIR/SnpSift.jar extractFields -e "." results/strelka/results/passed.somatic.snvs.vcf \
CHROM POS ID REF ALT QUAL FILTER QSS TQSS NT QSS_NT TQSS_NT SGT SOMATIC \
GEN.DP GEN.DP GEN.FDP GEN.FDP GEN.SDP GEN.SDP GEN.SUBDP \
GEN.SUBDP GEN.AU GEN.AU GEN.CU GEN.CU GEN.GU GEN.GU GEN.TU \
GEN.TU > results/strelka/results/passed.somatic.snvs.txt
To convert the MutationSeq VCF file into a tabular format:
java -jar $SNPEFF_DIR/SnpSift.jar extractFields -e "." results/mutationseq/HCC1395.museq.vcf \
CHROM POS ID REF ALT QUAL FILTER PR TR TA NR NA TC NI ND > \
The -e parameter specifies how to represent empty fields. In this case, the “.” character is placed for any empty fields. This facilitates loading and completeness of data. For more details on the extractField() function see the SnpSift documentation.
Integrative Genomics Viewer (IGV)
A common step after prediction of SNVs is to visualize these mutations in IGV. Let’s load these bam into IGV. Open IGV, then:
- Change the genome to hg19 (if it isn’t already)
- File > Load from URL …
Where the xx is your student number. Once the tumour and normal bam have been loaded into IGV, we can investigate a few predicted positions in IGV:
- 17:7491818: Predicted by both Strelka and MutationSeq
- 17:7578406: Predicted only by Strelka
- 17:7482929: Potential false positive predicted by Strelka
Manually inspecting these predicted SNVs in IGV is a good way to verify the predictions and also identify potential false positives:
When possible, you should always inspect SNVs in IGV
Analyzing the SNV Results in R
While IGV is good for visualizing individual mutations, looking at more global characteristics would require loading the data into an analysis language like R:
We will use exome-wide SNV predictions for Strelka and MutationSeq for these analyses. These processed tabular text files along with the
analyze-SNV.Rmd RMarkdown file that contains the R code for the analysis can be downloaded from the wiki (SNV Data Analysis Package).
Download the package, extract it and then open the
analyze-SNV.Rmd file in RStudio now.