Informatics for High-Throughput Sequencing Data 2016

Module 6 lab

This lab was created by Guillaume Bourque.

Table of contents

  1. Overview
  2. Align DNA with BWA-MEM
  3. Characterize the fragment size distribution
  4. Run DELLY to detect SVs
  5. Setting up IGV for SV visualization
  6. Explore the SVs
  7. Overall script
  8. (Optional) Look for other SVs
  9. Acknowledgements


The goal of this practical session is to identify structural variants (SVs) in a human genome by identifying both discordant paired-end alignments and split-read alignments that. If you recall from the lecture, discordant paired-end alignments conflict with the alignment patterns that we expect (i.e., concordant alignments) for the DNA library and sequencing technology we have used. For example, given a ~500bp paired-end Illumina library, we expect pairs to align in F/R orientation and we expect the ends of the pair to align roughly 500bp apart. Pairs that align too far apart suggest a potential deletion in the DNA sample’s genome. As you may have guessed, the trick is how we define what “too far” is — this depends on the fragment size distribution of the data. Split-read alignments contain SV breakpoints and consequently, then DNA sequences up- and down-stream of the breakpoint align to disjoint locations in the reference genome.

In this session, we will use DELLY, a SV detection tool. DELLY is an integrated structural variant prediction method that can discover, genotype and visualize deletions, tandem duplications, inversions and translocations at single-nucleotide resolution in short-read massively parallel sequencing data. It uses paired-ends and split-reads to sensitively and accurately delineate genomic rearrangements throughout the genome. If you are interested in DELLY, you can read the full manuscript here.

The dataset we are using comes from the Illumina Platinum Genomes Project, which is a 50X-coverage dataset of the NA12891/NA12892/NA12878 trio. The raw data can be downloaded from the following URL.


Our focus will be on chromosome 20, as processing three whole human genomes worth of data is intractable given the time allowed for this session.


Amazon node

Read these directions for information on how to log in to your assigned Amazon node.

Work directory

Let’s create a new work directory for this module within our workspace:

rm -rf ~/workspace/module6
mkdir -p ~/workspace/module6
cd ~/workspace/module6

Create symbolic links to all of the files that are relevant to the SV discovery exercise.

ln -s ~/CourseData/HT_data/Module6/* .

Note: The ln -s command adds symbolic links of all of the files contained in the (read-only) ~/CourseData/HT_data/Module6 directory.

In this step, we created a new directory that will store all of the files created in this lab. The ln -s command adds symbolic links of all of the files contained in the ~/CourseData sub-directories that are relevant to the SV module. This directory contains all of the alignment files (in BAM format, as you will recall) and genome annotations that we’ll need during the lab.

At this point you should have the following files:

ubuntu@ip-10-164-192-186:~/workspace/module6$ ls
bam  delly_call  fastq  pairend_distro.py  reference  RunModule6.sh

Looking in the bam directory, you should see:

ubuntu@ip-10-164-192-186:~/workspace/module6$ ls bam
backup                                    NA12878_S1.chr20.20X.pairs.posSorted.bam.bai
discordants                               NA12891_S1.chr20.20X.pairs.posSorted.bam
NA12878.molelculo.chr20.bam               NA12891_S1.chr20.20X.pairs.posSorted.bam.bai
NA12878.molelculo.chr20.bam.bai           NA12892_S1.chr20.20X.pairs.posSorted.bam
NA12878.pacbio.chr20.bam                  NA12892_S1.chr20.20X.pairs.posSorted.bam.bai
NA12878.pacbio.chr20.bam.bai              README.md
NA12878_S1.chr20.20X.pairs.posSorted.bam  validated

Align DNA with BWA-MEM

This step has been done for you in the interest of time, but the commands are shown so that you can reproduce the results later. The advantage of using BWA-MEM in the context of SV discovery is that it produces both paired-end and split-read alignments in a single BAM output file. In contrast, prior to BWA-MEM, one typically had to use two different aligners in order to produce both high quality paired-end and split-read alignments.

In the alignment commands, note the use of the -M parameter to mark shorter split hits as secondary.

 # Align NA12878 #
 # bwa mem hg19.fa \
 #         fastq/NA12878_S1.chr20.20X.1.fq \
 #         fastq/NA12878_S1.chr20.20X.2.fq \
 #         -M \
 #   | samtools view -S -b - \
 #   > bam/backup/NA12878_S1.chr20.20X.pairs.readSorted.bam
 # Align NA12891 #
 # bwa mem hg19.fa \
 #         fastq/NA12891_S1.chr20.20X.1.fq \
 #         fastq/NA12891_S1.chr20.20X.2.fq \
 #         -M \
 #   | samtools view -S -b - \
 #   > bam/backup/NA12891_S1.chr20.20X.pairs.readSorted.bam
 # Align NA12892 #
 # bwa mem hg19.fa \
 #         fastq/NA12892_S1.chr20.20X.1.fq \
 #         fastq/NA12892_S1.chr20.20X.2.fq \
 #         -M \
 #   | samtools view -S -b - \
 #   > bam/backup/NA12892_S1.chr20.20X.pairs.readSorted.bam

Characterize the fragment size distribution

We have used BWA-MEM to align all of the Illumina paired-end sequences (in FASTQ format) to the human genome. Before we can attempt to identify structural variants via discordant alignments, we must first characterize the fragment size distribution — this describes the size of concordant (i.e., they align in the expected orientation and with the expected distance between the ends) alignments, and the corollary is that we can also use the size distribution to decide the size threshold for discordant alignments. The following script, taken from the distribution of LUMPY extracts F/R pairs from a BAM file and computes the mean and stdev of the F/R alignments. It also generates a density plot of the fragment size distribution.

Calculation of the fragment distribution for NA12878.

samtools view bam/backup/NA12878_S1.chr20.20X.pairs.readSorted.bam \
   | ./pairend_distro.py -r 101 -X 4 -N 10000 \
   -o NA12878_S1.chr20.20X.pairs.histo \
   > NA12878_S1.chr20.20X.pairs.params

Calculation of the fragment distribution for NA12891.

samtools view bam/backup/NA12891_S1.chr20.20X.pairs.readSorted.bam \
   | ./pairend_distro.py -r 101 -X 4 -N 10000 \
   -o NA12891_S1.chr20.20X.pairs.histo \
   > NA12891_S1.chr20.20X.pairs.params

Calculation of the fragment distribution for NA12892.

samtools view bam/backup/NA12892_S1.chr20.20X.pairs.readSorted.bam \
   | ./pairend_distro.py -r 101 -X 4 -N 10000 \
   -o NA12892_S1.chr20.20X.pairs.histo \
   > NA12892_S1.chr20.20X.pairs.params

At this point you should have the following files:

ubuntu@ip-10-164-192-186:~/workspace/module6$ ls
bam         NA12878_S1.chr20.20X.pairs.histo   NA12891_S1.chr20.20X.pairs.params  pairend_distro.py
delly_call  NA12878_S1.chr20.20X.pairs.params  NA12892_S1.chr20.20X.pairs.histo   reference
fastq       NA12891_S1.chr20.20X.pairs.histo   NA12892_S1.chr20.20X.pairs.params  RunModule6.sh

Let’s take a peak at the first few lines of the histogram file that was produced:

head -n 10 NA12878_S1.chr20.20X.pairs.histo

Expected results:

0   0.0
1	0.000200461060439
2	0.000300691590659
3	0.000300691590659
4	0.000200461060439
5	0.000200461060439
6	0.000300691590659
7	0.00010023053022
8	0.00010023053022
9	0.00010023053022

Let’s use R to plot the fragment size distribution. First, launch R from the command line.


Now, within R, execute the following commands:

size_dist <- read.table('NA12878_S1.chr20.20X.pairs.histo')   
 pdf(file = "NA12878.fragment.hist.pdf")                    
 plot(size_dist[,1], size_dist[,2], type='h')         
 size_dist <- read.table('NA12891_S1.chr20.20X.pairs.histo') 
 pdf(file = "NA12891.fragment.hist.pdf")           
 plot(size_dist[,1], size_dist[,2], type='h') 
 size_dist <- read.table('NA12892_S1.chr20.20X.pairs.histo')  
 pdf(file = "NA12892.fragment.hist.pdf")               
 plot(size_dist[,1], size_dist[,2], type='h')      

At this point, you should have the following files:

ubuntu@ip-10-164-192-186:~/workspace/module6$ ls
bam                               NA12878_S1.chr20.20X.pairs.params  NA12892_S1.chr20.20X.pairs.histo
delly_call                        NA12891.fragment.hist.pdf          NA12892_S1.chr20.20X.pairs.params
fastq                             NA12891_S1.chr20.20X.pairs.histo   pairend_distro.py
NA12878.fragment.hist.pdf         NA12891_S1.chr20.20X.pairs.params  reference
NA12878_S1.chr20.20X.pairs.histo  NA12892.fragment.hist.pdf          RunModule6.sh

If successful, you should be able to access the 3 PDF files at the following URLs:


Note: Copy and paste the URL below into your browser, but change the “XX” in the URL to your student number.

Spend some time thinking about what this plot means for identifying discordant alignments.

What does the mean fragment size appear to be?

Are all 3 graphs the same?

Is that good or bad?

Run DELLY to detect SVs

For germline SVs, calling is done by sample or in small batches to increase SV sensitivity & breakpoint precision. At this point we will only call deletions. Here are the steps adapted from the DELLY readme.

Call deletions

Let’s start with NA12878:

delly call -t DEL -g reference/hg19.fa -o NA12878.bcf -x reference/hg19.excl \

.bcf files are compressed vcf files. To look at the output you can use:

bcftools view NA12878.bcf | less -S

Now NA12891 and NA12892:

delly call -t DEL -g reference/hg19.fa -o NA12891.bcf -x reference/hg19.excl \
delly call -t DEL -g reference/hg19.fa -o NA12892.bcf -x reference/hg19.excl \

Cheat: If these commands are taking too long, simply run the command cp delly_call/* .

Merge calls

We need to merge the SV sites into a unified site list:

delly merge -t DEL -m 500 -n 1000000 -o del.bcf -b 500 -r 0.5 \
  NA12878.bcf NA12891.bcf NA12891.bcf

Re-genotype in all samples

We need to re-genotype the merged SV site list across all samples. This can be run in parallel for each sample.

delly call -t DEL -g reference/hg19.fa -v del.bcf -o NA12878.geno.bcf \
  -x reference/hg19.excl bam/NA12878_S1.chr20.20X.pairs.posSorted.bam
delly call -t DEL -g reference/hg19.fa -v del.bcf -o NA12891.geno.bcf \
  -x reference/hg19.excl bam/NA12891_S1.chr20.20X.pairs.posSorted.bam
delly call -t DEL -g reference/hg19.fa -v del.bcf -o NA12892.geno.bcf \
  -x reference/hg19.excl bam/NA12892_S1.chr20.20X.pairs.posSorted.bam

Merge the new calls

Merge all re-genotyped samples to get a single VCF/BCF using bcftools merge. Also index the resulting file and create vcf file for visualization.

bcftools merge -O b -o merged.bcf NA12878.geno.bcf NA12891.geno.bcf NA12892.geno.bcf
bcftools index merged.bcf
bcftools view merged.bcf > merged.vcf

Apply a filter for germline events

delly filter -t DEL -f germline -o germline.bcf -g reference/hg19.fa merged.bcf
bcftools view germline.bcf > germline.vcf

Do you know how to look at the resulting file?

File check

At this point you should have the following files:

ubuntu@ip-10-164-192-186:~/workspace/module6$ ls
bam               NA12878.bcf.csi                    NA12891_S1.chr20.20X.pairs.params
del.bcf           NA12878.fragment.hist.pdf          NA12892.bcf
del.bcf.csi       NA12878.geno.bcf                   NA12892.bcf.csi
delly_call        NA12878.geno.bcf.csi               NA12892.fragment.hist.pdf
fastq             NA12878_S1.chr20.20X.pairs.histo   NA12892.geno.bcf
germline.bcf      NA12878_S1.chr20.20X.pairs.params  NA12892.geno.bcf.csi
germline.bcf.csi  NA12891.bcf                        NA12892_S1.chr20.20X.pairs.histo
germline.vcf      NA12891.bcf.csi                    NA12892_S1.chr20.20X.pairs.params
merged.bcf        NA12891.fragment.hist.pdf          pairend_distro.py
merged.bcf.csi    NA12891.geno.bcf                   reference
merged.vcf        NA12891.geno.bcf.csi               RunModule6.sh
NA12878.bcf       NA12891_S1.chr20.20X.pairs.histo

Setting up IGV for SV visualization

Launch IGV and load the merged calls and the germline calls using File -> Load from URL using:


Note: Once again you will need to replace XX by your student number.

Navigate to the following location to see a deletion:


Now load the bam files in the same way using:


You should see something like this:


You can try to configure IGV such that we can more clearly see the alignments that support the SV prediction.

Do you remember how to make sure that the alignments are colored by insert size and orientation?

To further validate that there is a deletion there you can also another track which was generate using the Moleculo long read technology:


Explore the SVs

Is the variant at chr20:31,310,769-31,312,959 found in each member of the trio?

What are the genotypes for each member of the trio at this locus (e.g., hemizygous, homozygous)?

What about the variant at chr20:37,054,372-37,056,562?

Does the evidence in the Moleculo track mimic the evidence in the Illumina track for NA12878?

What about chr20:42,269,896-42,278,072?

Continue exploring the data!

Overall script

NOTE: If you ever become truly lost in this lab, you can use the lab script to automatically perform all of the steps listed here. If you are logged into your CBW account, just run: ~/CourseData/HT_data/Module6/RunModule6.sh. You can also download the file if you want to bring it home with you.

(Optional) Look for other SVs

You can try using Delly to look for other types of SVs, for example using:

delly -t DUP

delly -t INV

delly -t TRA


This module is heavily based on a previous module prepared by Aaron Quinlan.

View on GitHub