Module 2 Lab: ChIP-Seq Alignment, Peak Calling (Enrichment Regions Detection), and Visualisation

By Misha Bilenky

Reference Genomes

Genomes for different species can be found at:

On Guillimin

cd /cvmfs/ref.mugqic/genomes/species/
ls -l

file list

cd ./Homo_sapiens.GRCh37/genome
ls -l

Which are the genome fasta file and index?

Using BWA

BWA executable is installed for you

Let’s define a variable

BWA=/cvmfs/soft.mugqic/CentOS6/software/bwa/bwa-0.7.12/bwa

Try to run $BWA without parameters

What happens?

Alignment Example

Toy example:

49990 reads
1.7M file size

located at:

/gs/project/mugqic/bioinformatics.ca/epigenomics/chip-seq/H1/data/H3K27ac/H3K27ac.H1.fastq.gz

Setting up the Variables

data=/gs/project/mugqic/bioinformatics.ca/epigenomics/chip-seq/H1/data/
genome=/cvmfs/ref.mugqic/genomes/species/Homo_sapiens.GRCh37/genome/bwa_index/Homo_sapiens.GRCh37.fa
mark=H3K27ac
f=$data/$mark/$mark.H1.fastq.gz
out=~/test/H3K27ac; mkdir -p $out
n=$mark.H1

Note: By changing H3K27ac to H3K36me3 you will align for another mark

Alignment Step: bwa aln

Try:

$BWA aln

You will see all the options/parameters and default values

Important parameters

Seed length

-l INT seed length [32]

Number of Mismatches

-k INT maximum differences in the seed [2]

Try:

$BWA aln $genome $f > $out/$n.sai

The .sai file is an intermediate file containing the suffix array indexes. Such file is afterwards translated into a SAM file.

NOTE if we align data from pair-end experiment we need to do alignment of both read1 and read2

As an example:

$BWA aln $genome $f1 > $out/$n1.sai
$BWA aln $genome $f2 > $out/$n2.sai

Alignment Step: Translation into SAM file

$BWA samse -f $out/$n.sam $genome $out/$n.sai $f

Resulting is a SAM file with a proper header etc.

To see the file:

less $out/$n.sam

To quit less, press q.

NOTE in case of pair end data

$BWA sampe [options] <genome> <in1.sai> <in2.sai> <in1.fq> <in2.fq> > out.sam

Alignment Step: Conversion of SAM to BAM and Sorting

Let’s define samtools variable:

SAMTOOLS=/cvmfs/soft.mugqic/CentOS6/software/samtools/samtools-1.3/bin/samtools

And convert the SAM to BAM and sort the BAM

$SAMTOOLS view -Sb $out/$n.sam > $out/$n.bam
$SAMTOOLS sort $out/$n.bam > $out/$n.sorted.bam

This was position sorting; Option -n gives name sorted BAM file

Now we can view bam file

$SAMTOOLS view -h $out/$n.sorted.bam | more

Check the size of SAM/BAM files

ls -lh $out

We can delete intermediate files (sai sam and unsorted bam):

rm $out/*.sa*
rm $out/$n.bam 

Alignment Step: Marking Duplicates with Picard

Define location of PICARD jars:

PICARD=/cvmfs/soft.mugqic/CentOS6/software/picard/picard-tools-1.123/

Mark the duplicates

cd $out
java -jar -Xmx20G $PICARD/MarkDuplicates.jar I=$out/$n.sorted.bam O=$out/$n.sorted.dupsMarked.bam M=dups AS=true VALIDATION_STRINGENCY=LENIENT QUIET=true

Alignment Step: BAM file Statistics and Indexing

$SAMTOOLS flagstat $out/$n.sorted.dupsMarked.bam > $out/$n.flagstat
less $out/H3K27ac.H1.flagstat

Flagstat contains different BAM file statistics; check the file

$SAMTOOLS index $out/$n.sorted.dupsMarked.bam

Index $out/$n.sorted.dupsMarked.bam.bai file is generated; it allows samtool access BAM file from a given location

Alignment Step: Generating wig File

BAM2WIG java tool

Define:

BAM=$out/H3K27ac.H1.sorted.dupsMarked.bam
SAMTOOLS=/cvmfs/soft.mugqic/CentOS6/software/samtools/samtools-0.1.19/samtools 

BIN=/gs/project/mugqic/bioinformatics.ca/epigenomics/chip-seq/bin/

java -jar -Xmx2G $BIN/BAM2WIG.jar -bamFile $BAM -out $out -q 5 -F 1028 -cs -x 150 -samtools $SAMTOOLS > $out/wig.log

cat $out/wig.log

zcat $out/H3K27ac.H1.sorted.dupsMarked.q5.F1028.SET_150.wig.gz | head

Note: if wig.log says Output dir is not defined try retyping the java command (do not copy and paste it).

Alignment Step: Visualization

We would need to add a track header to the file

cd $out
gunzip *.wig.gz
H=/gs/project/mugqic/bioinformatics.ca/epigenomics/chip-seq/H1/data/H3K27ac/wig_track_header
cp $H $out/test.wig
less $out/H3K27ac.H1.sorted.dupsMarked.q5.F1028.SET_150.wig >> $out/test.wig
gzip $out/test.wig
zcat $out/test.wig.gz | head

Move test.wig to your local computer using WinSCP or scp.

In your browser, open http://genome.ucsc.edu

Select genomes and choose hg19.

Click the Manage custom tracks button.

Add custom track and browse to $OUT/test.wig.gz

Region of interest

Browse to the region of interest: chr3:43375889-45912052

Enriched Regions

We should use full size bam files in order to call enrichments (or at least bam file for a whole chromosome) as tools infer different distributions from the data and we need to have enough statistics

Run FindER on H3K27ac (just chr3)

First, set some variables.

mark=H3K27ac
dir=/gs/project/mugqic/bioinformatics.ca/epigenomics/chip-seq/H1/
sig=$dir/original/chr3/$mark.H1.bam
inp=$dir/original/chr3/Input.H1.bam
BIN=/gs/project/mugqic/bioinformatics.ca/epigenomics/chip-seq/bin/
out=~/test/FindER; mkdir -p $out
SAMTOOLS=/cvmfs/soft.mugqic/CentOS6/software/samtools/samtools-0.1.19/samtools

Now, run the command.

java -jar -Xmx10G $BIN/FindER_1_0_0.jar -signalBam $sig -inputBam $inp -SE -xsetI 150 -xsetS 150 -bin 150 -v -out $out -samtools $SAMTOOLS -regions 3 &> $out/$mark.log

Run MACS2

Setting up Python environment

export PATH=/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.8/bin:/cvmfs/soft.mugqic/CentOS6/software/MACS2/MACS2-2.1.0.20151222/bin:$PATH
export PYTHONPATH=/cvmfs/soft.mugqic/CentOS6/software/MACS2/MACS2-2.1.0.20151222/lib/python2.7/site-packages:$PYTHONPATH

out=~/test/macs2; mkdir -p $out

macs2 callpeak -t $sig -c $inp -f BAM -g hs -n $out/$mark.v.Input.chr3 -B -q 0.01 &> $out/$mark.v.Input.chr3.log

Running macs in a standard (narrow peak mode) with a q-value threshold 0.01

Broad mode (for H3K36me3, H3K27me3, for example)

For example:

macs2 callpeak -t ChIP.bam -c Control.bam --broad -g hs --broad-cutoff 0.1
# where ChIP.bam is your ChIP-Seq bam and Control.bam is your control input bam

Load MACS2/FindER track to UCSC

Post-process mac2 file

cd $out
cp /gs/project/mugqic/bioinformatics.ca/epigenomics/chip-seq/H1/macs2/macs_track macs2.bed
less H3K27ac.v.Input.chr3_peaks.narrowPeak | cut -f1-3 | awk '{print "chr"$0}' >> macs2.bed

Move both files to your local computer using WinSCP or scp.

Now load both files into UCSC

macs output file: macs2.bed

FindER output file: H3K27ac.H1.vs.Input.H1.bin_150.FDR_0.05.FindER.bed.gz

You should now see BED tracks.

BED added

View on GitHub

Epigenomic Data Analysis 2016

Module 2 Lab