logo

Bioinformatics for Cancer Genomics Prediction Lab Run 2015

Workshop pages for students


General Setup

Create tutorial directory structure on the instance

Create a directory in the temporary workspace and change to that directory.

mkdir /mnt/workspace/Module4/
cd /mnt/workspace/Module4/

Gene fusion prediction tools have been installed on the instance, create links to the installed binaries, libraries, and reference genome data.

ln -s /media/cbwdata/CG_data/Module4/tutorial_data/bin
ln -s /media/cbwdata/CG_data/Module4/tutorial_data/lib/
ln -s /media/cbwdata/CG_data/Module4/tutorial_data/packages
ln -s /media/cbwdata/CG_data/Module4/tutorial_data/refdata
ln -s /media/cbwdata/CG_data/Module4/tutorial_data/share

Set the $TUTORIAL_HOME environment variable so subsequent commands know the locatioin of installed binaries, libraries, and reference genome data. Also, subsequent analyses will fall into subdirectories of $TUTORIAL_HOME.

TUTORIAL_HOME=/mnt/workspace/Module4

Environment setup

For this tutorial, we now assume the environment variable $TUTORIAL_HOME has been set to an existing directory to which the user has write access.

Add tutorial binaries to the path.

export PATH=$TUTORIAL_HOME/bin:$PATH

Create a variable pointing to the directory of the picard tools install.

PICARD_DIR=$TUTORIAL_HOME/packages/picard-tools-1.130/

Create a variable pointing to the directory of the igvtools install.

IGVTOOLS_DIR=$TUTORIAL_HOME/packages/IGVTools/

Install additional libraries

R ada for deFuse

Install the additional R package ‘ada’ locally.

mkdir /mnt/workspace/Module4/rlib/

Rscript -e "install.packages('ada', \
    repos='http://cran.us.r-project.org', \
    lib='/mnt/workspace/Module4/rlib/')"

Add the local R library path to the $R_LIBS_USER environment variable. Use export so that it is available to subprocesses, in particular the R subprocess.

export R_LIBS_USER=/mnt/workspace/Module4/rlib/

Set::IntervalTree for STAR-Fusion

Install the Set::IntervalTree perl module required by star-fusion.

mkdir $TUTORIAL_HOME/perlpackages
cd $TUTORIAL_HOME/perlpackages

git clone https://github.com/amcpherson/Set-IntervalTree.git
cd Set-IntervalTree

mkdir $TUTORIAL_HOME/perllib
perl Makefile.PL PREFIX=$TUTORIAL_HOME/perllib LIB=$TUTORIAL_HOME/perllib
make
make install

Add the local perl library path to the $PERL5LIB environment variable. Use export so that it is available to perl subprocesses.

export PERL5LIB=$PERL5LIB:$TUTORIAL_HOME/perllib

Python jinja2 for chimerascan

First install the jinja2 python package locally.

pip install jinja2 -t /mnt/workspace/Module4/pythonlib/

Add the install directory to the python path environment variable so that python can find the package.

export PYTHONPATH=$PYTHONPATH:$TUTORIAL_HOME/pythonlib/

Environment setup

Create variables for the reference genome and gene models.

TUTORIAL_CHROMOSOME=1
ENSEMBL_GTF_FILENAME=$TUTORIAL_HOME/refdata/Homo_sapiens.GRCh37.75.gtf
GENCODE_GTF_FILENAME=$TUTORIAL_HOME/refdata/gencode.v19.annotation.gtf
ENSEMBL_GENOME_FILENAME=$TUTORIAL_HOME/refdata/Homo_sapiens.GRCh37.75.dna.chromosome.$TUTORIAL_CHROMOSOME.fa
UCSC_GENOME_FILENAME=$TUTORIAL_HOME/refdata/chr$TUTORIAL_CHROMOSOME.fa
ENSEMBL_GENOME_PREFIX=$TUTORIAL_HOME/refdata/Homo_sapiens.GRCh37.75.dna.chromosome.$TUTORIAL_CHROMOSOME
UCSC_GENOME_PREFIX=$TUTORIAL_HOME/refdata/chr$TUTORIAL_CHROMOSOME

For gmap, set a directory for the gmap indices, and the id of the reference.

GMAP_REF_ID=chr$TUTORIAL_CHROMOSOME
GMAP_INDEX_DIR=$TUTORIAL_HOME/refdata/gmap/

HCC1395 sample data

Environment setup

We will create environment variables pointing to the data on the instance, and set a sample id for the data.

We will be working with a publically available HCC1395 breast cancer cell line sequenced for teaching and benchmarking purposes. The dataset is available from washington university servers, and can be accessed via the github page for the Genome Modeling System.

Special thanks to Malachi Griffith for providing access to the data.

SAMPLE_ID=HCC1395
SAMPLE_FASTQ_1=/media/cbwdata/CG_data/HCC1395/rnaseq/SAMPLE_R1.fastq
SAMPLE_FASTQ_2=/media/cbwdata/CG_data/HCC1395/rnaseq/SAMPLE_R2.fastq

Running times for each tool

chimerascan: 12 minutes

defuse: 18 minutes

star: 11 minutes

tophatfusion: 17 minutes

trinity: 12 minutes

Run the ChimeraScan fusion prediction tool

Environment setup

Add local library to python path

export PYTHONPATH=$PYTHONPATH:$TUTORIAL_HOME/lib/python2.7/site-packages

Set variable for index directory.

CHIMERASCAN_INDEX=$TUTORIAL_HOME/refdata/chimerascan/indices/

Execution

Run chimerascan by providing the index directory, the pair of fastq files, and the sample specific output directory

mkdir -p $TUTORIAL_HOME/analysis/chimerascan/$SAMPLE_ID/
python $TUTORIAL_HOME/bin/chimerascan_run.py -p 4 \
    $CHIMERASCAN_INDEX \
    $SAMPLE_FASTQ_1 $SAMPLE_FASTQ_2 \
    $TUTORIAL_HOME/analysis/chimerascan/$SAMPLE_ID/

After chimerascan has completed we can run an additional script to create an html output to browse the results.

python $TUTORIAL_HOME/bin/chimerascan_html_table.py --read-throughs \
    -o $TUTORIAL_HOME/analysis/chimerascan/$SAMPLE_ID/chimeras.html \
    $TUTORIAL_HOME/analysis/chimerascan/$SAMPLE_ID/chimeras.bedpe

Run the deFuse gene fusion prediction tool

Environment setup

Set variable for the config filename and the two scripts.

DEFUSE_CONFIG=$TUTORIAL_HOME/refdata/defuse/config.txt
CREATEREF_SCRIPT=$TUTORIAL_HOME/packages/defuse/scripts/create_reference_dataset.pl
DEFUSE_SCRIPT=$TUTORIAL_HOME/packages/defuse/scripts/defuse.pl
DEFUSE_SCRIPT_DIR=$TUTORIAL_HOME/packages/defuse/scripts/

Execution

Running deFuse involves invoking a single script using perl. The output is to a directory which will contain a number of temporary and results files. Create an output directory and run the defuse script specifying the paired end reads and the configuration filename.

mkdir -p $TUTORIAL_HOME/analysis/defuse/$SAMPLE_ID/

perl $DEFUSE_SCRIPT \
    -c $DEFUSE_CONFIG \
    -1 $SAMPLE_FASTQ_1 \
    -2 $SAMPLE_FASTQ_2 \
    -o $TUTORIAL_HOME/analysis/defuse/$SAMPLE_ID/

Run the STAR RNA-Seq aligner on a sample

Environment

Specify the directory in which the reference genome data will be stored.

STAR_GENOME_INDEX=$TUTORIAL_HOME/refdata/star/

Create a variable for the location of the star-fusion perl script for easy running of star-fusion.

STAR_FUSION_SCRIPT=$TUTORIAL_HOME/packages/STAR-Fusion/STAR-Fusion

Execution

Run the star aligner

Running STAR on paired end read fastqs can be performed in a single step. Set the prefix argument --outFileNamePrefix to specify the location of the output.

Assume we have end 1 and end 2 files as $SAMPLE_FASTQ_1 and $SAMPLE_FASTQ_2 environment variables pointing to paired fastq files for sample $SAMPLE_ID.

mkdir -p $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/
STAR --runThreadN 4 \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMstrandField intronMotif \
    --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
    --outReadsUnmapped None \
    --chimSegmentMin 15 \
    --chimJunctionOverhangMin 15 \
    --chimOutType WithinBAM \
    --alignMatesGapMax 200000 \
    --alignIntronMax 200000 \
    --genomeDir $STAR_GENOME_INDEX \
    --readFilesIn $SAMPLE_FASTQ_1 $SAMPLE_FASTQ_2 \
    --outFileNamePrefix $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/

Note: In order to obtain alignments of chimeric reads potentially supporting fusions, we have added the --chimSegmentMin 20 option to obtain chimerica reads anchored by at least 20nt on either side of the fusion boundary, and --chimOutTypeWithinBAM to report such alignments in the sam/bam output.

Note: We are running with additional command line options for fusion discovery as described in the manual for STAR-Fusion.

The file $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Aligned.sortedByCoord.out.bam should contain the resulting alignments in bam format, sorted by position.

Index the bam file using samtools index.

samtools index $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Aligned.sortedByCoord.out.bam

Generate a precalculated coverage file for use with IGV. This will allow viewing of read depth across the genome at all scales, and will speed up IGV viewing of the bam file.

$IGVTOOLS_DIR/igvtools count $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Aligned.sortedByCoord.out.bam \
    $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Aligned.sortedByCoord.out.bam.tdf hg19

You can view the bam file in IGV at:

http://cbw#.dyndns.info/Module4/analysis/star/HCC1395/Aligned.sortedByCoord.out.bam

where you need to replace # with your student ID.

Run STAR fusion caller

The STAR fusion caller parses the chimeric alignments produced by the STAR aligner and predicts fusions from these alignments.

perl $STAR_FUSION_SCRIPT \
    --chimeric_out_sam $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Chimeric.out.sam \
    --chimeric_junction $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Chimeric.out.junction \
    --ref_GTF $GENCODE_GTF_FILENAME \
    --out_prefix $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/starfusion

The fusion reads are output to the sam file Chimeric.out.sam. To view these in IGV, first import into bam format, sort, and then index.

samtools view -bt $UCSC_GENOME_FILENAME \
    $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Chimeric.out.sam \
    | samtools sort - $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Chimeric.out
samtools index $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Chimeric.out.bam
$IGVTOOLS_DIR/igvtools count $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Chimeric.out.bam \
    $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Chimeric.out.bam.tdf hg19

You can view the bam file of fusion reads in IGV at:

http://cbw#.dyndns.info/Module4/analysis/star/HCC1395/Chimeric.out.bam

where you need to replace # with your student ID. Fusion reads can be found at this location chr1:176,901,821-176,904,807.

Run the tophat-fusion gene fusion prediction tool

Environment setup

Set the package dependent on the system, linux or osx.

if [[ `uname` == 'Linux' ]]; then
    TOPHAT_PACKAGE=tophat-2.0.14.Linux_x86_64
elif [[ `uname` == 'Darwin' ]]; then
    TOPHAT_PACKAGE=tophat-2.0.14.OSX_x86_64
fi

Location of tophat-fusion specific gene models, ensembl but with chr prefix.

TOPHAT_GTF_FILENAME=$TUTORIAL_HOME/refdata/tophatfusion/Homo_sapiens.GRCh37.75.gtf

Execution

Running Tophat-fusion is a two step process. First we run tophat2 to do spliced alignments, adding the --fusion-search option to ensure we are looking for fusion splice alignments. Remember to use the prefix of the reference genome, rather than the fasta filename.

Run the tophat2 step specifying a sample specific output directory.

mkdir -p $TUTORIAL_HOME/analysis/tophatfusion/$SAMPLE_ID/

tophat2 -p 4 \
    --mate-inner-dist 100 \
    --mate-std-dev 30 \
    --fusion-search \
    -G $TOPHAT_GTF_FILENAME \
    -o $TUTORIAL_HOME/analysis/tophatfusion/$SAMPLE_ID/ \
    $UCSC_GENOME_PREFIX \
    $SAMPLE_FASTQ_1 $SAMPLE_FASTQ_2

The second step requires us to soft link some of the reference data into a directory with a specific structure. We then run the tophat-fusion post-processing step.

mkdir -p $TUTORIAL_HOME/analysis/tophatfusion/$SAMPLE_ID/tophat_$SAMPLE_ID
cd $TUTORIAL_HOME/analysis/tophatfusion/$SAMPLE_ID/tophat_$SAMPLE_ID

ln -s $TUTORIAL_HOME/refdata/tophatfusion/blast
ln -s $TUTORIAL_HOME/refdata/tophatfusion/refGene.txt
ln -s $TUTORIAL_HOME/refdata/tophatfusion/ensGene.txt

tophat-fusion-post \
    --num-fusion-reads 1 \
    --num-fusion-pairs 2 \
    --num-fusion-both 5 \
    $UCSC_GENOME_PREFIX

Run the Trinity RNA-Seq assembler / GMAP contig mapper

Environment

The trinity package contains some perl modules, add the directory containing these modules to the perl library directory, so perl knows where to find them.

export PERL5LIB=$TUTORIAL_HOME/packages/trinityrnaseq_r20140413p1/PerlLib/:$PERL5LIB

Execution

Assemble using Trinity

Create a directory for the trinity analysis. Run trinity with fastq (fq) as the sequence type. Specify memory and threads, and provide the fastq paths and output paths.

mkdir -p $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/

cp $SAMPLE_FASTQ_1 $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/sample_reads_1.fq
cp $SAMPLE_FASTQ_2 $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/sample_reads_2.fq

Trinity \
    --seqType fq --JM 12G --CPU 4 \
    --left $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/sample_reads_1.fq \
    --right $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/sample_reads_2.fq \
    --output $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/

Align using GMap

Use gmap to map the resulting contigs to the reference genome and produce a GFF3 file.

cd $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/

gmap \
    -f gff3_gene \
    -D $GMAP_INDEX_DIR \
    -d $GMAP_REF_ID \
    Trinity.fasta \
    > Trinity.gff3

Postprocess

Post-process the gff3 file to produce a list of fusions. We will use a custom script to simply pull out chimeric contigs. Realistically, further processing would be required to identify true fusions.

cd $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/

python $TUTORIAL_HOME/packages/cbw_tutorial/gene_fusions/scripts/gmap_extract_fusions.py \
    Trinity.gff3 Trinity.fasta Fusions.fasta

The following sequence is the ASTN1-FAM5B fusion.

>c107_g1_i2 len=396 path=[1:0-245 247:246-395]
GTTTAGAGGGCTTCGGCCGGGGATGGCCCCATGGACAGCCCTGCTGGCACTGGGCCTGCC
TGGCTGGGTGTTGGCTGTCTCAGCCACGGCGGCTGCTGTGGTCCCCGAGCAGCATGCCTC
CGTAGCTGGCCAGCATCCCCTGGACTGGCTGCTCACAGACCGGGGCCCCTTCCACCGCGC
TCAGGAGTATGCTGACTTCATGGAGCGGTACCGCCAGGGTTTCACCACCAGGTACAGGAT
TTATAGCCAGAGTCGCCGCTGGACCTGCTTGTTTGGGTACCAGATAGAACAGGACTCCTC
AAAGCCGTAGATAGCTTCCTGGATGTAATGGTTGCCGAACTGGTCCAACAGCGCCACAAA
ATCTGCACGAGATGTAGCCCCATCCAGCGAGTGGAG
View on GitHub