Bioinformatics for Cancer Genomics Prediction Lab Install 2015

Workshop pages for students

General installation for CBW tutorial

Environment setup

For this tutorial, we now assume the environment variable $TUTORIAL_HOME has been set to an existing directory to which the user has write access.

Add tutorial binaries to the path.


Create a variable pointing to the directory of the picard tools install.


Create a variable pointing to the directory of the igvtools install.



Tutorial directory structure

Create directories for packages, binaries, reference data.

mkdir -p $TUTORIAL_HOME/packages
mkdir -p $TUTORIAL_HOME/bin
mkdir -p $TUTORIAL_HOME/refdata

Install tutorial scripts

Clone the tutorial repo so we have a copy of the tutorial scripts in a known location.

cd $TUTORIAL_HOME/packages
git clone https://bitbucket.org/dranew/cbw_tutorial.git

Install samtools

The samtools package is the most widely used software for manipulating high-throughput sequence data stored in the ‘bam’ format.

Download the source to the packages directory.

cd $TUTORIAL_HOME/packages
wget --no-check-certificate https://github.com/samtools/samtools/releases/download/1.2/samtools-1.2.tar.bz2 -O samtools-1.2.tar.bz2
tar -xjvf samtools-1.2.tar.bz2

Build samtools and install to the $TUTORIAL_HOME/bin directory.

cd samtools-1.2
make prefix=$TUTORIAL_HOME install

Install picard tools

Picard tools is a useful set of utilities for manipulating sequence data in bam/sam format. To run picard tools you need:

With the above installed, download the picard tools java archives.

cd $TUTORIAL_HOME/packages
wget --no-check-certificate https://github.com/broadinstitute/picard/releases/download/1.130/picard-tools-1.130.zip -O picard-tools-1.130.zip
unzip picard-tools-1.130.zip

Install igv tools

The igvtools package provides utilities for preprocessing bam files for quicker viewing in IGV.

cd $TUTORIAL_HOME/packages
wget http://data.broadinstitute.org/igv/projects/downloads/igvtools_2.3.52.zip
unzip igvtools_2.3.52.zip

Install bowtie and bowtie2

Install bowtie dependent on platform.

cd $TUTORIAL_HOME/packages
if [[ `uname` == 'Linux' ]]; then
elif [[ `uname` == 'Darwin' ]]; then
wget http://sourceforge.net/projects/bowtie-bio/files/bowtie/1.1.1/$BOWTIE_PACKAGE.zip
cp bowtie-1.1.1/bowtie* $TUTORIAL_HOME/bin/

Install bowtie2 dependent on platform.

cd $TUTORIAL_HOME/packages
if [[ `uname` == 'Linux' ]]; then
elif [[ `uname` == 'Darwin' ]]; then
wget http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.5/$BOWTIE2_PACKAGE.zip
unzip $BOWTIE2_PACKAGE.zip
cp bowtie2-2.2.5/bowtie2* $TUTORIAL_HOME/bin/

Install the gmap aligner

Install in the packages directory.

cd $TUTORIAL_HOME/packages

Download and unpack the source.

wget http://research-pub.gene.com/gmap/src/gmap-gsnap-2014-12-29.tar.gz
tar -xzvf gmap-gsnap-2014-12-29.tar.gz

Build the source.

cd gmap-2014-12-29
./configure --prefix=$TUTORIAL_HOME
make install

Install the bwa aligner

Install in the packages directory.

cd $TUTORIAL_HOME/packages

Download and unpack the source.

wget --no-check-certificate http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.12.tar.bz2
tar -xjvf bwa-0.7.12.tar.bz2

Build the source.

cd bwa-0.7.12
cp bwa $TUTORIAL_HOME/bin/

Installation of the ChimeraScan gene fusion prediction tool

Environment setup

Add local library to python path

export PYTHONPATH=$PYTHONPATH:$TUTORIAL_HOME/lib/python2.7/site-packages

Set variable for index directory.



Install the source code

Install in the packages directory.

cd $TUTORIAL_HOME/packages

Download the chimerascan code.

wget --no-check-certificate https://chimerascan.googlecode.com/files/chimerascan-0.4.5a.tar.gz
tar -xzvf chimerascan-0.4.5a.tar.gz

Enter the chimerascan directory to build source.

cd chimerascan-0.4.5

Use sed for a minor patch for OSX compilation.

sed 's/inline/static inline/' chimerascan/pysam/samtools/ksort.h
python setup.py install --prefix $TUTORIAL_HOME

Install the required reference data files

Install the reference data in a subdirectory of the tutorial ref data.

mkdir -p $TUTORIAL_HOME/refdata/chimerascan/
cd $TUTORIAL_HOME/refdata/chimerascan/

Download the gene models from chimerascan’s google code site as specified in the instructions.

wget https://chimerascan.googlecode.com/files/hg19.ucsc_genes.txt.gz
gunzip hg19.ucsc_genes.txt.gz

Build the chimerascan indices using the chimerascan_index.py command.

python $TUTORIAL_HOME/bin/chimerascan_index.py \
    $UCSC_GENOME_FILENAME hg19.ucsc_genes.txt \

Installation of the deFuse gene fusion prediction tool

Environment setup

Set variable for the config filename and the two scripts.



Install the source code

Clone the defuse code into the packages directory.

cd $TUTORIAL_HOME/packages
git clone http://bitbucket.org/dranew/defuse.git

Change to the tools directory to build the tools from source.

cd $TUTORIAL_HOME/packages/defuse/tools

Create a copy of the default config and modify some of the entries. We will do this using sed, but equivalently you could use your favourite text editor.

mkdir $TUTORIAL_HOME/refdata/defuse/

cat $TUTORIAL_HOME/packages/defuse/scripts/config.txt \
    | sed 's#\[Where you unpacked the defuse code\]#'$TUTORIAL_HOME/packages/defuse'#' \
    | sed 's#\[Where you intend to store the reference dataset\]#'$TUTORIAL_HOME'/refdata/defuse#' \
    | sed 's#\[path of your \(.*\) binary.*\]#\1#' \

Install the required reference data files

The reference data files can be downloaded and index automatically using the create_reference_dataset.pl script.


Installation of the STAR RNA-Seq aligner


Specify the directory in which the reference genome data will be stored.


Create a variable for the location of the star-fusion perl script for easy running of star-fusion.



Clone the star repo in the packages directory.

cd $TUTORIAL_HOME/packages
git clone --recursive https://github.com/alexdobin/STAR.git

Copy the star executable.

if [[ `uname` == 'Linux' ]]; then
    cp STAR/bin/Linux_x86_64_static/STAR $TUTORIAL_HOME/bin
elif [[ `uname` == 'Darwin' ]]; then
    cp STAR/bin/MacOSX_x86_64/STAR $TUTORIAL_HOME/bin

Clone the star-fusion repo and checkout version 0.1.1.

cd $TUTORIAL_HOME/packages
git clone https://github.com/STAR-Fusion/STAR-Fusion.git
cd STAR-Fusion
git checkout v0.1.1

Install the Set::IntervalTree perl module required by star-fusion.

cd $TUTORIAL_HOME/packages
mkdir $TUTORIAL_HOME/lib
git clone https://github.com/amcpherson/Set-IntervalTree.git
cd Set-IntervalTree
make install

Create genome

Index the genome. Here we need to include the reference genome fasta and the gene models in GTF format.

STAR --runThreadN 1 \
    --runMode genomeGenerate \
    --genomeDir $STAR_GENOME_INDEX \
    --genomeFastaFiles $UCSC_GENOME_FILENAME \
    --sjdbOverhang 100

Installation of the tophat-fusion gene fusion prediction tool

Environment setup

Set the package dependent on the system, linux or osx.

if [[ `uname` == 'Linux' ]]; then
elif [[ `uname` == 'Darwin' ]]; then

Location of tophat-fusion specific gene models, ensembl but with chr prefix.



Install in the packages directory.

cd $TUTORIAL_HOME/packages

Download and extract the source code.

wget --no-check-certificate wget --no-check-certificate http://ccb.jhu.edu/software/tophat/downloads/$TOPHAT_PACKAGE.tar.gz
tar -xzvf $TOPHAT_PACKAGE.tar.gz

Build the source code and install binaries.

ln -s `pwd`/tophat2 $TUTORIAL_HOME/bin/tophat2
ln -s `pwd`/tophat-fusion-post $TUTORIAL_HOME/bin/tophat-fusion-post

Reference Genome Preparation

Install the reference gene annotations provided by tophat-fusion in the tophat-fusion specific reference data directory.

mkdir -p $TUTORIAL_HOME/refdata/tophatfusion/
cd $TUTORIAL_HOME/refdata/tophatfusion/
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
gunzip refGene.txt.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.txt.gz
gunzip ensGene.txt.gz

We require the gene models in GTF format and we will use the GTF provided by ensembl. However, we first need to add the ‘chr’ prefix to the chromosome names so we match the ucsc genome. Use the sed command to create a modified version of the ensembl gene models with the chr prefix for each chromosome. This can be done by just adding chr to the beginning of each line in the GTF file.

sed 's/^\([^#]\)/chr\1/' $ENSEMBL_GTF_FILENAME | sed 's/^chrMT/chrM/' > $TOPHAT_GTF_FILENAME

Tophat requires additional blast databases, download and install these

mkdir -p $TUTORIAL_HOME/refdata/tophatfusion/blast
cd $TUTORIAL_HOME/refdata/tophatfusion/blast
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/human_genomic.*.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.*.tar.gz
gunzip *.gz

Installation of the Trinity RNA-Seq assembler


The trinity package contains some perl modules, add the directory containing these modules to the perl library directory, so perl knows where to find them.

export PERL5LIB=$TUTORIAL_HOME/packages/trinityrnaseq_r20140413p1/PerlLib/:$PERL5LIB


Install into tutorial packages directory.

cd $TUTORIAL_HOME/packages

Download and extract the trinity source.

wget --no-check-certificate http://sourceforge.net/projects/trinityrnaseq/files/trinityrnaseq_r20140413p1.tar.gz
tar -xzvf trinityrnaseq_r20140413p1.tar.gz

Make the executable and add a symbolic link in the bin directory.

cd trinityrnaseq_r20140413p1
ln -s `pwd`/Trinity $TUTORIAL_HOME/bin/Trinity
View on GitHub