Informatics on High-Throughput Sequencing Data 2016

Workshop Q/A Forum

Post your workshop questions here!

Workshop Survey

We value your feedback. Please fill out our survey to help us make our workshops better.

Class Photo

Class photo

Laptop Setup Instructions

Instructions for setting up your laptop can be found here.

Pre-workshop Tutorials

1) R Preparation tutorials: You are expected to have completed the following tutorials in R beforehand. The tutorials should be accessible even if you have never used R before.

2) UNIX Preparation tutorials: Please complete tutorials #1-3 on UNIX at http://www.ee.surrey.ac.uk/Teaching/Unix/

Unix Cheat sheet

3) IGV Tutorial: Review how to use the IGV genome browser if you have not used it before.

IGV tutorial

Pre-workshop Readings

Before coming to the workshop, read these.

Logging into the Amazon Cloud

Instructions can be found here.

We have set up 30 instances on the Amazon cloud, one for each student. To log in to your instance, you will need a security certificate. If you plan on using Linux or Mac OS X, please download this certificate. If you plan on using Windows (with PuTTY and WinSCP), please download this certificate.
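
As a rough illustration of the Linux/Mac OS X route, logging in with the downloaded certificate usually looks like the sketch below. The certificate file name (`CBW.pem`), the login user (`ubuntu`), and the student number are assumptions, not values given on this page; the host pattern follows the `cbw##.dyndns.info` addresses listed for the Module 7 data set.

```shell
# All three values below are placeholders (assumptions); substitute your own.
STUDENT_NUM="07"      # your student number, written as it appears in your instance address
KEY_FILE="CBW.pem"    # hypothetical name of the downloaded certificate
LOGIN_USER="ubuntu"   # assumed login user on the instance

HOST="cbw${STUDENT_NUM}.dyndns.info"

# ssh refuses private keys that other users can read, so restrict permissions.
if [ -f "$KEY_FILE" ]; then
    chmod 600 "$KEY_FILE"
fi

# Build the login command and print it; run it yourself once the values are right.
LOGIN_CMD="ssh -i $KEY_FILE $LOGIN_USER@$HOST"
echo "$LOGIN_CMD"
```

Printing the command rather than running it keeps the sketch safe to try before the real values are known.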

Recorded Lectures’ Playlist

YouTube Playlist for Recorded Lectures

Day 1

Welcome

Ann Meyer

Module 1: Introduction to HT-sequencing and Cloud Computing

Zhibin Lu

Module 2: Genome Alignment

Mathieu Bourgey

Lab practical

Programs:

Additional Resources:

Module 3: Genome Visualization

Florence Cavalli

Module 4: De Novo Assembly

Jared Simpson

Integrated Assignment script

Integrated Assignment

Florence Cavalli

We will perform the same analysis as in Module 2, but using the mother and father samples, i.e. samples NA12891 and NA12892.

Files are in the following directory of the cloud instance: ~/CourseData/HT_data/Module2/

 * raw_reads/NA12891_CBW_chr1_R1.fastq.gz
 * raw_reads/NA12891_CBW_chr1_R2.fastq.gz
 * raw_reads/NA12892_CBW_chr1_R1.fastq.gz
 * raw_reads/NA12892_CBW_chr1_R2.fastq.gz
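
Before starting the analysis, it can be worth confirming that the four files are present and that each R1/R2 pair holds the same number of reads. The sketch below counts reads by dividing the line count by four, which assumes standard four-line FASTQ records. The `DATA_DIR` variable and the `count_reads` helper are illustrative names, not part of the course material.

```shell
# Directory holding the course data; override DATA_DIR if yours differs.
DATA_DIR="${DATA_DIR:-$HOME/CourseData/HT_data/Module2}"

# Count reads in a gzipped FASTQ file, assuming 4 lines per read.
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

for sample in NA12891 NA12892; do
    r1="$DATA_DIR/raw_reads/${sample}_CBW_chr1_R1.fastq.gz"
    r2="$DATA_DIR/raw_reads/${sample}_CBW_chr1_R2.fastq.gz"
    if [ -f "$r1" ] && [ -f "$r2" ]; then
        # Paired-end files should report the same number of reads.
        echo "$sample R1=$(count_reads "$r1") R2=$(count_reads "$r2")"
    else
        echo "$sample: FASTQ files not found under $DATA_DIR"
    fi
done
```

If the two counts for a sample differ, the pair is incomplete and should be copied or downloaded again before alignment.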

# Set up
export ROOT_DIR=~/workspace/Integrated_assignment
export TRIMMOMATIC_JAR=$ROOT_DIR/tools/Trimmomatic-0.36/trimmomatic-0.36.jar
export PICARD_JAR=$ROOT_DIR/tools/picard-tools-1.141/picard.jar
export GATK_JAR=$ROOT_DIR/tools/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar
export BVATOOLS_JAR=$ROOT_DIR/tools/bvatools-1.6/bvatools-1.6-full.jar
export REF=$ROOT_DIR/reference/

# Create a directory to work in (workspace/Integrated_assignment)
# this is where we'll place all of our output files

mkdir -p $ROOT_DIR
cd $ROOT_DIR

# Erase any files that might already be there
rm *
 
# Create symbolic links for all of the files contained in the Module2 directory
# this includes the hg19 genome and the FASTQ files
ln -s ~/CourseData/HT_data/Module2/* .
ls

Task list:

Check read QC
Trim unreliable bases from the read ends
Align the reads to the reference
Sort the alignments by chromosome position
Realign short indels
Fix mate issues
Recalibrate the Base Quality
Generate alignment metrics

Discussion/Questions:

Explain the purpose of each step
Which software tool can be used for each step?

Day 2

Module 5: Genome Variation

Guillaume Bourque

Pro-tip: A great resource for putting together a GATK-based variant calling pipeline is the GATK Best Practices page. This page will guide you in your quest to produce the best variant calls possible using GATK.

Pro-tip 2: Another useful program for generating summary statistics on VCF files, filtering VCF files, and comparing multiple VCF files is vcftools.

Programs:

Module 6: Genome Structural Variation

Guillaume Bourque

Programs:

Module 7: Bringing it Together with Galaxy

David Morais

Lab practical

Data set:

NA12878_CBW_chr1_R1.fastq.gz
http://cbw##.dyndns.info/HTSeq_module2/raw_reads/NA12878/NA12878_CBW_chr1_R1.fastq.gz

NA12878_CBW_chr1_R2.fastq.gz
http://cbw##.dyndns.info/HTSeq_module2/raw_reads/NA12878/NA12878_CBW_chr1_R2.fastq.gz

hg19_chr1.fa
http://cbw##.dyndns.info/Module7/hg19_chr1.fa

dbSNP_135_chr1.vcf.gz
http://cbw##.dyndns.info/HTSeq_module2/reference/dbSNP_135_chr1.vcf.gz

Note: ## is your student number.
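
To avoid editing each address by hand, the substitution can be sketched as a small shell function. The function name `dataset_urls` and the example student number `07` are placeholders chosen for illustration; the paths are the four listed above.

```shell
# Print the four data set URLs for a given student number.
# Pass the number exactly as it appears in your instance address.
dataset_urls() {
    base="http://cbw$1.dyndns.info"
    for path in \
        HTSeq_module2/raw_reads/NA12878/NA12878_CBW_chr1_R1.fastq.gz \
        HTSeq_module2/raw_reads/NA12878/NA12878_CBW_chr1_R2.fastq.gz \
        Module7/hg19_chr1.fa \
        HTSeq_module2/reference/dbSNP_135_chr1.vcf.gz
    do
        echo "$base/$path"
    done
}

dataset_urls 07   # 07 is a placeholder student number
```

The printed addresses can then be pasted wherever the lab asks for them, for example into Galaxy's upload tool, which accepts URLs.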

Galaxy workflow part 1 (cloud):

Galaxy workflow part 2 (main instance):

What you need for the lab:

Galaxy public server
An account on Galaxy to run tools in their environment.

Galaxy Resources:

Galaxy home page
Galaxy public server
Source for installing local Galaxy
Galaxy in the Cloud
Example of Galaxy pipeline
Galaxy 101 worked example
Galaxy servers throughout the world
Published pages