Informatics on High-Throughput Data 2015

Workshop pages for students

Laptop Setup Instructions

Instructions for setting up your laptop can be found here: Laptop Setup Instructions

Pre-Workshop Tutorials

1) R Preparation tutorials: You are expected to have completed the following tutorials in R beforehand. The tutorials should be very accessible even if you have never used R before.

2) UNIX Preparation tutorials:

3) IGV Tutorial: Review how to use IGV Genome Browser if you have not used this tool before.

Pre-Workshop Readings

Using cloud computing infrastructure with CloudBioLinux, CloudMan, and Galaxy

Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration

Genome structural variation discovery and genotyping

A survey of sequence alignment algorithms for next-generation sequencing

Genotype and SNP calling from next-generation sequencing data

Logging into the Amazon cloud

Instructions can be found here.

  • These instructions will ONLY be relevant in class, as the Cloud will not be accessible from home in advance of the class.

Day 1


*Faculty: Michelle Brazas*

Module 1: Overview of HT-sequencing & Cloud Computing

*Faculty: Zhibin Lu*




Module 2: Reference-guided Genome Alignment

*Faculty: Matei David*



Lab Practical:

Reference Guided Genome Alignment Lab practical

Discussion questions

Data set:

After the workshop: You can download the data set from here. To do the lab practical on your own machine, you may also need to download the reference genome if you do not already have one.

Programs used:

Links to Additional Resources:

Module 3: Data Visualization

*Faculty: Sorana Morrissy*



Lab Practical:

Programs used:

Module 4: De Novo Assembly

*Faculty: Jared Simpson*



Integrated Assignment for Day 1

*Faculty: Sorana Morrissy*

Review the techniques learned in Modules 1-3. An additional dataset (a FASTQ file) has been provided here for this purpose.

# Create a directory to work in:
# this is where we'll place all of our output files
mkdir -p ~/workspace/Integrated_assignment
cd ~/workspace/Integrated_assignment
# Erase any files that might already be there
# (-f avoids an error if the directory is empty; ./ limits the glob to this directory)
rm -f ./*
# Create symbolic links for all of the files contained in the Module2 directory
# this includes the hg19 genome, the FASTQ files, and dbSNP annotation
ln -s ~/CourseData/HT_data/Module2/* .

Task list:

  1. Align the raw data to the human reference genome.
  2. Sort the reads and perform duplicate removal.
  3. Index the sorted bam file.
  4. Perform indel cleaning.
  5. Visualize the alignments.


Questions:

  1. Explain the purpose of each step.
  2. Which software tool can be used for each step?
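The five tasks above can be sketched as a shell script. This is a minimal sketch, not the workshop's reference solution: the tool choices (bwa, samtools 1.3 or later, Picard, GATK 3) are assumptions, since this page does not name the programs, and every file name and jar path below is a hypothetical placeholder. By default the script only prints each command; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Dry-run sketch of the task list. Tool choices and all file names are
# assumptions for illustration; the workshop page does not specify them.
# Assumes the reference is already indexed for bwa, samtools (.fai) and Picard (.dict).
DRY_RUN="${DRY_RUN:-1}"
REF="${REF:-hg19.fa}"                          # hypothetical reference name
R1="${R1:-reads_R1.fastq.gz}"                  # hypothetical FASTQ names
R2="${R2:-reads_R2.fastq.gz}"
PICARD_JAR="${PICARD_JAR:-picard.jar}"         # hypothetical jar locations
GATK_JAR="${GATK_JAR:-GenomeAnalysisTK.jar}"   # GATK 3.x (IndelRealigner is not in GATK 4)

# Print the command; execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$1"
    if [ "$DRY_RUN" = "0" ]; then
        eval "$1"
    fi
}

pipeline() {
    # Tasks 1 and 2 (first half): align, then sort by coordinate.
    run "bwa mem -t 4 -R '@RG\tID:ia\tSM:ia\tPL:ILLUMINA' $REF $R1 $R2 | samtools sort -o sorted.bam -" || return 1
    # Task 2 (second half): remove duplicate reads.
    run "java -jar $PICARD_JAR MarkDuplicates INPUT=sorted.bam OUTPUT=dedup.bam METRICS_FILE=dedup.metrics REMOVE_DUPLICATES=true" || return 1
    # Task 3: index the sorted, de-duplicated BAM file.
    run "samtools index dedup.bam" || return 1
    # Task 4: indel cleaning (local realignment around indels).
    run "java -jar $GATK_JAR -T RealignerTargetCreator -R $REF -I dedup.bam -o realign.intervals" || return 1
    run "java -jar $GATK_JAR -T IndelRealigner -R $REF -I dedup.bam -targetIntervals realign.intervals -o realigned.bam" || return 1
    # Task 5: visualization is interactive, so only a reminder is printed.
    printf '%s\n' "Task 5: open realigned.bam in IGV (File > Load from File)."
}

pipeline
```

Printing the commands first makes it easy to check paths and options before committing to a long alignment run.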

Integrated Assignment: IA_Question_Answers_2015.txt

Day 2

Module 5: Small variant calling & annotation

*Faculty: Guillaume Bourque*



Lab Practical:

Lab directions

VCF format


**Pro-tip 1:**

A great resource for putting together a GATK-based variant calling pipeline is the GATK Best Practices page. This page will guide you in your quest to produce the best variant calls possible using GATK.

**Pro-tip 2:**

Another useful program is vcftools, which can generate summary statistics on VCF files, filter them, and compare multiple VCF files.
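As a rough illustration of those three uses, the sketch below prints one vcftools command for each. The file names (variants.vcf, other_calls.vcf) and the quality threshold of 30 are hypothetical placeholders, not values from the lab; the options shown are standard vcftools options.

```shell
#!/bin/sh
# Hypothetical file names; replace them with your own call sets.
VCF="${VCF:-variants.vcf}"
OTHER="${OTHER:-other_calls.vcf}"

vcftools_examples() {
    # Summary statistics: per-site allele frequencies, written to summary.frq
    echo "vcftools --vcf $VCF --freq --out summary"
    # Filtering: keep SNPs with QUAL above 30, written to snps_q30.recode.vcf
    echo "vcftools --vcf $VCF --remove-indels --minQ 30 --recode --recode-INFO-all --out snps_q30"
    # Comparison: report which sites are shared or unique between two files
    echo "vcftools --vcf $VCF --diff $OTHER --diff-site --out compare"
}

# Print the commands for review; to execute them, run: vcftools_examples | sh
vcftools_examples
```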

Data set:

After the workshop: You can download the data set from here to your local machine and work from there.

Programs used:

Module 6: Structural variation calling

*Faculty: Guillaume Bourque*



Lab Practical:

Lab directions

Data set:

After the workshop: You can download the data set from here to your local machine and work from there.

Programs used:

Module 7: Bringing it all Together: Galaxy

*Faculty: Francis Ouellette*



Lab Practical:


Dataset for the Galaxy lab:

In Galaxy, under Get Data, choose Upload File and paste the following into the URL box:

NA12878_CBW_chr1_R1.fastq.gz http://cbwxx.dyndns.info/module2/NA12878_CBW_chr1_R1.fastq.gz
NA12878_CBW_chr1_R2.fastq.gz http://cbwxx.dyndns.info/module2/NA12878_CBW_chr1_R2.fastq.gz
hg19_chr1.fa http://cbwxx.dyndns.info/module7/hg19_chr1.fa
dbSNP_135_chr1.vcf.gz http://cbwxx.dyndns.info/module2/dbSNP_135_chr1.vcf.gz

Note: xx is your student number.
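To avoid editing each URL by hand, the substitution can be scripted. The sketch below is only a convenience: the function name galaxy_urls and the example student number 12 are hypothetical, and it assumes the number is inserted exactly as it was given to you.

```shell
#!/bin/sh
# Print the four upload URLs for one student number.
# galaxy_urls and the default number 12 are illustrative placeholders.
galaxy_urls() {
    host="cbw$1.dyndns.info"
    printf 'http://%s/module2/NA12878_CBW_chr1_R1.fastq.gz\n' "$host"
    printf 'http://%s/module2/NA12878_CBW_chr1_R2.fastq.gz\n' "$host"
    # The reference FASTA is served from module7, unlike the other files.
    printf 'http://%s/module7/hg19_chr1.fa\n' "$host"
    printf 'http://%s/module2/dbSNP_135_chr1.vcf.gz\n' "$host"
}

galaxy_urls "${STUDENT:-12}"
```

The printed lines can be pasted directly into the URL box, one per line.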

Galaxy workflow part 1 (cloud): Galaxy-Workflow-CBW Galaxy lab part1 Alignment Variant calling.ga
Galaxy workflow part 2 (main instance): Galaxy-Workflow-CBW Galaxy lab part2 VariantFiltration Annotation.ga

What you need for the lab:

You will need an account on Galaxy so that you can run tools in their environment.

Galaxy Resources:

galaxyproject.org: Galaxy home page
usegalaxy.org: main Galaxy public server
getgalaxy.org: source for installing local Galaxy
usegalaxy.org/cloud: use Galaxy in the cloud
Example of a Galaxy pipeline (used for an RNA-Seq lab last year). Save file as: Galaxy-Workflow-Module_5_workflow_from_Emilie_Chautard_and_Francis.ga
Galaxy 101 worked example
Galaxy servers throughout the world
Published (read: Public) pages

Tips, tricks, and resources

Data Sets from Entire Workshops

Reference Genome for HT-seq
Module 2/7/HT-seq Integrated Data
Module 5 Data
Module 6 Data

Results from Instructor’s Instance on Amazon

Module 2 result
Module 5 result
Module 6 result

Tools with installation instructions on our Amazon server

Instructions for installing the tools used in the workshops can be found here.

Launching CBW AMI

Steps to launch CBW public AMI

  • AMI ID: ami-b9a253d2
  • AMI Name: CBW workshops 2015