Laptop Setup Instructions
Instructions for setting up your laptop can be found here: Laptop Setup Instructions
Pre-Workshop Tutorials
1) R Preparation tutorials: You are expected to have completed the following tutorials in R beforehand. The tutorials should be accessible even if you have never used R before.
- The R Tutorial up to and including 5. Basic Plots
- The R command cheat sheet
2) UNIX Preparation tutorials:
- UNIX Bootcamp
- Tutorials #1-3 on UNIX Tutorial for Beginners
- Unix Cheat sheet
3) IGV Tutorial: Review how to use IGV Genome Browser if you have not used this tool before.
- The IGV Tutorial
Pre-Workshop Readings
Using cloud computing infrastructure with CloudBioLinux, CloudMan, and Galaxy
Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration
Genome structural variation discovery and genotyping
A survey of sequence alignment algorithms for next-generation sequencing
Genotype and SNP calling from next-generation sequencing data
Logging into the Amazon cloud
Instructions can be found here.
- These instructions will ONLY be relevant in class, as the Cloud will not be accessible from home in advance of the class.
Day 1
Welcome
*Faculty: Michelle Brazas*
Module 1: Overview of HT-sequencing & Cloud Computing
*Faculty: Zhibin Lu*
Lecture:
HT-seq2015_Module1.pdf
HT-seq2015_Module1.ppt
HT-seq2015_Module1.mp4
Module 2: Reference-guided Genome Alignment
*Faculty: Matei David*
Lecture:
HT-seq2015_Module2.pdf
HT-seq2015_Module2.mp4
Lab Practical:
Reference Guided Genome Alignment Lab practical
Data set:
After the workshop: You can download the data set from here. You may also need to download the reference genome, if you do not already have one, to do the lab practical on your own machine.
Programs used:
Links to Additional Resources:
- SEQanswers bioinformatics forum
- SAM/BAM file format specification
- Paired end vs mate pair reads
- Base qualities vs mapping qualities
- The decoy genome
Module 3: Data Visualization
*Faculty: Sorana Morrissy*
Lecture:
HT-seq2015_Module3.pdf
HT-seq2015_Module3.ppt
HT-seq2015_Module3.mp4
Lab Practical:
Programs used:
Module 4: De Novo Assembly
*Faculty: Jared Simpson*
Lecture:
HT-seq2015_Module4.pdf
HT-seq2015_Module4.mp4
Integrated Assignment for Day 1
*Faculty: Sorana Morrissy*
Review the techniques learned in Modules 1-3. An additional dataset (fastq file) has been provided here for this purpose.
# Create a directory to work in:
# this is where we'll place all of our output files
mkdir -p ~/workspace/Integrated_assignment
cd ~/workspace/Integrated_assignment
# Erase any files that might already be there
# (-f avoids an error when the directory is empty)
rm -f -- *
# Create symbolic links for all of the files contained in the Module2 directory
# this includes the hg19 genome, the FASTQ files, and dbSNP annotation
ln -s ~/CourseData/HT_data/Module2/* .
ls
Task list:
- Align the raw data to the human reference genome.
- Sort the reads and perform duplicate removal.
- Index the sorted bam file.
- Perform indel cleaning.
- Visualize the alignments.
Discussion/Questions:
- Explain the purpose of each step.
- Which software tool can be used for each step?
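One possible mapping of the task list to tools is sketched below. It is not the official answer key. It assumes bwa, samtools (1.3 or later), Picard, and GATK 3.x are installed, that the reference is already indexed for all three, and that the jar locations and every file name are hypothetical placeholders. The function `print_pipeline` is an illustrative helper that only prints the commands so you can review them before running anything.

```shell
# Print one possible command sequence for the task list (nothing is executed).
# Assumptions: bwa, samtools >= 1.3, Picard and GATK 3.x are installed; the
# reference already has bwa, .fai and .dict indexes; all file names, jar
# paths and the read group values are hypothetical.
print_pipeline() {
    ref="$1"; shift
    reads="$*"    # one FASTQ (single-end) or two (paired-end)
    cat <<EOF
bwa mem -t 4 -R '@RG\tID:ia\tSM:ia\tPL:ILLUMINA' $ref $reads > aln.sam
samtools sort -o aln.sorted.bam aln.sam
java -jar picard.jar MarkDuplicates I=aln.sorted.bam O=aln.dedup.bam M=dup_metrics.txt REMOVE_DUPLICATES=true
samtools index aln.dedup.bam
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R $ref -I aln.dedup.bam -o realign.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R $ref -I aln.dedup.bam -targetIntervals realign.intervals -o aln.realigned.bam
EOF
}

print_pipeline hg19.fa reads_R1.fastq.gz reads_R2.fastq.gz
```

The printed lines cover alignment, sorting, duplicate removal, indexing, and indel realignment in that order. The final task, visualization, has no command here: load the realigned BAM into IGV. Older samtools releases use a different `sort` syntax, and GATK 4 no longer ships the indel realignment tools, so adjust the commands to the versions on your machine.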
Integrated Assignment: IA_Question_Answers_2015.txt
Day 2
Module 5: Small variant calling & annotation
*Faculty: Guillaume Bourque*
Lecture:
HT-seq2015_Module5.pdf
HT-seq2015_Module5.ppt
HT-seq2015_Module5.mp4
Lab Practical:
**Pro-tip:** A great resource for putting together a GATK-based variant calling pipeline is the GATK Best Practices page. This page will guide you in your quest to produce the best variant calls possible using GATK.
**Pro-tip 2:** Another useful program for generating summary statistics on vcf files, filtering vcf files, and comparing multiple vcf files is vcftools.
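As a minimal sketch of the three vcftools uses mentioned above (summary statistics, filtering, comparison), the helpers below wrap a few standard vcftools options. The function names `vcf_report` and `vcf_compare`, the QUAL threshold of 30, and all file names are illustrative assumptions, not values from the lab.

```shell
# Summarise and filter one VCF file with vcftools.
# Assumptions: vcftools is on PATH; the threshold and names are illustrative.
vcf_report() {
    vcf="$1"; prefix="$2"
    [ -r "$vcf" ] || { echo "cannot read $vcf" >&2; return 1; }
    command -v vcftools >/dev/null 2>&1 || { echo "vcftools not found" >&2; return 1; }
    # Per-site allele frequencies, written to $prefix.frq
    vcftools --vcf "$vcf" --freq --out "$prefix" || return 1
    # Drop indels and sites below QUAL 30, written to $prefix.filtered.recode.vcf
    vcftools --vcf "$vcf" --remove-indels --minQ 30 \
        --recode --recode-INFO-all --out "$prefix.filtered"
}

# Compare the sites present in two VCF files,
# written to $prefix.diff.sites_in_files
vcf_compare() {
    first="$1"; second="$2"; prefix="$3"
    [ -r "$first" ] && [ -r "$second" ] || { echo "cannot read input VCFs" >&2; return 1; }
    command -v vcftools >/dev/null 2>&1 || { echo "vcftools not found" >&2; return 1; }
    vcftools --vcf "$first" --diff "$second" --diff-site --out "$prefix"
}
```

For example, `vcf_report calls.vcf calls` followed by `vcf_compare calls.vcf other.vcf calls_vs_other` would produce the frequency table, a filtered VCF, and a site-level comparison. For gzipped input, vcftools provides `--gzvcf` and `--gzdiff` in place of `--vcf` and `--diff`.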
Data set:
After the workshop: You can download the data set from here to your local machine and work from there.
Programs used:
Module 6: Structural variation calling
*Faculty: Guillaume Bourque*
Lecture:
HT-seq2015_Module6.pdf
HT-seq2015_Module6.ppt
HT-seq2015_Module6.mp4
Lab Practical:
Data set:
After the workshop: You can download the data set from here to your local machine and work from there.
Programs used:
Module 7: Bringing it all Together: Galaxy
*Faculty: Francis Ouellette*
Lecture:
HT-seq2015_Module7.pdf
HT-seq2015_Module7.ppt
HT-seq2015_Module7.mp4
Lab Practical:
Dataset for the Galaxy lab:
In Galaxy, go to Get Data, choose Upload File, and paste the following URLs into the URL box:
NA12878_CBW_chr1_R1.fastq.gz
http://cbwxx.dyndns.info/module2/NA12878_CBW_chr1_R1.fastq.gz
NA12878_CBW_chr1_R2.fastq.gz
http://cbwxx.dyndns.info/module2/NA12878_CBW_chr1_R2.fastq.gz
hg19_chr1.fa
http://cbwxx.dyndns.info/module7/hg19_chr1.fa
dbSNP_135_chr1.vcf.gz
http://cbwxx.dyndns.info/module2/dbSNP_135_chr1.vcf.gz
Note: xx is your student number.
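To avoid typing errors when substituting the student number, a short shell sketch such as the following prints the four URLs ready to paste. The function name `galaxy_urls` and the example number `07` are illustrative assumptions; use the number assigned to you in class, written the same way it appears in your instance's host name.

```shell
# Print the Galaxy upload URLs for one student instance.
# Assumption: the student number replaces "xx" verbatim (for example 07).
galaxy_urls() {
    student="$1"
    base="http://cbw${student}.dyndns.info"
    for path in \
        module2/NA12878_CBW_chr1_R1.fastq.gz \
        module2/NA12878_CBW_chr1_R2.fastq.gz \
        module7/hg19_chr1.fa \
        module2/dbSNP_135_chr1.vcf.gz
    do
        printf '%s/%s\n' "$base" "$path"
    done
}

galaxy_urls 07
```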
Galaxy workflow part 1 (cloud): Galaxy-Workflow-CBW Galaxy lab part1 Alignment Variant calling.ga
Galaxy workflow part 2 (main instance): Galaxy-Workflow-CBW Galaxy lab part2 VariantFiltration Annotation.ga
What you need for the lab:
- Galaxy public server: https://usegalaxy.org/
You will need an account on Galaxy so that you can run tools in their environment.
Galaxy Resources:
* galaxyproject.org: Galaxy home page
* usegalaxy.org: main Galaxy public server
* getgalaxy.org: source for installing local Galaxy
* usegalaxy.org/cloud: use Galaxy in the cloud
* Example of a Galaxy pipeline (we used it for an RNASeq lab last year). Save file as: Galaxy-Workflow-Module_5_workflow_from_Emilie_Chautard_and_Francis.ga
* Galaxy 101 worked example
* Galaxy servers throughout the world
* Published (read: Public) pages
Tips, tricks, and resources
Data Sets from Entire Workshops
* Reference Genome for HT-seq
* Module 2/7/HT-seq Integrated Data
* Module 5 Data
* Module 6 Data
Results from Instructor’s Instance on Amazon
* Module 2 result
* Module 5 result
* Module 6 result
Tools with installation instructions on our Amazon server
Instructions for installing the tools used in the workshops can be found here.
Launching CBW AMI
Steps to launch CBW public AMI
- AMI ID: ami-b9a253d2
- AMI Name: CBW workshops 2015